UIUCDCS-R-76-831

CHARACTERIZATION OF A DISTRIBUTED DATA BASE SYSTEM

BY

ENRIQUE GRAPA

October 1976

Department of Computer Science
University of Illinois
Urbana, Illinois 61801

This work was supported in part by the Department of Computer Science, the Center for Advanced Computation and the Command and Control Technical Center and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, at the University of Illinois.

ACKNOWLEDGMENTS

I am greatly indebted to Geneva Belford for her guidance and immeasurable patience during the preparation of this thesis. My sincere appreciation to Professor Dan Slotnick for his continuous encouragement and for his revision of the different stages of this work. I would like to express my appreciation to Professor Peter Alsberg for reviewing this thesis, and for making it possible by accepting me as a member of his research group. I am very grateful to the directorate of the Instituto de Investigacion en Matematicas Aplicadas y Sistemas (IIMAS) of the National University of Mexico (UNAM) for opening the doors to my graduate studies by commissioning me to this University. Dr. Renato Iturriaga's advice and encouragement during the early stages of my graduate studies is very deeply and warmly appreciated. My unrestricted thanks go to Pamela Hibdon for the excellent and careful typing job that she performed and to Greg Ives who made all the drawings. Last, but by no means least, I am very deeply indebted to my mother, friends and relatives for their patience and support, especially to my wife Rebeca and children Arie and Nurit whose mere existence and love fuels my motivation.

TABLE OF CONTENTS

Chapter
I. INTRODUCTION
II. WHY IS A DISTRIBUTED DATA BASE SYSTEM USEFUL?
  1. Introduction
  2. Network economies
  3. Availability
    1. Introduction
    2. Chu's work
    3. The work of Belford et al.
    4. Discussion
  4. Response time
III. EXISTING MODELS OF DISTRIBUTED DATA BASE SYSTEMS
  1. Introduction
  2. Johnson's model
    1. Context
    2. Synchronization mechanism
    3. Operations available
    4. Miscellanea
  3. Bunch's model
    1. Context
    2. Synchronization mechanism
    3. Operations available
    4. Miscellanea
  4. The Reservation Center model
    1. Context
    2. Synchronization mechanism
    3. Operations available
    4. Miscellanea
IV. EVALUATION OF THE MODELS
  1. Generalities
  2. Evaluation of Johnson's model
  3. Evaluation of Bunch's model
  4. Evaluation of the Reservation Center model
  5. Comparing the models in a similar environment
    1. Introduction
    2. The least-common-denominator environment
    3. How good are the models during normal operation?
      1. How well do they maintain "real" update order?
      2. How fast are they?
        1. Local application delay
        2. Non-local application delay
        3. System's throughput
    4. How good are they during failures?
      1. Fatal failures
      2. Non-fatal failures
  6. Overall power of the models
    1. Concepts and definitions
      1. Notation and basic definitions
      2. Operation power
      3. Theoretical systems
    2. Evaluation of the feasibility of the addition of operations to a non-primary model
      1. Extension of operations in a non-primary model
      2. Summary
  7. Conclusion
V. SOME EXTENSIONS TO THE MODELS
  1. Clock synchronization for Johnson's model
    1. General discussion
    2. An algorithm for synchronizing the clocks
    3. General remarks
    4. Summary
  2. Delay pipeline
  3. Resiliency
    1. Introduction
    2. The broadcast model
    3. Other models
  4. Conclusion
VI. REVIVAL OF THE FILE ALLOCATION PROBLEM
VII. CONCLUSIONS
Appendix
  A. FILE ALLOCATION
    1. The problem
    2. Casey's model and theorems
    3. Other related work
    4. How critical is the evaluation problem?
    5. Our contribution
    6. A program to search for an optimum in Casey's model
    7. Restricted environment
    8. Suboptimal search algorithm
    9. Conclusion
  B. PROOF OF CONDITIONS 1, 2, AND 3 OF APPENDIX A
  C. PROGRAM TO SEARCH THE COST TREE
  D. PROOF OF CASEY'S THEOREM FOR A PRIMARY SCHEME
LIST OF REFERENCES
VITA

Chapter I. INTRODUCTION

The literature is full of papers describing the advantages of distributed data base systems. Studies to determine the location of different copies of such a data base are abundant as well. All of these are based upon models with unproven workability. Strangely enough, no distributed data base system is available and efforts to determine the building parameters of such a system have been feeble. The main goal of this thesis is to obtain a characterization of a workable distributed data base system in order to provide appropriate tools for a future implementation.

The literature uses the term distributed data base system for different concepts. It is thus necessary to clearly define the phrase. For the purpose of this thesis, a distributed data base system is characterized by:
a) multiple copies of portions of the data base distributed among various sites of a computer network;
b) geographic separation of the participating computer systems with a limited communication bandwidth between them; and
c) co-equal roles for the systems involved (i.e., similar software at each computer rather than a single piece of software distributed throughout the network).

Distributed data base systems use a network as the underlying hardware facility rather than a single computer. This difference impacts the behavior of a data base system. In a network it can easily be the case that two or more copies of a data base are cheaper than a single copy. By going to multiple copies we can reduce query transmission cost. This saving can offset the additional costs incurred by the additional copy (storage, etc.).

Related literature occurs in several areas: work on optimal location, work on multiple copies, or work on both combined. In the first category are papers (such as [A1]) dealing with the determination of the optimal site for a given task. In the second category ([C5], [B2]) are studies indicating the advantages of multiple copies over single copies according to some specific criteria (e.g. availability). Both issues are combined in the file allocation problem. In the file allocation literature [e.g., C1, C2, C3, C4, etc.] we find that the need for multiple copies is used to justify the search for an optimal number and location of copies of a data base.
Alsberg [Al] has shown that in heterogenous networks (specifically in the ARPA network) it is quite possible that substantial savings might be ob- tained by using the proper computer; i.e., that in many situations transmission costs can be more than offset by the savings in time and money produced by the use of the right computer. This opens the door to additional possible savings in a distributed data base system by maintaining different organizations or data at each site [A2]. Even without an economic rationale, a distributed data base system could be dictated by requirements imposed on the system's design. In many applications availability or response time are design constraints. Putting a single copy of a data base at the best site might improve availability and response, but only up to a certain point. Beyond that point distributed data base systems (including multiple copies) are required. In the file allocation literature (e.g. [CI]) we find various examples in which multiple copies could produce transmission cost savings (see section II. 2) that will offset the additional cost of an extra copy. Thus, in a search for the system with the lowest operational cost, we might obtain an unexpected surprise. It could well be that besides improving costs we will improve avail- ability and response time as well. Considering these comments, we should expect a literature full of specifications for working distributed data base systems; this is hardly the case. There exists a fair number of papers describing different ways in which copies of a file can be allocated in the network to improve operational costs [CI, C2, C3, C4, C5, L2, SI, Ul]. However the emphasis of these papers is on the mathematical side. The authors fail to justify the workability of the distributed data base models they use. Casey [CI], for example, assumes that every site is a generator of updates, which are broadcast to all sites that maintain a copy of the data base. In general, this approach affects seriously the consistency and validity of data in the different data base copies and should be either rejected or modified. The literature shows that this has not been done. Moreover, Casey's work has triggered some extensions [L2, Ul] based on the same unsafe approach. Only one research group outside the University of Illinois has studied the actual implementation of a distributed data base system. Johnson and coworkers [Jl, J2] have presented a model (see chapter III) that attacks the synchronization problems. The discussion in chapters IV, V and VII will indi- cate the limitations of this model. Chapter II will be more specific about the advantages of a distributed data base system. It will cover the available literature that helps to support the usefulness of such a system. Chapter III will discuss the available models for a distributed data base system, including one of our own (the Reservation Center) . Chapter IV will evaluate the pros and cons of these models and will pinpoint some flaws. The major flaws will be discussed in chapter V, in which modifications of the models to solve some of the problems will be presented. In I ,C c I ■ 2 chapter VI we will return to the file allocation problem and make an a posteri- ori evaluation of the validity of available models. The presentation of some personal contributions in this area will be shared between this chapter and the appendices. Finally, chapter VII will present our conclusions. 
The numbering scheme for figures, tables and formulas includes the section number in which they appear as a prefix. I.e., figure II.3-a can be found in section II. 3, etc. Figures have letters of the alphabet for a suffix. Tables use arabic numbers. Formulas are indicated by small roman numbers. (For example, we refer to table II. 2-1, figure III.l-c, and expression II.3-ii.) Chapter II. WHY IS A DISTRIBUTED DATA BASE SYSTEM USEFUL? II. 1 Introduction In previous comments, we have mentioned that a distributed data base system could be cheaper and better from the standpoint of availability and response time. In this chapter the available literature is reviewed. This helps to stress the potential usefulness of a distributed data base system. At the same time, this review puts into proper perspective such subjective terms as "cheaper" and "better". For example, "cheaper" is shown to be related only to operational costs. Research and implementation costs (which are clearly nontrivial, since no distributed data base system actually exists) are completely neglected. This chapter is divided into three technical sections: 11. 2 network economies, 11. 3 availability, and 11. 4 response times. In section II. 2 we will discuss the economies that could be obtained by the proper use of network facilities. An important part of this section will be the consideration of the file allocation problem; i.e., where to allocate copies of a file in order to minimize operational costs. In the last two sections we will refer to the existing literature to show clearly how avail- ability and response time are improved by having multiple copies. II. 2 Network economies A distributed data base system could offer interesting economies by capitalizing on resource variety and/or by the proper exploitation of the topology of the network. 9 * w <• -» > ! 2 S „ i 9B Our first point is justified by Alsberg's work [Al]. He made an experiment in which different types of programs were submitted to the different facilities in the ARPA Network. The result was that there was often a cost difference of one or two orders of magnitude between the best and the worst computer system. Furthermore, the roles of best and worst shifted as the type of program was changed. All the systems studied were among the best for some types of programs and among the worst for others. This result strongly moti- vates the study of optimal locations for a given data base, since savings can obviously be obtained. Moreover, if we have a multiple copy distributed data base system, there is no restriction on the way the information is locally organized. Thus similar savings could be obtained by relaying a data base operation (query) to the site with the best organization to treat it [A2]. Considering the potential that this area offers, we find it surprising that, as far as we know, no additional work has been done. This deficiency might be explained by the specificity of the required research. That is, one has to get involved with specific models of computers in a specific network. The multi- plicity of models and operating systems makes this area not very attractive for research. Another way of profiting from the existence of a variety of resources is by load sharing. An overloaded computer could share part of its load with an underutilized one. This generally improves the overall throughput of the net- work. However, the effect of load sharing in network economies is indirect. 
Transferring a job to a second computer with unknown qualities does not guarantee a cheaper cost; quite the contrary. However, the best utilization of the network could permit higher throughput rates and probably cause lower cost per unit of service. Further discussion of this topic is given in section II.4.

A second source of network economies is the exploitation of the topology of the network. The network may have sites which have different query and update requirements for the use of a given file. It is clear that the topology will be important to the decision as to where file copies are placed and how many copies are needed for maximum economy. This problem, known as the file allocation problem, has been studied by various researchers [e.g., C1, C2, C3, C5, L2]. The remaining part of this section will be dedicated to a demonstration of the potential savings that a good multiple-copy file allocation could effect. We hope that this will stimulate the reader's desire for an operational distributed data base system. A more comprehensive discussion of the file allocation problem may be found in appendix A.

In general the formulation of a file allocation problem includes two types of transactions: updates and queries. The traffic required for such transactions through a given network depends (among other things) on the number of copies of the files. Since a given update must be seen by (i.e. sent to) all the copies of a file, the more copies we have, the higher the update traffic will be. This traffic will be minimized if there is only one copy of each file. On the other hand, the addition of new copies of a file tends to reduce the network query traffic, up to the point where every site has its own copy of the data base and responds to its queries locally, so that there is no network query traffic. Clearly there is a tradeoff. A simplified example follows.

Let's assume that we have two sites, A and B, that use a given data base system. It costs A the quantity Q_AA to process its queries locally and Q_AB to process them at B. A price U_AA must be paid by A to locally process the updates it generates. A price U_AB must be paid to send them to B and process them there. The storage cost to locate the data base at A is σ_A. In a similar fashion the quantities Q_BB, Q_BA, U_BB, U_BA and σ_B are defined.

If there is a single copy of the data base at A, we will have to pay Q_AA + U_AA + σ_A to satisfy A's needs and Q_BA + U_BA to satisfy B's. If we decide to allocate a second copy to B we would probably save by being able to solve B's access needs locally (i.e., Q_BB instead of Q_BA) but we would then have to pay for additional storage (σ_B) and transmission of updates (U_AB + U_BB). Thus the cost shifts from

  Q_AA + U_AA + σ_A + Q_BA + U_BA

to

  Q_AA + U_AA + σ_A + Q_BB + U_BB + σ_B + U_AB + U_BA

as we move from one copy to two copies.

Comparing the above formulas term by term, we see that both share the Q_AA + U_AA + σ_A + U_BA term, but the one-site allocation has an additional Q_BA term and the two-site allocation an extra Q_BB + U_BB + σ_B + U_AB term. Thus, the two-site allocation is cheaper if Q_BB + U_BB + σ_B + U_AB < Q_BA (e.g., if Q_BA ≥ 10 Q_BB, Q_BA ≥ 10 U_AB and Q_BA > 10 U_BB). Under these assumptions, the additional cost of the two-site allocation (Q_BB + U_BB + U_AB + σ_B) is less than 0.3 Q_BA + σ_B. Then the two-site allocation is cheaper than the single-site one if 0.3 Q_BA + σ_B < Q_BA; i.e., if the extra storage cost σ_B is less than 0.7 Q_BA.

Chu [C5] formulates the transmission cost as the sum of a transaction (query) component and a modification (update) component:

  C_transmission = C_transactions + C_modifications

  C_transactions = Σ_{i,j,k} C'_ki λ_j u_ij X_kj (1 − X_ij)

  C_modifications = Σ_{i,j,k} C'_ki λ_j u_ij X_kj P_ij

where:
X_ij = 1 if the j-th file is stored in the i-th computer and 0 otherwise; i.e., the X_ij's are the zero-one variables,
L_j = length of the j-th file,
C_ij = storage cost per unit length of the j-th file at the i-th computer,
r_j = number of copies of the j-th file (which may be computed from availability constraints; see the discussion of r in section II.3.2),
C'_ki = transmission cost from the k-th computer to the i-th computer per unit file length,
λ_j = length of each transaction for the j-th file (i.e., amount of data to be shipped),
u_ij = the request rate for the entire or part of the j-th file by the i-th site per unit time, and
P_ij = the fraction of the time that a transaction for the j-th file by the i-th site is a modification (i.e., that the data must be shipped back).

This formulation follows a series of arguments and assumptions that we won't discuss here. Chu's approach leads to a non-linear zero-one integer program. However, he demonstrates how this problem can be linearized at the expense of increasing the number of variables. Chu's approach is very sound mathematically, but in general, as Levin [L2] points out, the solution of a moderately sized network problem is computationally very expensive.

A third alternative to Chu's and Casey's methods is presented by Levin [L2]. Levin's work is closely related to Casey's but with a few major distinctions. He considers the allocation of programs as part of the optimization mechanism. He also considers situations in which query and update patterns are not known a priori and an adaptive system is required.

Our main intention in the latter part of this section has been to demonstrate that a multiple-copy allocation can in fact be cheaper than the single-copy alternative. However this evaluation is dependent on the way we look at the file allocation problem, i.e. the model used. Three models were discussed to show the flexibility available. More details on file allocation are presented in appendix A and in [B4].

II.3 Availability

II.3.1 Introduction

Chu [C5] and Belford et al. [B2] have studied the effect that multiple copies of files in a distributed data base system have upon the availability of a given file. Belford's results are more recent and much more complete than Chu's. Because of the completeness of reference [B2], our discussion below will follow the treatment there fairly closely. The term availability will be used "to mean the fraction of time that a data base is available to respond to user requests or queries" [B2].

II.3.2 Chu's work [C5]

Chu presented his study embedded in the development of his zero-one integer programming approach to the file allocation problem. Chu's intention was to obtain a formulation that could allow him to include availability as one of the constraints of his integer programming problem. He assumes that:
1. all computers in the network have an availability of A = up-time/total-time = up-time/(up-time + down-time),
2. the network is completely connected (i.e. there exists a direct communication channel between each pair of computers), and
3. all communication channels have an availability of C (defined similarly to A).

Under these assumptions, Chu finds that the availability of a given file with r copies (at different sites) would be

  A(1 − (1 − AC)^r).    (II.3-i)

Thus, the availability constraint could be included as a constraint A(1 − (1 − AC)^r) ≥ a for a given a.
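As an illustration, here is a minimal sketch (ours, not Chu's; Python, with hypothetical names) of how expression II.3-i can be evaluated and how the smallest r meeting a target availability a could be precomputed:

```python
def availability(A, C, r):
    """Expression II.3-i: availability of a file with r remote copies,
    as seen from a site with no local copy (computer availability A,
    channel availability C)."""
    return A * (1.0 - (1.0 - A * C) ** r)

def min_copies(A, C, a, r_max=20):
    """Smallest r such that A(1 - (1 - AC)^r) >= a, or None if the
    target a is unreachable (the expression is bounded above by A)."""
    for r in range(1, r_max + 1):
        if availability(A, C, r) >= a:
            return r
    return None

# Example with the parameters used later in the text:
# one hour of down-time per day (A = 23/24) and C = 0.99.
print(availability(23/24, 0.99, 1))    # ~0.909
print(min_copies(23/24, 0.99, 0.955))  # 2
```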
The value of r could easily be precomputed and presented to the integer programming problem as a lower bound on the number of copies for the given file.

The formula II.3-i can be obtained by noticing that:
AC is the availability of a computer and the channel that leads to it; i.e., the probability that we can establish a direct connection to a given site;
(1 − AC) is the probability that the site is not available to us;
(1 − AC)^r represents the probability that r sites are not available to us;
1 − (1 − AC)^r denotes the probability that at least one of r sites is accessible; and finally
A(1 − (1 − AC)^r) indicates the probability that a given user who uses a computer with availability A (and does not have a local copy) is able to access at least one of a given set of r computers.

As an example of the utilization of the expression II.3-i, let's assume that we have a network of computers with a down-time of one hour per day (A = 23/24 = .9583) and channels with a .99 availability. Expression II.3-i will tell us that the worst-case availability (no local copy) is .9092 for r=1, .9558 for r=2 and .9583 for r=4. This implies that if we have a single copy we should expect about 65.4 hours of down-time per month, about 31.8 hours for 2 copies and 30.0 hours for 4 copies. These down-times are clearly bounded below by the down-time implied by A itself; i.e., by 30 hours per month. It is not hard to imagine a context in which an improvement from 65.4 to 31.8 hours of down-time per month would be highly desired.

Some of the problems with Chu's approach are:
a. Chu seems to assume in II.3-i that the user's site does not have a local copy. No explanation is given for this assumption. If the user is allowed to have a local copy his availability will be A. Thus expression II.3-i could be interpreted as a lower bound for the availability, since A ≥ A(1 − (1 − AC)^r). This seems to be a valid interpretation. If we are interested in average availability we could transform the expression II.3-i to

  A(1 − p)(1 − (1 − AC)^r) + pA    (II.3-ii)

where p is the proportion of the users that have a local copy. Since p is dependent on r, it might be difficult to determine an r from a given value for the expression II.3-ii.
b. Chu's approach implicitly assumes that the files are static or, if dynamic, that they are updated simultaneously. In this circumstance, when a site fails we can simply switch to another one and continue our work. This is hardly the real case, and an additional recovery time might be required.
c. The assumed topology (complete connection) and routing technique (direct communication) are the easiest alternatives available. These assumptions, as well as that of the uniform behavior of communication channels and computers, give rise to the simple parameters A and C. In real life this might be an oversimplification. The study of the multiple elements hidden behind A and C (routing, data communication processors, etc.) is an interesting topic.

An extension of Chu's model may be found in [B2, appendix 1]. Different values for the availability of different computers are allowed, and only slightly more complicated expressions are obtained.

II.3.3 The work of Belford et al. [B2]

The approach of Belford and co-workers is much more realistic than Chu's. Rather than assuming that files are static or that updates are applied simultaneously, they assume that there exists a single master copy and all other copies are treated as backups. All updates are applied to the master copy as soon as possible.
Updates are applied to the backups according to one of the following methods:
1. Running spares. Backup sites apply the updates almost as fast as the master, and
2. Remote journaling. Backup sites periodically receive up-to-date copies of the master copy. Between these times, updates are only journalized at the backup site (i.e., only the master copy processes the updates).

Some of the results of the study indicate that, from the availability point of view, the usefulness of method 2 is a function of how long it would take to bring the backup copy up to date. If the backup is stored on tape then availability would only be slightly improved, and the additional work would hardly be justifiable on the basis of the availability improvement. (Actually this kind of backup is used to prevent data base destruction.) This is simply because by the time we get the backup copy in a usable state the primary copy is likely to be up again. It is important to note that remote journaling over a network is not feasible on an extended basis. Moderate-sized files would take hours to transmit through the type of communication lines currently used. For example, in the ARPA Network a moderate-sized file could take more than an hour and a half to transmit.

With availability in mind, we thus turn to method 1. There are various possible strategies as to what to do when the primary fails. If we allow a backup copy to take over the primary's role only while the primary is down, we end up with what Belford and co-authors present as strategy 2. For this strategy the availability is expressed by

  A_2 = 1 − (D + L + kY) / (F + X + kX)

where:
F = mean time between computer failures, assumed to be the same for all host computers,
X = expected time to repair the computer,
L = expected time to load the data base copy at the remote site,
Y = time that the audit trail of updates has been growing (i.e., time since the copy was correct),
k = the ratio of update arrival rate to update processing rate, and
D = time delay between when the master fails and when the remote site determines this fact and starts to get its copy ready for use.

Using the same notation, we have that the single-copy availability should be

  A_o = F / (F + X + kX).

If we study the improvement (I) that we get by using A_2 over A_o (i.e. I = (A_2 − A_o)/A_o) we have that

  I = (X + kX − D − L − kY) / F.

Assuming some reasonable parameters (k = .01, D = .01 hrs., X = 1 hr., L = .5 hr. and Y = F) we obtain figure II.3-a (figure 3 in [B2]). In this figure we have plotted I as a function of F (i.e. I = 0.5/F − .01). The effect on A_o of varying F has been plotted as a reference.

Figure II.3-a  Single-site availability A_o and fractional improvement I through use of strategy 2. Parameters are k = 0.01, D = 0.01 hr., X = 1 hr., L = 0.5 hr., and Y = F.

One should not be misled by the small relative values of I. For example, for F = 20 the improvement is .015. This implies that the down-time would change from 35 to about 24 hrs/month if we choose to have more than one copy.
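To make the arithmetic above easy to check, here is a minimal sketch (ours, not Belford's; Python, with hypothetical names) of strategy 2's availability and the resulting improvement:

```python
def single_copy_availability(F, X, k):
    """A_o = F / (F + X + kX): up-time fraction with a single copy."""
    return F / (F + X + k * X)

def strategy2_availability(F, X, k, D, L, Y):
    """A_2 = 1 - (D + L + kY) / (F + X + kX): a backup takes over
    only while the primary is down."""
    return 1.0 - (D + L + k * Y) / (F + X + k * X)

# Parameters of figure II.3-a: k = 0.01, D = 0.01 hr, X = 1 hr,
# L = 0.5 hr, Y = F; all times in hours.
F = 20.0
A_o = single_copy_availability(F, 1.0, 0.01)
A_2 = strategy2_availability(F, 1.0, 0.01, 0.01, 0.5, F)
I = (A_2 - A_o) / A_o                     # ~0.015 for F = 20
print(720 * (1 - A_o), 720 * (1 - A_2))   # ~35 vs ~24 hrs of down-time/month
```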
In general the work of Belford and co-workers is much more comprehensive than Chu's but it is still not a complete solution. Some of the missing topics are:
1. Inclusion of the communication channel in the formulations. The decision to ignore communication channels is probably based on studies of the ARPA Network [F1]. More than one channel failure is required to block the transmission of a message through the ARPA Network, making the availability of Chu's communication channel essentially equal to one. It could be useful to show convincingly that this logical omission is actually correct.
2. Communication processors. ARPANET experience indicates that the time between failures of network communication processors (IMP's) is much longer than that of the computers (hosts) they serve. Thus, it is reasonable to neglect the communication processors in an ARPANET-like environment, but probably not in general.
Due to these two omissions, everything related to network topology, routing and connectivity was ignored.

II.3.4 Discussion

Our main interest in availability is to prove that it is improved by a distributed data base management. Both studies in the literature give us the necessary tools to measure availability for the models assumed. Armed with these tools, we could easily compute the number of copies that our distributed data base system should have.

Both studies neglect to consider (and we will do so too) the inverse problem. Given a certain number of copies, what should be the best allocation to maximize availability? The answer to this question, although interesting, is tangential to our study and won't be considered. What we were looking for was to study the impact of multiple copies on data base availability. The discussion and examples presented clearly justify a distributed data base system when availability is at stake.

II.4 Response time

Like availability, response time might be given as an a priori design restriction. It might be one of the factors taken into account in a formulation of the optimal file allocation problem or it might be studied as an independent topic.

Chu [C5] has adopted the first approach. In the formulation of his integer programming solution for the file allocation problem, he introduces a set of restrictions:

  X_kj · a_ikj ≤ T_ij    for i ≠ k, 1 ≤ j ≤ m,

where:
X_kj is 1 if and only if the j-th file is stored in the k-th computer, and is 0 otherwise,
a_ikj is the expected time for the i-th computer to retrieve the j-th file from the k-th computer, and
T_ij is the maximum allowable retrieval time of the j-th file by the i-th computer.

(This simple formula requires the assumption that there is only one copy of each file in the system.) Chu uses a simple model and queueing theory to obtain an approximation to a_ikj.

We can treat response time as the objective function in the file allocation optimization. For example, for a given availability (i.e. for r copies of the file), we could find the allocation that minimizes response time. There are no reports of this approach in the literature.

There is only one contribution to the generalized study of response time in a distributed data base system. Response time can clearly be improved by having as many copies of the data base as possible. It can also be improved by locating these copies at appropriate sites. Finally, it can be improved by load sharing. Belford et al. [B1] developed a mathematical model for response time in a distributed data base environment. The concern was to establish when response time can be improved by sharing parts of the load of a given site. If the local load satisfies a given formula, sharing is initiated. For simplicity, it was assumed that all queries arise at a single site. Different cases are considered, such as uniform or non-uniform distribution of excess queries to the remote sites.

Distributed data bases will unquestionably improve response time.
This improvement can be produced by proper choice of file allocation, load sharing, or simply by having a multiplicity of copies of the data base.

Chapter III. EXISTING MODELS OF DISTRIBUTED DATA BASE SYSTEMS

III.1 Introduction

Once we are convinced (by the previous chapter) that implementing a distributed data base system is a sensible thing to do, we have to decide how to do it. Casey's work [C1] assumes that queries and updates should be handled as described in figure III.1-a. Queries are sent to the closest site that possesses a copy of the data base and updates to every site that maintains a copy. ("Closest" may be defined broadly in terms of lowest cost.) Unfortunately this simple approach does not work in general. The reason is that we have problems in synchronizing our various copies of the data base. The synchronization problem will be best explained by the presentation of the following three examples.

Figure III.1-a  Illustration of Casey's model for treatment of queries and updates. U is the user site while DBM is a data base manager with a local copy under its supervision.

EXAMPLE 1. No Problem. We look at two sites, A and B, maintaining copies of the data base (see figure III.1-b). Two additional sites (they could also be A and B) generate an update each. The updates are nearly simultaneous and both affect the same field x. The first one adds 1 to x, the second one adds 2. Because of the network topology and delays, it turns out that the updates arrive at B precisely in the opposite order that they follow for A; i.e., A first adds 1 to x and then 2 more; B first adds 2 to x and then 1 more. No problem is evident in this example.

Figure III.1-b  Graphical explanation of what occurs in example 1 (A: x' = x+1+2; B: x' = x+2+1). The dashed lines indicate later arrival due to delays.

EXAMPLE 2. Numeric Inconsistency. We now change the updates to read "increase x by 10%" and "add 10 to x" (see figure III.1-c), leaving everything else the same. The results are that A and B won't agree any more. The value for x at site A will be 1.1x + 10 while for B it will be 1.1x + 11 (= 1.1(x+10)).

Figure III.1-c  Graphical explanation of example 2 (A: x' = 1.1x + 10; B: x' = 1.1x + 11). The dashed lines indicate later arrival due to delays.

EXAMPLE 3. Logic Inconsistency. If we change the updates once again to "change all entries with a w in field x to z" and "change all entries with a w in field x to y" (see figure III.1-d), we get involved in a logic inconsistency. Site A, upon arrival of the first update, will change field x to z and won't be able to recognize or service the second update (change field x to y). A similar problem will arise at site B and cause the undesired consequence: logic inconsistency.

Figure III.1-d  Graphic description of example 3. The dashed lines indicate later arrival due to delays.

We therefore must discard Casey's approach, at least in its primitive state, and start looking for some alternatives. In the present chapter we will present three models formulated with this purpose in mind. These models are: a) Johnson's model (section III.2), b) Bunch's model (section III.3), and c) the Reservation Center model (section III.4). The relative advantages and disadvantages will be the subject of the next chapter.
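As a small illustration (ours, not Casey's) of why these orderings matter, the two updates of example 2 do not commute, so the order in which A and B apply them determines the final value:

```python
# Example 2: the two updates applied in each order (hypothetical x = 100).
inc_10pct = lambda x: 1.1 * x
add_10    = lambda x: x + 10

x = 100
print(round(add_10(inc_10pct(x)), 2))   # order seen at A: 120.0
print(round(inc_10pct(add_10(x)), 2))   # order seen at B: 121.0 -> copies diverge
```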
III.2 Johnson's model

III.2.1 Context

Johnson's model is described in two papers [J1] and [J2]. This model postulates the existence of data base managers as interfaces between users and data bases. Each data base manager maintains a local copy of the full data base.

Johnson's data base consists of a collection of entries (or records). Each entry is defined as a five-tuple (ℓ,V,F,CT,T) where:
ℓ is a selector (or location),
V is the associated value,
F is a deletion flag (see section III.2.3),
CT is the creation time-stamp (see section III.2.3), and
T is the time-stamp of the last operation which modified the entry (see section III.2.2).
Thus each data base entry contains only one data item (V) of unspecified characteristics. The selector is used to identify the item; only one item is identified by each selector. We can think of the selector as the address of the data item.

Updates are transmitted from one data base manager to another by sending a new data base entry. Thus updates and data entries have the same format.

III.2.2 Synchronization mechanism

Inter-manager synchronization is obtained by using the time-stamp component (T) in the updates. A time-stamp is defined as a pair (τ,D) where τ is the time and D is the identification of the site where the entry or update originated. It is thus possible to establish a unique order of precedence for the 5-tuples. A time-stamp T1 = (τ1,D1) is said to precede (or "be older than", or "be smaller than") another time-stamp T2 = (τ2,D2) if and only if τ1 is smaller than τ2; or, if τ1 = τ2, if D1 is smaller than D2. A 5-tuple precedes another 5-tuple if its time-stamps do so. Two 5-tuples with the same time-stamp are considered to be the same 5-tuple. Thus only one update per time-instant is allowed.

For example, let us assume that we have two updates, u1 and u2, generated within a short interval of time. Update u1 originates at site D1 at time τ1 (say 10:00) and is designed to modify location ℓ to value V1. Update u2 enters the system at site D2 when the clock indicates τ2 (say 10:01) and is intended to assign value V2 to the same location ℓ. Data base managers A and B receive these updates in a different order (see figure III.2-a). If each data base manager applies the updates in the order it gets them, we end up with an inconsistency. That is, the data base at site A ends up with V2 at ℓ while site B has V1. However, the time-stamps could be used to detect this situation. When u1 arrives at site B, the data base manager could determine that u1 is older than u2 and discard it. Both data base managers would then end up with the same value (V2) for location ℓ.

Figure III.2-a  Update transmission for the example in the text (assign V1 to ℓ at 10:00; assign V2 to ℓ at 10:01). Dashed lines indicate later arrival.

III.2.3 Operations available

Four operations are defined:
a) Selection. Given a selector, the current associated value V is returned.
b) Assignment. A selector and value are given and the given value replaces the old value associated with the selector.
c) Creation. A new selector and an initial value are given and a new entry is added to the data base.
d) Deletion. A selector is given and the existing data base entry indicated by the selector is removed from the data base.

We next discuss the problem of incorporating these operations into the data base scheme described above.
As we shall see, there is considerable variation in the difficulty of handling these operations.

Selections correspond to queries in the terminology that we have been using. A selection transmits a selector (ℓ) to one of the data base managers, which in turn responds by sending the associated value (V) in the 5-tuple indicated by ℓ.

Assignments are the basic type of update considered in this model. They are expected to be transmitted to every data base manager. An assignment is designed to replace the old data base entry for the same selector by a new 5-tuple. The new 5-tuple differs from the old one in the V and T elements. V indicates the new value that should be associated with the given selector and T indicates the time-stamp of the update (i.e., the time it occurred and the manager which created the update).

When an update arrives, the manager will decide which of the two 5-tuples for the same selector should remain in the data base (i.e., the one that is already there or the incoming update). This decision will be based upon the time-stamps. The one with a more recent time-stamp will become the new data base entry. If it is the 5-tuple that was already there we simply ignore the incoming update; otherwise the update's 5-tuple becomes the data base entry and is properly stored. For example, in our last example data base manager A will have the 5-tuple (ℓ,V1,F1,CT1,(10:00,D1)) stored in the data base when the 5-tuple (ℓ,V2,F2,CT2,(10:01,D2)) arrives. Data base manager A will then store the later 5-tuple in its data base. On the other hand, data base manager B will have the 5-tuple (ℓ,V2,F2,CT2,(10:01,D2)) stored in its data base, and will ignore the u1 update because it is too old. (We have thus obtained inter-data base consistency.)

Creation can in general be handled in two ways: explicitly or implicitly. Explicit creation requires that a specific create message be provided in order for a data base manager to create a new record. A problem with such a method is that, due to network characteristics, a given site could legally get updates for locally nonexistent entries. (For example, A creates x and notifies B. B creates x and updates it. C gets B's update before A's creation.) Implicit creation means that if an assignment for an unknown entry is received, the given entry will be created. Thus creations are transmitted as assignments. Implicit creation avoids the problem described for explicit creation. However, a new problem appears: transmission or other errors could produce undesired creations.

Johnson's model uses implicit creation. In this model creations are treated as assignments with one small difference. We delayed until now the discussion of the CT element in the 5-tuples. The CT element corresponds to the creation time-stamp and has the same format as any ordinary time-stamp. A creation entry looks like an assignment; the initial value is stored in V. In addition to this, the time-stamp T is duplicated into CT. In an update, CT must contain the time when the entry was created. Therefore the data base manager can distinguish between creations and updates by whether or not T=CT. (The reasoning behind the CT element will become clearer when we discuss the delete operation.) Functionally, creations are treated as assignments; i.e., there must be the same conflict solving based upon T, etc.

The most troublesome of the operations is the deletion.
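Before turning to deletion, here is a minimal sketch (ours, not Johnson's; Python, with hypothetical names) of the entry format and of the time-stamp rule by which a manager decides whether an incoming assignment, or implicit creation, replaces its stored 5-tuple:

```python
from dataclasses import dataclass

@dataclass
class Entry:                 # Johnson's 5-tuple (l, V, F, CT, T)
    selector: str            # l : location of the data item
    value: object            # V : the associated value
    deleted: bool            # F : deletion flag
    created: tuple           # CT: creation time-stamp (time, site)
    stamp: tuple             # T : time-stamp of the last modification

def precedes(t1, t2):
    """Time-stamp order: earlier time first; ties broken by site id."""
    return t1 < t2           # tuple comparison of (time, site)

def apply_assignment(db, update):
    """Keep whichever 5-tuple for the selector carries the newer stamp.
    An unknown selector is created implicitly."""
    current = db.get(update.selector)
    if current is None or precedes(current.stamp, update.stamp):
        db[update.selector] = update     # incoming 5-tuple wins
    # otherwise the incoming update is too old and is ignored

# Usage, following the example in the text (10:00 and 10:01 encoded as
# 1000 and 1001): manager B already holds u2 and then receives u1.
db_B = {"l": Entry("l", "V2", False, (1001, "D2"), (1001, "D2"))}
apply_assignment(db_B, Entry("l", "V1", False, (1000, "D1"), (1000, "D1")))
print(db_B["l"].value)       # still "V2": the older update u1 is discarded
```

A full manager would also have to consult F and CT to handle deletions and re-creations, which is exactly the complication taken up next.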
Deletion cannot be allowed to produce the desired effect of immediate removal from the data base, because pending updates (with older time-stamps) could incorrectly cause the entry to be "re-created". Furthermore, it is not possible to simply ignore all future references to the same selector, because a valid re-creation could occur. (One way to avoid this problem is not to allow entries to be re-created. But this is not assumed in this model.) To solve this problem, the F and CT elements were introduced into the 5-tuple (ℓ,V,F,CT,T), where F is a deletion flag and CT is the time-stamp of the creation, as mentioned before. A creation will have CT=T and F = not deleted; an assignment has CT < T.

Thus each data base manager keeps two sets of tickets, s1 and s2. The set s1 is known as the current set, s2 as the next set. Whenever a new set of tickets arrives, s2 becomes s1 and the incoming set becomes s2. Whenever the data base manager exhausts s1 before a new set of tickets arrives, it transforms s2 into s1 and notifies the reservation center. The reservation center is then expected to expedite the generation of the next set of tickets, while at the same time increasing the allocation of tickets to the given data base manager to help to avoid future shortages.

When a new set of tickets (s) arrives, the local time at arrival is attached. The validity of a given set of tickets extends for 2m+δ (in general nm+δ) where m is the expected time until s becomes s1 (i.e., begins to be used), m is the expected time for s to be used as s1, and δ is a small tolerance factor for unexpected delays. When this period of validity expires, s2 is transformed into s1. An "alert" state could then be entered. During this alert state, the reservation center is watched carefully to determine whether it has failed. A timeout could be used to detect such a failure. Recovery procedures could then be initiated. (See section III.4.4.)

III.4.3 Operations available

If we desire to include Johnson's model as a special case of the RC(m,n) model, we have to pay the price of limiting the number of operations. Thus we will assume that the discussion presented in section III.2.3 is applicable here with almost no exception. For reasons presented later (section IV.2), we will slightly modify the procedure for detecting that it is safe to remove a deleted entry. In the RC(m,n) model a deletion is performed in two steps. 1) When the deletion message arrives, the value V of the corresponding 5-tuple is modified to a special value indicating that the entry has been deleted. 2) Whenever all tickets with a smaller value than the ticket of the deletion message have arrived (this is discovered by examination of the proper row in the A matrix), we can proceed to the physical deletion, if the value V still indicates deletion.

The actual set of operations could be increased if we follow the ideas to be presented in section IV.5. However, section IV.6 will point out the enormous restrictions on any such extension. For this reason, we won't make an effort to extend Johnson's set of operations. Such an extension would only complicate our comparison of the models (in the next chapter).

III.4.4 Miscellanea

Recovering from a reservation center failure implies first establishing a new reservation center and second providing this new reservation center with a lower bound for the ticket numbers that it should generate.
This lower bound should be bigger than the biggest ticket in any of the sets of tickets from the previous reservation center. These two steps could be done at once. When a reservation center failure is suspected, a data base manager will call for an election by sending an election message to all other managers. This message includes its "political promise"; i.e., the lower bound that it would assign if elected. If any of the receiving data base managers thinks that it is more suited for the post of reservation center, it will return a "keep quiet" message and broadcast its own political promise. On the other hand, a data base manager would indicate its acceptance of a political promise by sending a "submit" message. This process could produce a chain reaction of political promises which should end up with only one manager getting only submit messages. At this moment the given data base manager will set itself up as the new reservation center.

Chapter IV. EVALUATION OF THE MODELS

IV.1 Generalities

In this chapter we will try to make an evaluation of the models described in the previous chapter. From many points of view we are comparing apples and oranges, especially because we are not aiming toward any preestablished environment. With this in mind, we look for features that will characterize the advantages and restrictions of each model.

We will start by analyzing each model separately (sections IV.2, IV.3 and IV.4), studying completeness and applicability to an environment similar to the one it seems to address. We will then (section IV.5) compare them in an environment which could make use of any of the three models. Next (section IV.6), we will study the limitations that each of the models has if we attempt to apply them in a more general environment. Finally (section IV.7), we will present our conclusions and note the unresolved topics of interest. These topics will be treated in the next chapter.

In our evaluations we will always try to keep in mind that we are looking for a workable version of a distributed data base system. Thus, aspects such as reliability will be high on the list of things to look for in each model. Reliability during normal operation will be considered, and problems encountered will be presented. We will also be concerned with reliability during failures, either single failures (e.g. a single computer goes down) or multiple failures (e.g., several computers fail simultaneously), including network partitioning, which is possibly the hardest problem in this area. (A network is partitioned when it is divided into two or more subnetworks that do not maintain communication between them.) Some other aspects to evaluate in each model will be the set of available instructions, possible protocol problems, memory requirements, etc.

IV.2 Evaluation of Johnson's model

In studying Johnson's model we first have to correct a small omission in order to make the model workable. Johnson does not mention how and when the vectors of his A matrix should be exchanged by the different data base managers. In his discussion there is no apparent reason why a restriction should or should not be imposed. This seems to indicate that the exchange could be done independently of the update exchange, which is expected to be sequential. Failing to impose a sequential restriction on the broadcasting of these vectors could produce races, as we will describe below.
The A vector from data base manager i could not be sent to data base manager j until all the pending updates from i to j are successfully transmitted. This means that we expect these vector interchanges to follow the same sequential order that the updates do. Thus, if data base manager i wants to send its vector to j it should store it in the proper queue and await the appropriate time.

An example of a race condition follows. Johnson says that a given entry with a delete mark can be physically removed when everybody has been notified of the deletion. Let's assume that data base manager k originates a delete message for record R. A data base manager j will recognize that the delete message has been seen by everybody when the k column of its A matrix has values larger than or equal to the tag of the update deleting R. However, it could well be that everybody has received the deletion message, but two data base managers, i and j, have been slow in sending each other some older updates for record R. For example, i and j may have received the delete message, but in i's queue for messages to be sent to j there still exists a message referring to R. If manager j deletes the entry because the k column of its A matrix satisfies Johnson's condition, then it will incorrectly re-create R when it receives the old update from i.

The particular race condition of the previous paragraph could be avoided if we take one of the following approaches.
a. Instead of checking the k column, manager j could check the j row. If all the entries in the j row are larger than the time-stamp of the deleted entry, the entry can be removed. It is certain that nobody could possibly have an older update for R.
b. Manager j should not re-create an entry from an assignment (CT < T).

We will consider this problem further in the next chapter. After the subnetworks get together, communication should be established between each pair of previously non-communicating data base managers just as if a single failure had occurred.

We are handicapped in evaluating Johnson's model by the fact that he does not give enough detail about the complete protocol. (For example, there is nothing on how to detect lost messages, etc.)

IV.3 Evaluation of Bunch's model

Bunch does not pretend that his model is complete. On the contrary, he points out a few of the model's problems himself; namely:
1. The speed of the algorithms is low.
2. The cost to produce and to operate the scheme is high.
3. The missing-update problem is not satisfactorily solved.
4. Unnoticed problems probably exist, since detailed analysis and implementation have not been performed.

In his first point he refers to the algorithms to handle each one of the operations and the recovery procedure. The amount of interaction between the primary and the backups is high and thus has a considerable impact on the throughput of the system as well as its cost (second point). We would add that the centralized protocol brings some additional complications. The model's throughput is fully dependent on the throughput of the primary. A high overhead in the primary computer or in the communication channels going to and from the primary would affect the system's capacity; i.e., we have a potential bottleneck.

(It should be noted that we are referring only to consistency problems. However, other types of problems could arise. For example, a user could be making decisions based upon data that is completely outdated.)
The third point refers to the problem of losing updates if multiple failures occur. It is possible that all holders of a given update fail (i.e., user and primary) before it is broadcast to all other sites but after the sequence number U has been returned from the primary. If this happens, it is quite possible that a new primary will reassign the same sequence number to a different update with unfortunate consequences. The solution that Bunch presents is that the primary should: a) send the incoming update to all the backups, together with the newly assigned number U, b) send the number U to the user, and c) signal all backups to go ahead and apply the update. The reason why Bunch does not consider this a satisfactory solution is clear; the overhead of this solution is very high and would significantly reduce the throughput of the system.

The last two items on Bunch's list of problems bring up an essential point: How resilient is the model? Unfortunately not very. As Bunch points out, if multiple failures occur during certain intervals (e.g., during primary recovery), the system breaks down. Network partitioning is a severe problem for this model. When a network gets partitioned into various subnetworks, we might end up with various primaries. We then have a problem of detection and correction. Detection relates to the determination that some sequence numbers have been duplicated by the various primaries. Correction involves merging the different versions of sequence numbers together. This merger is in general extremely complex. No solution is known to us. One of the problems with this merger is that updates made by one subnetwork could completely nullify others made by a second one. Example 3 of section III.1 shows the kind of logical inconsistencies that we could encounter. A possible palliative solution is discussed in section IV.4.

Some problems must be solved for this model to be operational. For example:
a. In Bunch's recovery method (algorithm 6 in [B5]) only user-controllers which are operational cooperate with a newly elected primary. Thus updates from a user-controller which is down are not taken into account even if other backup copies have already received them. Backups should probably be more active in the recovery process.
b. The backup copies journalize only after the primary tells them to do so. This happens only after the primary has found out that a backup data base manager is down. Thus, by the time the primary notifies the backups, they might have disposed of a required journal entry.

With respect to the potential set of operations that could be supported by a model of this type, there is no possible complaint. Since all the data base managers are expected to apply the updates in precisely the same order, there is no source of complication in that area. Any set of operations could be supported.

Bunch presents three types of operations: critical read, non-critical read and modifications. The critical read was introduced to support those types of applications which require an up-to-date version of the data (e.g., an airline reservation system). We should point out that the information that gets back from a critical read cannot be considered up-to-date any more. As soon as the primary answers such a request it could get involved in a modification that changes the status of the data returned. In such cases some type of a test-and-set operation (or other semaphore) might be useful. (For example, we could implement an operation that adds an item to the reservation list and notifies the user whether it was successful or not.) Reliability problems (e.g., the missing update issue) make this a complex solution. What Bunch seems to have in
(For example, we could implement an operation that adds an item to the reservation list and notifies the user whether it was successful or not.) Reliability problems (e.g., the missing update issue) make this a complex solution. What Bunch seems to have in 49 mind to solve this problem is to send a modification, followed by a critical read to verify the success of the modification. It is our belief that the scheme of critical and non-critical read is unnecessarily unclear. It is not clear who is going to indicate that we want one or the other. If a user makes an update and then decides to verify its correct execution, it is clear that he wants a recent version of the data base to be the one which answers the query; but is a critical read enough? The update can be trapped in many places. The user might not get the expected result and might wonder if the update was incorrectly presented to the system or hasn't gotten to its destination. We therefore suggest that the read operations be slightly modified. Instead of two separate operations we would have one, but with a parameter (£u) . The £u parameter will indicate the sequence number of the last update that must be seen before the answer is evaluated. Based on the value of £u and the communication status between the user-controller and the primary, the controller might decide that the primary should answer a given read (&U relatively recent; i.e., a critical read) or that any other copy might be queried (£u relatively old). This approach could improve the response time. The user-controller could maintain an £u parameter for each user and take appro- priate action whenever a query is requested. However, this should be decided by the particular application manager. With respect to memory requirements the overhead is relatively small. There is no need of any additional storage per record as in the other models. Only some space for communication tables (e.g., the precedence list) and the journal is required. ■ 1 m B ■■*> 3 IV. 4 Evaluation of the Reservation Center model The Reservation Center model is in some sense a combination of the previous two models. Thus it inherited some of the advantages and disadvantages 50 of each of those models. The broadcasting protocol (i.e., how, where and when updates are sent) and the application strategy (i.e., resolution of conflicts by comparing tags) are similar to those of Johnson's model. The election process (for reservation center recovery) is closely related to Bunch's primary election process. As an outcome of this inheritance, the Reservation Center model has the same limitations that Johnson's model has with respect to memory require- ments and the restricted set of operations which are available. From Bunch's model it inherited the problems with partitioning (a palliative solution will be discussed later in this section) and elections. On the other hand, we avoided the inter-clock synchronization problems discussed in section IV. 2 and the bottleneck potential discussed in section IV. 3. The bottleneck is avoided because the reservation center's activity is much more limited than the primary's. Each period of activity of the reservation center is expected to provide enough tickets for an interval of length m, while the primary must take separate action to provide a number for each update. The Reservation Center has a few characteristics of its own. None of the computer systems that we know of is capable of precisely reflecting the order in which things happen in real life. 
For example, if we have two ter- minals connected to a computer, it is not clear that, if two users send a message to the computer almost simultaneously, the order in which the computer will honor them will be related to the precise time they occurred. Polling order, interrupt priorities, cable length, etc. are capable of altering the real order. We thus say that there exists an uncertainty period. Given the average speeds of computers and I/O channels, we would say that the uncertainty period of the previous factors (polling, etc.) is relatively small. In Johnson's model, the clock synchronization problem causes a potentially large uncertainty 51 period. What has been done in the Reservation Center model is to incorporate this uncertainty period into the model. If two updates use tickets from sets generated at the same time, it is not certain (in general) which will be honored first. Thus the uncertainty period has been expanded to be at least of size m. Siven that perfection is not possible (i.e., it is impossible to eliminate the uncertainty period) , the disadvantage imposed by this model seems to be toler- able for small m. It is not clear that the absence of uncertainty is really a highly desired feature. A slower typist could produce similar results anyway. In the Reservation Center model, the reservation center periodically sends a message (a set of tickets) to each data base manager and expects to receive an acknowledgment. The reservation center thus has a reasonably up-to- iate knowledge of the status of the distributed data base system. This know- Ledge could be used to obtain a palliative solution for the partitioning problem; namely: the reservation center should not issue new sets of tickets if a majority of the data base managers have not acknowledged the reception of the last sets. (This majority could be a weighted majority in which different data base managers have different weight factors.) The inactivity of the reservation center (due to deliberate inactivity, partitioning, failure, etc.) would trigger an election process, which, if a majority of managers is available, would establish a new reservation center. The above solution was called palliative because some applications might not be able to afford the inactivity that being a member of a minority of managers (due to partitioning or to multiple failure) would require. These managers would have to suspend their update activity until a majority is estab- lished again. A similar technique could be used in Bunch's model. However, the primary does not have the same control over the other managers that the a m* a Mi ■ i 3 p :9 - 52 reservation center does. That is, there is no interaction when the manager is inactive. An adequate extension to Bunch's protocol (e.g., some kind of "are you there?" message for periods of inactivity) would be required. IV. 5 Comparing the models in a similar environment IV . 5 . 1 Introduction To be able to compare the models effectively we decided first of all to study them in an environment that could make use of any of them; i.e., in an environment that forms a least-common-denominator. We will later eliminate this restriction and deal with the inherent limitations of the models when we try to apply them in a more general environment . The least-common-denominator environ- ment will be covered in this section while the unrestricted comparison will be left for the next one (section IV. 6). In this section we will start by defining the least common denominator (IV. 5. 2) and then (IV. 5. 
3, IV. 5. 4), assuming that environment, try to answer several key questions. The following outline of the questions we will be examining also serves as a guide to sections IV. 5. 3 and IV. 5. 4. IV. 5. 3 How good are the models during normal operation? IV. 5. 3.1 How well do they maintain the "real" update order? IV. 5. 3. 2 How fast are they? IV. 5. 3. 2.1 Local application delay. IV.5.3.2.2 Non-local application delay. IV. 5. 3. 2. 3 System's throughput. IV. 5. 4 How good are they during failures? IV. 5. 4.1 During fatal failures. IV. 5. 4. 2 During non-fatal failures. 53 IV. 5. 2 The least-common-denominator environment To set the ground rules for our comparison, we will evaluate the performance of the different models in a similar environment. This environment has been chosen to be the least common denominator of the target environments of the different models. We thus assume: a. We are dealing with a single computer network. Thus any comparison based on the use of one particular computer system or network as opposed to any other will be ignored. For example, when we will be dealing with the different timings, we will define the time an update enters the system as the time the update was first read by a data base manager. Thus, we neglect queueing delays, operating system overhead, etc., that would be common to the three models. b. All models are assumed to have an update broadcasting protocol that guarantees the correct and sequential transmission of updates. c. A network delay with the characteristics described by Kleinrock [Kl,K2^ is assumed. Whenever specific numbers are needed to make a point, we will turn to the ARPA network data. Naylor et al. [Nl] present some observed data (collected in December 73) for the ARPA network. They give the average round-trip delay as a function of the number of 2 hops between the source and destination of a message. The results shown indicate that the mean round-trip delay is 50 milliseconds for 1 hop and 809 milliseconds for 13 hops (the maximum presented) . ,C i s Round-Trip means the time it takes from the moment a message is sent until the positive acknowledgment, the RFNM, is returned. 2 A hop is the transmission of a message from one communication processor to another. The number of hops is equal to one less than the number of communica- tions processors touched by a message. 54 d. In general the network delay time will be dependent on communica- tion capacity, network topology, message transmission protocol, net- work load, etc. In any case most current networks have delays of less than one second. The set of operations and the data base organization are assumed to be as described for Johnson's model. (See section III. 2.) IV . 5 . 3 How good are the models during normal operation? IV. 5. 3.1 How well do they maintain the "real" update order? As mentioned previously, none of the models presented (nor any other that I can think of) reflects perfectly the "real" update order; i.e., the actual ordering that the updates will follow differs slightly from the real order in which they occurred. This is because of inherent characteristics of the models and because of physical properties, such as propagation delay. Rather than on perfect replication of "real" update order, emphasis is put on perfect consistency. This means that slight defects in the ordering are accepted if we guarantee the same ordering in every data base manager. 
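Since network delay is the yardstick against which these small ordering defects will be measured below, it is convenient to have a rough figure for intermediate hop counts. A linear interpolation between the two observed ARPA values is one such rough figure; linearity is an assumption made here for illustration and is not claimed by Naylor et al.

    # Rough round-trip estimate for intermediate hop counts, interpolating
    # linearly between the observed ARPA network values (50 ms at 1 hop,
    # 809 ms at 13 hops).  The linear form is an assumption for illustration.

    def round_trip_ms(hops):
        ms_per_hop = (809.0 - 50.0) / (13 - 1)     # about 63 ms per hop
        return 50.0 + ms_per_hop * (hops - 1)

    print(round_trip_ms(4))                        # roughly 240 ms
    print(round_trip_ms(13))                       # 809 ms, by construction

Whatever the exact shape of the delay curve, such figures remain well under the one-second bound mentioned in assumption d above.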
Although it does not seem to be a critical issue, we now study the uncertainty period (i.e., the probability that an update entering at time t gets an ordering tag larger than that assigned to an update entering the system at time t+At) . a. Bunch' s model . In Bunch's model, the most obvious cause of asynchrony is the different network delay for different data base managers; i.e., two updates, u. and u 2 , entering the network in that order through data base managers DBM. and DBM. respectively, will obtain an inverted name order if the difference in delay from DBM. to the primary as compared with the delay from DBM. to the primary is enough to compensate for the difference in the time u and u arrived. 55 The other end of the line, the primary, will be responsible for some added delay, but as long as it follows a first-in-first-out discipline its effect is almost null. Finally, we have potential delay from the protocol implementation, since it is required that the order of updates generated at a single site be preserved. This problem has been eliminated by our assumption of a least common denominator . We are left with network delay as the only source of asynchrony in this model. Thus the uncertainty interval in this model is small. That is, the probability of an update entered at time x getting a bigger name than another update entered in a different data base manager at time t+At is expected to be very close to zero for Ax> 1 second. b. Johnson's model It is extremely hard to establish the uncertainty period in this model. The only cause of asynchrony is the clock setting difference. It is impossible to give any universal statement without more information on the clock-setting mechanisms of the particular computer systems that we are going to be using. Using the Burroughs B-6700 computer system as an example, we realize that the clock is manually set by the operator. Furthermore, during an operating system reinitialization the clock is reset to a clock value indicated in the boot-strap deck. The operator is generally required to modify the setting. Under these circumstances, it is not hard to imagine how clocks could differ by a few minutes, and occasionally by days. (For example, the operator might forget to change the date in the boot-strap deck.) Our only conclusion is then that the uncertainty of this model could potentially be very large. (A way to avoid very large uncertainties is ■ .1 II .• ! •*«» ■ Mi ,- »<• ■4 I 3 ■$ ."> ■r P 9 <• :•«= !» '.* m #» » -.: r<* a \ « •■ 56 presented in the next chapter.) But it wouldn't be strange to have an uncer- tainty on the order of a few minutes. c. The Reservation Center The uncertainty period built into this model (see section IV. 4) has led us to deal with this issue for the other models as well. The uncertainty is at least as big as the period m. Some minor delays (basically network trans- mission delays), similar to the ones discussed for Bunch's model, also exist. Given the parametric character of m, it is hard to evaluate the possible uncer- tainty of this model. If we were asked to rank the models with respect to the uncertainty issue, we would probably choose the order: Bunch, Reservation Center, and Johnson. Even if for the Reservation Center the uncertainty could be as great as in Johnson's, it is a much more controlled environment. Furthermore, the values that we would typically expect for m are on the order of a small number of minutes, or even a few seconds. IV. 5. 3. 2 How fast are they? IV . 5 . 3 . 2 . 
1 Local application delay By "local application delay" we mean how fast can the update take place. This turns out to be an important issue for a distributed data base system. On many occasions a user will make reference to data he just entered or modified. In that circumstance, it is important that the data base manager he is dealing with be capable of applying updates as soon as possible. In the Johnson and Reservation Center models, there is no reason why the corresponding data base manager shouldn't start local application as soon as possible. This is not the case in Bunch's model. In this model, the updates must be sent to the primary first and then the primary will be in charge of the 57 broadcasting. At least two network delays are thus necessary (if the user is not dealing with the primary directly) . Other delays are also present in this scheme; namely, primary queueing time, primary processing time, etc. The local application delay is therefore a potential problem for Bunch's model. In Bunch's model the critical read could aid in this situation, but a more generalized read operation (with an £u parameter; see section IV. 3) would be a more useful tool. If queries and updates are treated using different queues, a similar logical race problem could occur in the Johnson or Reservation Center models. For example, a query to check if an update has been performed might be honored before the update. However, the small domain of each update (which refers only to one entry; i.e., one field) could make it acceptable to process queries and updates in a single queue and avoid this logical race. If we were asked to rank the models again we would end up ranking Johnson and the Reservation Center first and Bunch's model last. ■ * 9 .« ,9 IV . 5 . 3 . 2 . 2 Non-local application delay "Non-local application delay" refers to how long it takes for all the remote data base managers to obtain the updates. Thus, non-local application delay is an important factor because it actually gives an estimate of how up-to- date the data base copies are. Again, Johnson's model and the Reservation Center model have an easier approach. Updates are broadcast to all other data base managers as soon as .*> s 8 See our discussion in IV. 5. 2 or Kleinrock [Kl,K2] or Nay lor et al. [Nl]. 58 possible. One network transmission delay is the nominal non-local application delay. In Bunch's model it takes the same delay for local and non-local application (unless we are dealing with the primary) . Thus two network delays (from user to primary and from primary to backup) , plus the time it takes the primary to process the update, should be accounted for. Again our ranking would be Johnson's and the Reservation Center models first and Bunch's model last. IV . 5 . 3 . 2 . 3 System throughput The amount of network traffic is similar in the three models (i.e., all data base managers get a message for each update and only one of them gets a message for any query). However, the centralized broadcasting of Bunch's model is a source of concern. There is a definite effect on the system's throughput. To make our point clear we will work out an example. Let's assume that we have a network with three sites (si, s2, and s3) , each one of them containing a data base manager (DBM1, DBM2, and DBM3 respectively), and some other background work (Bl, B2, and B3) . DBM1 acts as the primary and contains the complete master copy. The other data base managers maintain their own complete backup copies of the data base. 
Communication is established by six half-duplex communication lines as shown in figure IV.5-a, line L. . going from site i to site j (in that direction only) . The capacity of these lines is assumed to be the same and will be denoted by H. In the three models some delay in local and destination sites should be considered, but since we are concerned only with comparing the models these delays have been neglected in all three of them. 59 s2 Figure IV.5-a Graphic presentation of 3-node example, si, s2 and s3 are computer systems, L. . is the communications line between system i and j , DBMi is the data base manager located at site i and Bi represents background activity presented at site i. s3 Let's for a moment assume steady state conditions with a constant rate of updates entering each site and requiring .75H to be transmitted. We further- more ignore all other existing acitivity at each site (Bl, B2, B3, queries, etc.). Under Johnson's and the Reservation Center models this situation is perfectly feasible. Each line would be used at 75% its capacity. However this can not possibly be a steady state situation in Bunch's model. Using this model would require that lines L 01 and L 01 will be used at 75% of their capacity, L, 21 31 J 23 and L will remain inactive, and lines L- _ and L, _ will require at least 150% of their capacity! Each of the lines L.. „ and L.. „ would be required to use 75% of its capacity for updates entering at si, another 75% for updates entered at the other manager (DBM2 for L and DBM3 for L ? ), plus some additional traffic 1 ".> < _« r. 1 .. 1 V s* a •M ,'» E J ;* :> m 1 9 XI '.■* r> ■i m 0* m ■■.: .<* 60 in the form of names that are being returned to the corresponding manager. This totals over 150% of the line capacity, with the expected negative consequences on the non-local application delay. Similar results are obtained with a smaller update rate if we assume that there is some activity going on between Bl, B2, and B3. For example, suppose that Bl is transmitting a long file to B2 and B3. Under these circum- stance DBM1 will again have trouble in using L „ and L „. This last effect is also present in the other models but its consequences are much smaller. First, the traffic that originates at DBM1, and in general, the total line traffic is smaller. Second, lines L_ and L„ 9 are usable and would permit DBM2 and DBM3 to obtain each other's updates regardless of activity in lines L „ and L-... Third, DBM1 will obtain all updates from the other data bases. To stress our point we have used high line traffic in the previous examples. We could obtain a similar situation on a larger distributed system with less traffic. The above explanation is somewhat oversimplified but is enough to explain the relevant ideas. For example, we have neglected the network traffic that the positive acknowledgments of a reliable protocol will require. We then conclude that, with respect to the communication lines, the throughput is potentially better in Johnson's or the Reservation Center as compared to Bunch's model. The second factor that we will consider with respect to overall throughput is local overhead. All the models will have some expected delay when there is local overhead due to other programs running in the same system. In all three of them we will expect a slowdown in local operations. We are concerned with the effect that high local overhead could have in the overall data base system. 
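The line-utilization figures worked out above can also be re-derived mechanically. The following sketch assumes the stated steady-state load of 0.75H per site and ignores all other traffic (queries, acknowledgments, returned names); it merely reproduces the 75% and 150% figures and is not part of any of the models.

    # Per-line utilization (as a fraction of the capacity H) for the
    # three-site example.  Each site generates update traffic of 0.75*H;
    # queries, acknowledgments and returned names are ignored.

    sites = (1, 2, 3)
    load = {1: 0.75, 2: 0.75, 3: 0.75}

    # Johnson / Reservation Center: every site broadcasts its own updates.
    broadcast = {(i, j): load[i] for i in sites for j in sites if i != j}

    # Bunch: updates funnel through the primary at site 1, which then
    # redistributes them to the backup copies.
    primary = {(i, j): 0.0 for i in sites for j in sites if i != j}
    primary[(2, 1)] = load[2]              # updates sent to the primary
    primary[(3, 1)] = load[3]
    primary[(1, 2)] = load[1] + load[3]    # primary's own plus forwarded
    primary[(1, 3)] = load[1] + load[2]    # 1.50 -- beyond line capacity

    print(broadcast)   # every line at 0.75
    print(primary)     # L12 and L13 at 1.50, L23 and L32 idle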
In 61 Johnson's model, high overhead in a data base manager causes practically no symptoms at a second data base manager. Communications from the overloaded data base manager will be slowed down (as well as local update arrival, if a positive acknowledgment is required), but this is about all the effect. This statement is valid for all data base managers other than the primary in Bunch's model, and all but the reservation center in the Reservation Center model. The Reservation Center sensitivity highly depends on the value of the parameter m that is chosen. If m is large the reservation center will seldom go to work and probably the effect will pass unnoticed. However, if m is small the slowness of the reservation center will certainly be noticed since the data base managers will start exhausting their ticket supply and sending messages to the reservation center asking for more. This will make the reservation center increase the allotment of tickets per emission which will probably stablize the situation (but using an "effective" m different from that planned.) Bunch's model is much more sensitive to high local overhead at the primary site. Actually the overall update throughput of the complete distri- buted data base system is highly dependent on the primary performance. We then conclude that the ranking with respect to potential throughput is: Johnson, Reservation Center (very close to Johnson's) and Bunch. ■ S * ■■:. IV. 5. 4 How good are they during failures? The study of failures is a broad and fertile field for research. There are many combinations to take care of. It is difficult to be certain that we have enumerated all of them. We will therefore consider broad classes of failures, and not concern ourselves with all possible events that could cause trouble. 62 We then classify failures as follows: Fatal failures. a. Single failures. i. In hosts and minihosts ii. In the network b. Multiple failures, partitioning. Non-fatal failures. IV. 5. 4.1 Fatal failures This type of failure is characterized by the logical inactivity of the element that fails. a. Single failures i. In hosts and minihosts. A fatal failure causes the corresponding system to suddenly stop any activity. There are two types of problems, internal and external. Internal problems refer to the damage that a system causes to itself and its users. This is an important problem in many current computer installations. Techniques to avoid substantial damage from such occurrences are widely known (Martin [Ml]) and include redundant hardware, journalizing, backups, check points, etc. Our distributed data base study does not give any more insight into these problems. We will thus neglect the problem, and expect the appropriate use of resiliency techniques in the data base managers' design. External problems are a typical example of a new complication intro- duced by a distributed data environment. One must establish what a given data base manager (DBM.) should do when another manager (DBM.) fails. The obvious step for DBM. would be to stop any further communication with DBM. as soon as it 63 is aware of the failure. The three models behave similarly under these circum- stances and we won't enter into the details. In Johnson's model this is practically the only external problem that must be taken care of. In the remaining two models there exists an element of discord. All but one of the possible system falures would cause only the prob- lem mentioned. 
But in the case of failure of the primary in Bunch's model or of the reservation center facility in the Reservation Center model, we have some additional aspects to be covered. Bunch [B5] describes an election protocol to be carried out when a failure of the primary is detected. (See section III. 3.4.) The process is started as soon as a failure is detected (probably through a time-out of some network message establishing an unsuccessful message delivery due to a dead host) . The election process involves an exchange of messages between all primary candidates and all other data base managers. A few of the disadvantages of having to have such a procedure are: a. The election process is started at the worst moment, when some data base manager is waiting for the primary to return a tag. That data base manager will necessarily have an additional delay for one of its updates. b. The primary recovery procedure presented by Bunch ([B5] Algorithms 9 and 10) are costly in time. After the new primary is elected, it has to establish its working situation: to obtain all pending updates from all the user sites and to establish if names have or have not been assigned, to process any pending updates, etc. c. There is no provision for any unique order for the updates which have not gotten a name. This would greatly increase the uncertainty period tf»JJ f 1 :■ g 64 described in section IV. 5. 3.1. Notice that points A and B also affect the uncertainty, d. The protocol should be enriched with a safeguard against failure while the primary is recovering. Given that there are a few moments where there is no flow expected toward the primary (i.e., the primary has all the facts and is processing them) , some time-out would have to be established for detection of such a failure. The Reservation Center model has a corresponding problem. When the reservation center fails, a similar election must be carried out. There are, however, some clear differences: a. Since at least one extra set of tickets is available at each data base manager, failure of the reservation center will probably be detected before all the local tickets are exhausted. Thus no updates will incur an additional delay to affect the uncertainty period. b. The new reservation center does not need to know precisely the ter- minating situation of the last reservation center. Obtaining an estimate of the highest ticket it generated would be enough. Thus the time required for the new reservation center to start operation would be rather short. (This estimate could be used in the election process, so it would be known to the elected reservation center.) c. Since the updates are expected to keep flowing while the recovery is going on, no measurable difference would be expected in the uncer- tainty period as discussed previously. d. Detection of failure during recovery should be added to the protocol, as in Bunch's case. 65 ii. Single fatal failures in the communications network. Single fatal failures in the network could occur in the communication lines or in the communication processors (CP) . The effect of a fatal failure of a communication line varies with network topology. In the ARPA network all CP's are at least two-connected, and a failure in one communication line will at most affect the network delay. If the topology is star-like or ring-like, the consequences could be more severe. In these cases, the magnitude of the failure must be considered (i.e., whether the failure was only in one direction or in both, etc.). 
These cases will not be dealt with directly in this thesis. Other cases discussed will exhibit behavior similar to these. Failure of a CP produces for all practical purposes the illusion of a system failure, with basically the same external problems discussed earlier. However, a completely new internal problem occurs. The data base managers attached to the CP find themselves isolated from all of the network. They suffer the illusion that a multiple failure affecting all other systems occurred. This is the simplest case of partitioning, which will be discussed below. If we have to rank the models with respect to their sensitivity to single fatal failures, we would certainly put Johnson's model as the most insen- sitive one. The Reservation Center would go next because of the simpler elec- tion protocol. Finally, Bunch's model is considered to be the most sensitive of the three. b. Multiple failures Multiple failures refer to multiple occurrences of simple fatal failures. In many cases, the consequences of the effect of a multiple failure can be obtained by simple addition of the effects of the individual failures. ■ ■1 3 ■■■ .1* 66 But in other cases, there will be an additional problem due to the multiplicity. We next deal with one such case, partitioning of the network. Bunch [B5] discusses the problem but does not give any solution. In section IV. 4 we presented a possible solution, a minority partition (an isolated set of less than half of the data base managers, according to some weighted criterion) stops any update activity. In the discussion of Johnson's model (section IV. 2) we explained how this model has an advantage with respect to this problem. If we have to rank the models with respect to their insensitivity to multiple failures we would have to start with Johnson's model, followed by the Reservation Center and Bunch's model in that order. This ordering follows the increasing complexity of their behavior during network partitioning. IV . 5 . 4 . 2 Non-fatal failures Non-fatal failures are difficult to handle. There are generally many manifestations of such failures with different grades of difficulty for detec- tion and prevention. If a component or system detects that another component or system is failing, it will be essential to establish which one is really failing the detected or the detector. Some redundancy or double checking would be helpful for such cases. Once a component or system is detected to be failing, we should con- sider how to keep it from causing more trouble. As Bunch [B5] says, how do we make him commit suicide? The failing part may not be willing or capable of committing suicide. Due to the difficulty involved, none of the models directly attacks this area. We could optimistically expect such failures to produce inconsistent information which is easy to detect and discard (i.e., by parity detection, etc.). As mentioned before, we rely on the conventional techniques 67 (parity checking, resiliency protocols [Dl], etc.) to assist us. However we are never certain that a failure did not escape detection. IV. 6 Overall power of the models In the previous section we tried to compare the models under a very strong assumption; namely that the environment was a least common denominator that could make use of any of the given models equally well. The purpose of such a restrictive assumption is clear when we realize the different characteris- tics of each model. 
What we have done, however, has been to restrict the power of the models in certain respects to suit the common environment. In this section we will eliminate this restriction and tackle the job of exposing the real power of each one of the models. By "power" we will mean the collection of operations that a given model is capable of supporting. Immediately we know that Bunch's model is very powerful. Given that all the data base managers are expected to apply all the updates in precisely the same order, there is no restriction on the type of operations that the model is capable of supporting. The same comments are not valid for the original presentation of the other two models (which we will now call non-primary models). We question such a design restriction. In this section we will determine whether and to what extent we can relax this restriction. In section IV.6.1 we will build up a series of definitions and concepts required for our future discussion. We will end up with some idea of how to extend the set of operations of a non-primary model. In section IV.6.2 we will discuss some complications which would arise if we actually tried to implement a non-primary model with an extended operation set.

IV.6.1 Concepts and definitions

IV.6.1.1 Notation and basic definitions

We start with some notation. An update u is denoted as a triple (op, p, tag), where op is an operation type, p a set of parameters for the given operation, and tag is the synchronization tag used in any of the given schemes. The p factor is further defined as a pair (ℓ, v), where ℓ is the parameter (or parameters) used to identify the location where the operation is going to take place, and v is the set of values required by the given operation (e.g., the new value to be assigned, or an increment to be added). P will denote the universe of the set of parameters p. A data base db is defined as a collection of bits. Given that data bases are expected to change dynamically with the application of updates, we will denote the state of the data base after i operations have been performed as db_i, with db_0 being the initial data base. The universe of data bases will be denoted as DB.

We now start with our definitions:

1) A data base system S is defined as a tuple (DB, P, db_0, op_1, op_2, ..., op_n), where db_0 is the initial data base for all data base managers, DB the universe of possible data bases, P the universe of possible parameters, and op_1, ..., op_n the operations available in S.

2) An operation is defined as a function op: P x DB -> DB; i.e., op(p, db_i) = db_{i+1}.

3) An operation op is said to be one-reversible in S if and only if for any p and db_i there exists an operation op^{-1} in S such that

   db_i = op^{-1}(p, op(p, db_i)) = op^{-1}(p, db_{i+1}) = db_{i+2}.

4) An operation op_1 is said to be n-reversible in S if and only if there exists an operation op_1^{-1} in S such that for all op_2, ..., op_n and all possible corresponding p_1, p_2, ..., p_n,

   op_1^{-1}(p_1, op_n(p_n, op_{n-1}(p_{n-1}, ..., op_2(p_2, op_1(p_1, db_i)) ...)))
   = op_n(p_n, op_{n-1}(p_{n-1}, op_{n-2}(p_{n-2}, ..., op_3(p_3, op_2(p_2, db_i)) ...))).

That is, applying op_1^{-1} after n-1 intervening operations yields the same data base as if op_1 had never been applied.

5) A journal entry for operation op_i(ℓ, v) is defined to be a pair (ℓ, v'), where ℓ is the same location indicated by the operation and v' is the value of the location prior to the application of the operation. A system is said to have a modification journal if the journal entry for each operation that is performed is appended to a file called the journal.
A system is defined to be j ournal reversible if it has a modification journal. An operation op. is reversed by taking v' from the corresponding . journal entry and storing it at location £. 6) An operation is said to be naturally reversible if it's n-reversible for all values of n. If an operation is naturally reversible or journal-reversible, then it is simply called reversible . 7) Two operations op and op„ are said to be commutative if and only if for all db. and all p. , p , op-,^ (p lf op 2 (p 2 , db ± )) = op 2 (p 2 , o? 1 (p 1 db ± )) 8) If op is commutative with itself (i.e. op. = op = op^ in the previous case) , then we say that op is commutative . 9) An operation op is said to be homing if and only if op (a,v), db i+1 ) = op (U,v), op' (a,v f ), db i )) for all possible operations op' in S, and all db . . ({, is a fixed location and v I * 3 ■1 - £ a and v 1 are arbitrary parameter set values.) 70 Example 1: Assign (£, v) The assign operation stores the value v at location £ regardless of its previous value. Assign is not one-reversible: given that we just performed an assign there is no way of knowing the previous value (destroyed by the assign) by just knowing the update parameters. Assign is not commutative: if we have two sets of parameters (£, v ) and (£, v~) the value of location £ differs if the order is exchanged. That is, it is v after assign ((£,v 1 ), assign ((&, v„) , db . ) ) but is v« after assign ((£,v~), assign ((£,v ), db . ) ) . Assign is a homing operation; regardless of the previous values a location £ will have the value v immediately after the operation assign ((£,v), db . ) is performed. Example 2: Iadd (£,v), Isub (£,v) . The Iadd (Isub) operation adds (subtracts) the value v to (from) location £. We assume that v and the contents of location £ are both integers. Integer addition is reversible in S(DB,P,db , Iadd, Isub). It's sufficient to subtract v rather than adding to obtain the desired result; i.e. Isub = Iadd . (An alternative system could use only Iadd by defining Iadd (£,v) = Iadd(£,-v).) Integer addition is also commutative in S, as is known from standard algebra. Iadd is not n-reversible in S(DB,P,db , assign, Iadd, Isub) for n>l. Once an assignment occurs it is impossible to undo any Iadd, and we haven't assumed any way of knowing if such an assignment has occurred or not. However, Iadd is one-reversible in the same system. IV . 6 . 1 . 2 Operation power The operation power is defined according to the number of operations required to modify any one field of a given logical record. An operation will be called a micro-operation if it must be used in conjunction with other micro- operations to modify a single field, a mini-operation if it alone can modify a 71 field, and a maxi-operation if it can modify more than one field (in one or more records) . A classical example of micro-operations arises in updating a chained element. The series of updates indicating every single change required to unchain and re-chain the modified element would he called micro-operations. Generally speaking, the use of micro-operations is not advisable in a distri- buted data base environment. This is because after the application of a micro- operation the data base could be left in an inconsistent situation (in the middle of changing a pointer, for example). Some locking mechanism must then be used to prevent interference between updates. 
If the application requires such micro-operations, it seems reasonable to extend the update to a collection of micro-operations packed together and specified to be uninterruptible. In this case we could as well reclassify this update as a mini- or maxi-operation, as appropriate. In the next category, mini-operations, we assume that only one update is required to modify any single value. This does not imply that chained elements are not permitted. The local data base managers are expected to have enough intelligence to infer such modifications, or the update is expected to contain a series of micro-operations which will have to be done in an uninter- rupted sequence. (Implicit or explicit locking might be required.) As an example of updates in this category, we mention the ones used in Johnson's model [J2]. The assign (&,v) operation described previously is a mini-operation when I identifies exactly one logical record. Finally, maxi-operations have the power to affect more than one record at a time. Many examples can be found in the relational data base literature. One specific example is an update that changes all green elements to blue. This update could also be performed as a collection of mini-operations, each one i 9 -■■■ •3 : 72 indicating the change in the color field for a single record. However, this is an extremely dangerous solution with side effects similar to those mentioned for micro-operations. Suppose that while we are changing the individual elements from green to blue we get another update indicating a change of all blue ele- ments to red. We could then turn out with some of the green elements not converted to red, since when the second update arrived they were still green. Again, some implicit or explicit locking mechanism is required to ensure consistency. IV. 6.1. 3 Theoretical systems SI. Single homing, non-reversible operation . We start with a system S1(DB, P, db , opl) where opl is assumed to be a homing, non-reversible opera- tion. (The assign operation mentioned earlier is a good example for opl.) The reader should notice that we use opl to indicate an operation type , while op. earlier denoted one of a sequence of operation applications.) Given that opl is not commutative, a precise and unique order must be established. Synchronization tags or any other mechanism that guarantees this order is vital for SI to be feasible. If all the opl operations can be ordered for application at the different sites in precisely the same order (Bunch's model) , there is not much left to say other than that each data base manager is expected to perform them in that precise order. If, on the other hand, arrival order is arbitrary but tags are available to sequence the operations uniquely (Johnson's or the Reservation Center models), we can handle SI if opl does not modify the field or fields that are used to select the records. In this case we can store the tag (G) associated with each instance of opl together with the field it modifies. At the arrival of an update u, the tags can be compared. If the tag of u is smaller than G, then we can forget about the update u. The homing property of opl guarantees that we will end up 73 with the same stored value as if u had been applied in its proper sequence. If, on the other hand, the tag of u is bigger than G, we perform the update and replace G by the tag of u. If opl modifies the field (s) that are used to select the records some problems arise. For example, let's assume that opl is a maxi-operation. 
The maxi-operation opl(£,v) is expected to have the power to refer to more than one record. This could be done explicitly (I = set of addresses) or implicitly (£ = description of characteristics). If we use implicit locations we could have problems during the application of updates. For example, if an update u- calls for all things that are green to be changed to blue, and update u_ requests that a specific part (PA) be changed to green, then the outcome depends on the order in which the updates are applied. If u. is executed first, PA will turn out to be green; otherwise it will come out blue. A similar situation will occur when explicit locations are used (i.e., the maxi-operation resembles a collection of mini-operation pasted together), but in this case problems arise in the emission of updates. The emitting data base manager will have to wait until all pending updates arrive, or risk that some of the missing updates will effect changes that would have modified the update. It should be noticed that not waiting will not cause a consistency problem in the data base (i.e. all copies will eventually be consistent, as opposed to what happens when implicit locations are used), but the data might turn out to misrepresent reality. Waiting is not realistic since we could end up with all the data base managers waiting for the others to act. The problems that arise when operations are allowed to modify fields which are used by other updates to select records will be more extensively discussed in section IV. 6. 2. S2. SI plus delete and create . One of the restrictions of Si's definition is that its size has been established to be static; i.e., there is no ■ ' ,e c ■ if 2 - -• " 74 provision to create or delete records. We now solve this defect by defining S2. To obtain the system S2(DB,P,db , opl, create, delete) we have added to SI two additional operations, a create operation which generates new records for the data base, and a delete operation which eliminates them. Basically all comments about opl in SI are valid for S2 as well. The create operation could be made explicit or implicit as discussed in section III. 2.2. In this latter case, S2 is reduced to S2(DB,P,db ,opl, delete) . Since creation does not commute with opl, the order of application is relevant. Furthermore, creation is not itself commutative, and we haven't established what should be done when two creations for the same location arrive. It is not clear that ignoring a second create operation is the right decision. Doing so could simply mask an error. Alternatively, we could avoid the occur- rence of two creates by two means. i) Eliminate the possibility of two sites generating create operations by restricting this function to a single data base manager. (This introduces some problems when the creator data base manager is not available.) ii) Eliminate the possibility of two sites generating a create to the same location by preassigning the locations that each data base manager could generate. The remaining problem is with respect to the non-commutativity of opl and create. What should be done when we get an opl for a nonexistent location? The various alternatives include: i) wait until the create arrives, or ii) perform an implicit create, but flag the given record until the create arrives. When the create arrives, just eliminate the flag. The homing charac- teristic of opl makes the initial value of that location irrelevant. The user would be expected to be notified of the flag. 
The delete(£) operation is assumed to eliminate the location(s) indicated by the I parameter from the data base. Delete is clearly not 75 commutative with any other operation. This introduces potential hazards. If delete is incorrectly executed before the corresponding create, the record won't be deleted as expected. Similarly, if delete is incorrectly executed before an opl operation, we would either end up with the record staying alive (due to implicit create) , or an eternal wait (if we follow the advice of waiting for the create to arrive) . These comments as well as the proper solutions were dis- cussed in section III. 2. 2 when we spoke of the delete operation for Johnson's model. At this moment Johnson's model can be explained in a few words. His model is an S2(DB,P,db , assign, delete) system with operations that do not modify the selecting field. It has implicit creates and deletes handled as described in the previous paragraphs. 53. A single commutative, non-homing operation . We now study the system S3(DB,P,db ,op2), where op2 is a commutative, non-homing operation. This system is probably the simplest one possible. In S3 the order becomes irrelevant and no synchronization tags nor any master-slave relation (Bunch) is required or useful. Mini-operations or maxi-operations could be used. The only possible problem would be if we lose an update. Resiliency protocols [Dl] could be used to take care of this hazard. Data base managers would be expected to perform updates as they get them. Usually the application delay will be small. But if rapid application is not important, the data base managers could be allowed some flexibility. We still have a slight complication. Again, as in SI, we are trapped with a fixed file size. There is no provision for creations and/or deletions. This leads to S4. 54. S3 plus create and delete . S4 is defined as S4(DB,P,db ,op2, \1 1 ■ I ■ 2 ■ 9 ■■■■ ■•« a create, delete) , with op2 as in S3 and create and delete as in S2, 76 Once more the problem is that the create and delete operations do not commute with the standard operation (op2) . The comments about create that were made for S2 are valid here if the create always initializes values to a standard value (zero for standard algebra). For example, Iadd(£,10) will create a record at I with value 10 only if create (£) creates a record at I with value 0. Otherwise some waiting for the create might be required. On the other hand, delete is hard to implement without tags. If we implement S3 and SA without synchronization tags, we cannot pull the same trick we used for S2. When we get a delete, we don't have any way of knowing if updates arriving at about the same time should go before or after the delete. Some alternatives could be: i) never to recreate deleted locations (i.e., we mark the record as deleted and leave it there forever (or for a long while) ; or ii) to use synchronization tags and solve the problem with the a posteriori idea of S2 and Johnson's model, i.e., the utilization of Johnson's A matrix to establish when a delete has been seen by all the data base managers. S5. Combination of SI and S3 . We will now deal with S5(DB,P,db ,opl, op2) , where opl is defined as in SI (and S2) and op2 as in S3 (and S4) . In this system there are two classes of updates, one for opl's and another for op2's. If, for any given record, we get only elements of one class, we could treat the updates as in SI or S3 (as required) . New complications arise when we mix the classes. 
Given that opl and op2 do not commute, the order of application is again relevant. If we incorrectly perform an opl before an op2, it will seem that we are in trouble since opl is not reversible. This is fortunately not the case. We just have to remember that opl is a homing operation. An operation which should have been performed earlier can just be ignored. That is, we ignore the op2 and reapply the opl. On the other hand, when an op 2 operation is incorrectly performed before an opl operation, we have some minor problems. We 77 have to exchange the order. This would seem to imply that we have to reverse op2, but this is not true. Given that opl is homing, we could forget about undoing the op2 and just reapply it after the opl. In summary, for each given field in the data base, we could have two tag fields (Gl and G2) , one for the tag of the most recent operation of each class. When an opl operation arrives we will compare its tag with the Gl tag and act just as we did in the SI system. If the incoming update has a tag smaller than Gl, we forget about it; otherwise we apply the update and store its tag in Gl. In this latter case we still have to consider the possibility of an op2 operation that arrived previously having a tag bigger than the new Gl. To do this Gl and G2 are compared. If Gl is bigger we are done; otherwise we should look for all possible op2's that should be reapplied. (This implies the existence of a journal where updates are saved for future use. In this case if that G2 is bigger we know that there exists at least one entry in the journal which should be reapplied) . In order to make this last operation not too slow a chained list might be used. When an op2 arrives things are simpler. If its tag is bigger than Gl, it should be applied; otherwise ignore it. Under this scheme op2 operations might have to be applied more than once. If this is a big problem, something like a delay pipeline might be advisable to minimize the probability of having to redo op2's. (See section V.2.) If operations which modify the selection field are used, then the implicit or explicit location problems discussed in SI are still with us. Once more we have ended up with a static file size, but this time we will solve it by just saying that the solution is similar to the one given in S2 and S4. S6. A three-operation system . Finally, we will study the system S6 (DB,P,db ,opl,op2,op3) where opl and op2 are as before, and op3 is a commutative A ;}. 9* 2 ■ -■■■ .*: . 78 non-homing operation which does not commute with op2. With trivial changes this system could be extended to S(DB,P,db ,opl,op2, . . . ,opN) , where opl is as before and op2,...,opN are commutative, non-homing operations which do not commute with each other. In S6 we have three classes of operations as compared with two in S5. In general the interaction between opl and the other two will be similar to that in S5 (i.e., we now have 3 tag fields, Gl, G2 and G3, and all checking mentioned for G2 will have to be repeated for G3) . However the interaction between op2 and op3 is important. If the application order at the different sites is not identical, then we require that op2 and op3 are either one-reversible or journal reversible. We have to actually deal with problems of undoing updates that were done in a wrong order. S6 must then be extended to include either a journal or -1 -1 the inverse operations op2 and op3 . (This does not necesarily imply new _ 2 operations. For example, for Iadd we could use Iadd (£,v) = Iadd(£,-v)). 
When an op2 or op3 operation is found to have been performed out of sequence it should be undone, and the correct sequence should be applied. Otherwise the data bases will turn out to be inconsistent. (Repeating example 2 of section III.l, if we have op2 = increment by given amount and op3 = increase by the give percentage of current value, the order is relevant. The operations op2(£,10) an op3(£,10) will give different results if not applied in the same order). There is still an outstanding question, how far to undo? If we are dealing with mini-operations the problem is greatly reduced, since we just have to undo operations that affected the same field as the out-of-order update. With respect to maxi-operations, it is not clear whether that's enough. We again have problems, according to the way the locations to be affected are indicated in the update (implicitly or explicitly) and we refer to section IV. 6. 2 for the corresponding discussion. mwl 79 Here more than in any other system there is a potential for high overhead for all the undoing and redoing that could take place. A feature (delay pipeline) that could be used to minimize the probability of undoing is discussed in section V.2. In most of the remaining aspects (deletions, creations, scheme to apply opl operations, etc.), S6 is similar to S5. IV . 6 . 2 Evaluation of the feasibility of the addition of operations to a non-primary model IV . 6 . 2 . 1 Extension of operations in a non-primary model In a non-primary model we find ourselves in an environment in which different data base managers experience different update application orders (as opposed to Bunch's model.) The tags were designed to enable the data base managers to establish when a violation of the update order has occurred and to take the appropriate steps to resolve this conflict. In Johnson's model this conflict is easily resolved due to the restricted set of operations available. However, an easy resolution is not encountered in the more general systems discussed in the previous section. An initial solution to the conflict could be, upon arrival of an out- of-order update, to undo all updates that were applied in the wrong order. This undoing represents a possibly substantial increase in processing times and costs for updates. In order to minimize this undesired overhead, a pipeline delay will be studied in section V.2. An arriving update would not be immediately applied. Instead it would enter one side of a pipeline and exit through the other side after a delay d. Only updates coming out of the pipeline would be applied, but before application of any update the pipeline would be examined to locate all other updates that should precede it and they would be applied first. The use of a pipeline delay certainly reduces the probability of undoing, at the :■ ■' 2 I 80 expense of higher local application delay. (See section IV. 5.) It should be pointed out that by using a reasonable pipeline delay, we avoid the bulk of the undoing process, but certainly not all. If a network partition occurs or a host or hosts go down holding the only copies of a given update, it is very likely that very old updates will arrive at each data base manager. In those circum- stances a very high undoing cost would have to be paid, given the large number of updates that would probably require this undoing process. 
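The notation just introduced can be made concrete with a short sketch. An update is represented by a predicate for C_k and a routine for "modify F_k as E_k", and D(C_k) is simply the set of records currently satisfying the predicate. The record layout and the green-to-blue example (borrowed from the maxi-operation discussion of section IV.6.1.2) are purely illustrative.

    # An update "for all records satisfying C_k, modify fields F_k as E_k"
    # and its domain D(C_k) at the current instant.  Illustrative only.

    def domain(db, condition):
        """D(C_k): keys of the records for which C_k currently holds."""
        return {key for key, rec in db.items() if condition(rec)}

    def apply_update(db, condition, modify):
        for key in domain(db, condition):
            modify(db[key])

    db = {
        1: {"part": "PA", "color": "green"},
        2: {"part": "PB", "color": "blue"},
    }
    # the maxi-operation discussed earlier: change all green parts to blue
    apply_update(db,
                 lambda rec: rec["color"] == "green",
                 lambda rec: rec.update(color="blue"))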
In view of the above comments, it seems reasonable to study whether it is possible to transform (reevaluate) an incoming out-of-order update u into a set of alternative updates u 1 .in such a way that we will obtain the same results that we would have obtained if the updates had arrived in the right order. We will start with the simplest case: we will assume that we are dealing with two updates u and u 1 that should follow that order. However, we assume that for the data base manager that we are dealing with the arrival and application orders have been inverted; i.e., u. has arrived and has been applied to the data base before u 's arrival. The data base is assumed to consist of a o set of records, each record having a number of fields. Each update u, is of the form "For all the records in the data base which satisfy condition C, modify fields F, as follows EL". The fields of our data that are tested by C, (i.e., k k k. F are variables in the Boolean expression C ) will be denoted by C, . We will refer to the instant at which u.. arrives as t-1, and the instant at which u q arrives at t. (u has already been applied at x.) Finally, the instant at which u has successfully been applied is denoted t+1. (See figure IV.6-a). It is clear that condition C. holds for different records at different k instants. We will then call the set of records for which C is true the domain of C, , denoted at an instant I as D (C, ) . 81 u x APPLIED u, ARRIVES I u ARRIVES I APPLICATION OF u ENDS I ■> TIME T-l T + l Figure IV.6-a A time line describing the order of the events discussed in the text. An example at this point might be helpful. Suppose that the data base consists of a set of personnel records, each record having fields for name, SSN, age, salary, and job class. Suppose update u.. reads "For all records such that job class = typist, increase salary by $1000." In this case, F consists of the single field salary . C, consists of the field job class . D(C ) is the set of all records for which job class = typist. Finally each update u, will be associated with a tag t, (t $10,000, decrease salary by $2000." F F Here F =C = salary , and F DC = salary ^0. We immediately see the problem. If update u. is applied (out of order) to Joe's record, and Joe is a highly paid typist, making $9500, then the subsequent application of u would cause Joe an o unfortunate net loss of $1000. What has happened is that Joe's record shouldn't have been in the domain of u , but if u, has been applied by the time t when u arrives, then D (C ) includes poor Joe (or, more precisely, his record). o F F The assumption F lie =0=F Hr simplifies our initial study but will be F F removed later (case IB). Adding the restriction F He =0=F He. leads us to a ° o o o 1 Johnson's type model, which will be studied as case 1A. Finally we will expand our study to include three or more updates (case 2). Case 1 - Assuming F nc =0=F He The data base can be subdivided into six regions (figure IV.6-b) with the following characteristics: Region 1. This region contains the records which belong to D . (C. ) (=D (C..)) and do not belong to D (C ) or D ,.(C ); i.e., those records which are TO T+l O affected only by u.. . Region 2. This region contains the records which should be affected by u.. but do not belong to D (C. ) ; i.e., the applicability of u is only detected after applying u . This implies that for these records u modifies some ■p fields which affect the fields tested by C. (F He ^0) . I o 1 Region 3. 
This region is basically the inverse of region 2. It contains all the elements which belong to D 1 (C ) but which would be eliminated from the domain of u, if u had been applied first. As before this implies that 1 o F H C ^0. o 1 83 Figure IV.6-b Partition of the data base into six regions (1 to 5 and Rest) . Records in left circle (horizontal stripes) should have been affected by u Q and those in right circle (vertical stripes) by u 1 , if the updates have been applied in the proper order. Figure IV.6-C Situation at instant T. Vertically striped area indicates D _-.(C,). m i f s \ n 9 -••■ ■t : 84 Region 4. In this region we place all records that should be affected by both updates. Thus region 4 contains all records which are in D (C,) and in t 1 D ..(C-), as well as in D , (C ) . That is, the fact that u, has been t+1 1 t-1 o ' 1 applied before u has no effect on the applicability of both updates to these records. But the result of applying the updates may be affected by the inverted order. This is the "classical" conflict area. Region 5. This region is symmetrical to Region 1, but for u . It contains all the records which should be affected only by u and for which the applica- tion of u.. is irrelevant. Region Rest. Records affected neither by u nor by u.. , no matter how they are applied. It is irrelevant in our discussion. We next present algorithm A, which will generate the updates u f .. We do not make any statement about the optimality of this algorithm. This algorithm is to be applied at time t, i.e. upon arrival of u . Algorithm A a. If F He =0 then (Region 3 is empty) generate: 01 = "For all records in the data base which satisfy C PIC. , solve conflict" (covers region 4) else generate: = "For all records in the data base which satisfy 02 b. Generate: C fie. undo u. " (covers region 3 and 4) o I 1 03 = "For all elements in the data base which satisfy C H(C 1 not applied) modify F ". (We apply u to regions 2 and 5 and if u' _ was done then to regions 3 and 4 as well.) 85 c. If F nc.^0 then generate; o 1 04 = "For all records in the data base which satisfy C n(C. not applied) modify F " (we apply u 1 to regions 2 and 4.) Explanation: We have used the condition (C. not applied) a couple of times. This means that for the current state of the record, C. has not yet been applied. F This could be tested by simply choosing any field F* of C and checking whether its related tag is t.. ; i.e., (C not applied) is equivalent to (F*'s tag ± t..). ■p The "conflict solving" of u' is assumed to set the tags of C. for region 4 to read "t ". Similarly, the "undoing" of u' _ must also reset tags from t to 02 >• their previous value. If F He =0 then the application of u can't possibly add or subtract elements from the domain of C , and thus all the elements that must be affected by u only, u.. only, or both of them, are perfectly known. In this circumstance, 01 will handle the "both" case and u' _ the "u only" case ("u- only" has already been taken care of at x) . In u' we vaguely say "solve conflict" because the steps to be taken are a function of the type of operations that we have. In general the conflict could be solved by "undo u.. , apply u , reapply u " . But this might be too extreme for some cases. For example, if u, is implementing a homing operation (e.g. assigning) then the conflict could be solved by ignoring the modifications of fields already touched by u.. . If we have that u and u., are commutative (e.g. integer addition) then u can simply Ol VOO Qt ^ be applied after u. . 
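The "solve conflict" step of u'_01 can thus be summarized as a small dispatch on the operation classes involved. In the sketch below the helper routines (undo, apply_to_record, fields_modified_by, and the class tests) are assumed to exist; they are not defined by any of the models.

    # The "solve conflict" step for a record in region 4: u1 (the later
    # update in the correct order) has already been applied and u0 (the
    # earlier one) has just arrived.  Helper routines are assumed.

    def solve_conflict(record, u0, u1):
        if commutative(u0, u1):
            # e.g. integer addition: the order is irrelevant
            apply_to_record(record, u0)
        elif is_homing(u1):
            # e.g. assignment: u1's values stand; apply u0 only to the
            # fields that u1 did not touch
            apply_to_record(record, u0, skip_fields=fields_modified_by(u1))
        else:
            # the general, and most expensive, case
            undo(record, u1)
            apply_to_record(record, u0)
            apply_to_record(record, u1)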
The above comments are not true for the case when F_1 ∩ C_o^F ≠ 0. Under this condition it might well be possible that u_1 has modified something that affects the applicability of u_o (i.e., region 3 is not empty). Thus all the records in C_o ∩ C_1 will have to be tested to establish whether they belong to region 3 or to region 4, and in general (e.g., for assignments) u_1 will have to be undone in order to establish this. This has motivated the u'_02 update.

u'_03 is intended primarily to cover regions 2 and 5. However, if u'_02 is applied, it will also have an effect on regions 3 and 4.

If F_1 ∩ C_o^F = 0 we are done after step b, since Region 2 = Region 3 = 0. Otherwise we still have to consider the application of u_1 to region 2 and, because of u'_02, to region 4. u'_04 will serve this purpose.

Case 1A. Assuming F_1 ∩ C_o^F = F_1 ∩ C_1^F = F_o ∩ C_o^F = F_o ∩ C_1^F = 0

This turns out to be a logical simplification of case 1. (Note all the conditions involving F_1 ∩ C_o^F in the previous case.) Under these conditions regions 2 and 3 disappear, and figures IV.6-b and IV.6-c get transformed into figures IV.6-d and IV.6-e respectively. Algorithm A simplifies to algorithm A1:

Algorithm A1:

a. Generate:
     u'_01 = "For all elements in the data base which satisfy C_o ∩ C_1, solve conflict" (covers region 4)

b. Generate:
     u'_02 = "For all elements in the data base which satisfy C_o ∩ (not C_1), modify F_o" (We apply u_o to region 5.)

If we further assume that all we have are homing operations (e.g., assignments) we simplify things even more, since "solve conflict" becomes "do nothing" in u'_01; i.e., we discard u'_01. It is interesting to note that Johnson's model satisfies precisely these simplified conditions.

Figure IV.6-d  Equivalent of figure IV.6-b for case 1A.

Figure IV.6-e  Equivalent of figure IV.6-c for case 1A.

Case 1B. Assuming that F_o, as well as F_1, may intersect C_o^F and C_1^F

In general it is somewhat unrealistic to expect that we will always get a pair of updates u_o and u_1 as before (case 1), in which u_o does not modify fields in C_o^F or C_1^F but u_1 does. This is especially true because we haven't established any restrictions on the types and order of the updates. It is much more logical to expect that in general updates either do or do not modify fields which are used to test the applicability of an update. The latter is case 1A. The former will be covered now, but our discussion of case 1 will certainly simplify our explanation.

For this case figures IV.6-b and IV.6-c become figures IV.6-f and IV.6-g. We have three new regions. As was expected, we now have some new considerations; i.e., the application of u_1 before u_o can: a) expand the domain of u_o by changing some fields in some records in such a way that they will satisfy C_o even though they did not before u_1 was applied (region 8), and b) diminish the domain of u_o by changing some records in such a way that they will no longer satisfy C_o. (Region 6 contains all records which should be subject to both updates; region 7 contains all records which should be affected only by update u_o. For these two regions the application of u_1 before u_o has caused the corresponding records to not satisfy C_o at τ.)

As the reader can clearly observe, the situation has gotten fairly complex. In case 1 some work was saved by not having to touch all the records which satisfied C_1 but not C_o (i.e., Region 1). In a Johnson's-type model further simplification exists, since rarely will two consecutive updates conflict. This is because C_k refers to a single element's address; i.e., two updates are in conflict (in C_o ∩ C_1)
if and only if they refer to the same record. Unfortunately nothing similar can be said here. Condition C_o ∩ C_1 is now satisfied by records in regions 1, 6 and 7, and some undoing will in general be required to detect this situation (e.g., for assignments).

Figure IV.6-f  Equivalent of figure IV.6-b for case 1B.

Figure IV.6-g  Equivalent of figure IV.6-c for case 1B.

Case 2. We have been dealing with the case of two updates. We'll now discuss the case of more than two updates.

Suppose that a series of updates u_1, u_2, ..., u_n have been applied to the data base when our conflicting update u_o arrives. There are n! ways in which the n unordered updates could be organized. However, our protocol establishes that the application of an update implies that it is manipulated in such a way that it will be equivalent to having gotten it in the right order. Therefore we can assume that the n updates have arrived and have been applied in the right order (order = u_1, u_2, ..., u_n, with tags obeying t_1 < t_2 < ... < t_n). If only homing operations are involved, any modification that u_o makes to a field which has already been touched by some u_i (t_i > t_o) will clearly be superseded by u_i. We have ended up with a version of Johnson's model.

IV.6.2.2 Summary

We have analyzed the possibility of expanding non-primary models by increasing the types of operations that they can perform. However, we have found that in general, if we allow any kind of operations which modify the fields which other updates use to select their domains, then we are forced to solve all order conflicts by a costly undoing-redoing mechanism. If we eliminate the possibility of updates affecting fields which other updates use to establish their domains, then things get simpler for those records which should be touched by only one update (i.e., we don't need to touch a record to correct the order of an update that does not refer to it). However, some undoing might be necessary for records affected by multiple updates. One exception to the last sentence occurs when we have only homing operations, as in Johnson's model.

In general undoing is not a cheap solution and thus we should consider minimizing its effects. Pipeline delays (see section V.2) are an interesting alternative for minimizing the undoing-redoing business. However, they are not effective protection against the less likely but much more costly cases; i.e., updates that arrive very late and thus produce a lot of undoing and redoing (e.g., when network partitioning occurs, when a host goes down while being the sole owner of an update, etc.).

IV.7 Conclusion

In this chapter we have made an effort to compare the three distributed data base models under different situations (one at a time, in a similar least-common-denominator environment, and in general). Our conclusions are summarized in table IV.7-1. In this table we have used a three-letter ranking scheme. "A" indicates the model which is best with respect to the issue indicated, "B" indicates an average performance, and "C" a relatively bad performance. The last column of this table indicates the sections in this chapter where the particular issues were studied.

Aided by table IV.7-1, we may state our conclusions easily. The two non-primary models behave similarly in general. When the clocks in Johnson's model are guaranteed to be relatively well synchronized, we might be better off with Johnson's model; but if this is not the case we should seriously consider switching to the Reservation Center model or adding some automatic clock synchronization as discussed later (section V.1). The major outcome of analyzing table
IV.7-1 is that there is a drastic difference in the performance of primary vs. non-primary models. Basically, whenever we find an A score for a non-primary model we get a C for a primary model, and vice versa. Non-primary models are faster and less sensitive to failures, but much less powerful and with a high memory consumption. If we were asked to choose the best model, we simply could not give a single-word answer. Our answer would have to be: if your application is so restricted that a non-primary model is useful and you don't care about memory, then use one of the non-primary models. Which of those you should choose would depend on the issues of clock synchronization and sensitivity to failures. Otherwise, use a primary model with a unique update application order. This last choice tends to lose something (throughput, for example), but you end up with a much more general model, applicable to many more environments than the other two models.

Table IV.7-1  Comparison Summary

                                            Johnson's   Reservation   Bunch's   See
                                                          Center                Section
  How well do they synchronize (i.e.,
    maintain the "real" update order)           C            B           A      4,5
  Local application delay                       A            A           C      5
  Non-local application delay                   A            A           C      5
  System's throughput                           A            A           C      3,5
  Sensitivity to single failures                A            B           C      5
  Sensitivity to multiple failures              A            B           C      2,3,4,5
  Memory consumption                            C            C           A      2,3,4
  Operations they support (power)               C            C           A      6

  A = best     B = average     C = worst

Interestingly enough, none of the available literature on the file allocation problem (see section II.2 and appendix A) mentions any synchronization between the copies. This leads us to assume that most researchers are headed for a Johnson-like model, with the corresponding restrictions.

In our next chapter we will cover some of the issues left open in our discussion of the three models. We will present a clock synchronization protocol aimed at removing the "C" for Johnson's model in the first row of table IV.7-1. We will then evaluate the idea of a delay pipeline, which we mentioned so many times in section IV.6. Clock synchronization and a delay pipeline seem to be the only improvements which can be added to the non-primary models at reasonable cost. The last section of the next chapter will discuss the highly important area of reliability (or resiliency). Because of the results summarized in table IV.7-1 and our comments above, our discussion of resiliency will be in the context of primary models exclusively.

Chapter V. SOME EXTENSIONS TO THE MODELS

As an outcome of our discussion in section IV.7 we are going to present three extensions to the distributed data base models. We will start (section V.1) with an extension to Johnson's model in order to solve the clock setting problem. In section V.2 we will discuss the idea of pipeline delays for non-primary models. Finally, section V.3 will be directed to the study of resiliency in a primary model.

V.1 Clock synchronization for Johnson's model

V.1.1 General discussion

To obtain a software clock synchronizer, some detection mechanism for an unsynchronized situation must be available. Fortunately the inherent characteristics of the clocks themselves provide us with such a mechanism. If the clocks are perfectly synchronized, then all the updates we get should have a time-stamp with a relatively old time. That is, if we get an update at time t_G, it should have a time-stamp with time t_T such that t_G > t_T and t_G - t_T = d, d being
Conversely, if we get an update with t_>t_ we know that the clocks are not synchronized. I G Lamport [Ll] presents a method of using the above idea to maintain a synchronized environment. He suggests that, upon arrival of an update, the receiving data base manager should reset its clock to the maximum of x (its own clock) and t +y, where x is the time indicated in the update 's time-stamp and y is the minimum expected delay for the network transmission of the update. If we add to Lamport's scheme the requirement that any data base manager which starts operations should first obtain a message (update, special log-on message, etc.) with a recent time-stamp (from which it can set its clock), we end up with a fairly reasonable solution. 1 ■■■ 5 \ .(* 96 The only problem that we can see in Lamport's scheme is that, even though he starts with a real physical clock, he is slowly drifting away from the real time in the positive direction. A fast clock would cause the complete set of clocks to behave as fast as itself. This drifting is inconsequential as long as every clock drifts similarly, which is likely in a majority of cases. Unfor- tunately, uniform drifting does not happen in every possible situation. When a network partitioning occurs, as mentioned in various parts of chapter IV, we could end up with various subnetworks, each following a different drifting pattern. If partitioning is considered, we find that the idea of moving away from the real physical time is not appealing. Sites in different subpartitions will generally be unsynchronized. The situation will get worse as time advances. Thus, if partitioning is a serious consideration, we should try to keep the clocks as close to the real time as possible. This should cause time differ- ences between subpartitions to be small, even when a new group of sites comes into operation without any previous contact with the network. The major problem with trying to follow the real time as closely as possible arises when we are faced with having to set a clock backward. If we have to set back the clock of a given data base manager, we are in danger of violating the logical ordering of its updates. Setting a clock back should never bring about the possibility of generating an update with a smaller time- stamp than a previous update. This complication of backward resetting is the main reason why Lamport includes only forward resetting in his scheme. If we analyze the situations in which a clock might need to be set back, we find that a majority of the backward resettings should change the clock by only a small amount. Resetting a clock backward by a large amount should occur infrequently, and generally involve only a single clock. Given the above 97 characteristics, we present a scheme which will provide us with clock synchroni- zation while maintaining the clock time reasonably close to the real time. Our scheme is based upon a resynchronization procedure. This proce- dure is expected to set the clocks within a small margin of each other. After this setting is completed, all data base managers enter into what we call normal operation. All data base managers are expected to be in an alert state during normal operation and react immediately to any clock asynchrony by calling upon the resynchronization procedure to restore order. The main feature of our scheme is the "window". We say that two clock values, t. and t^, are within a window w if |t -t_| TIME "Tj+o; Figure V. 
1-a The thick line indicates all the possible values that x„ could take on to be within a window w of V ■> TIME Tj+CU Figure V.l-b The thick line indicates all the possible values that T_ could take on to be within an open window w of T 1* 99 V.1.2 An algorithm for synchronizing the clocks As was mentioned before, the goal of the resynchronization procedure is to put all the clocks within a small margin (optimally zero) of each other. Operationally, the goal is to synchronize all the clocks to be within the small window of the synchronizing data base manager. The procedure proposed to obtain this goal follows (see figure V.l-c): 1. The synchronizing manager will send a "what time is it?" message to all the data base managers and note the time at which the message was sent. (See figure V.l-c, i.) 2. Every data base manager will answer this message as quickly as possi- ble with a "my time is xxx" message. The synchronizing manager will note the time at which these messages return. (See figure V.l-c, ii.) 3. The synchronizing manager will estimate the actual time of another manager by assuming that the network delay D for the round trip (both messages) was spent equally in both directions; i.e. that the differ- ence between the clocks is the difference between the synchronizer clock at the time when the "my time is xxx" message arrived and the estimated clock value of the second data base manager obtained by adding the value in this message (the xxx) to D/2. 4. Once the clock differences for all other working data base managers are established the median value M is obtained. In case there are an even number of clocks, its own clock difference (zero) could be used to obtain a single median. (Thus the smaller of the two middle values is chosen.) 5. All computed differences are checked to see if they are inside the small window of M. If this is the case, the clock of the synchronizer is adjusted by M and the procedure terminates. r • 3 m .■■■> 2 100 i) The synchronizer (DBM) sends a "what time is it?" message to all other managers (step 1). ii) All managers eventually return the "my time is xxx" message (step 2). iii) After proper evaluation of the clocks (steps 3-6) the synchronizer sends "adjust" messages to the managers which require adjustment (step 7). Figure V. 1-c Illustration of the steps taken by the resynchronization procedure 101 If not all the values are inside the small window, the values outside are evaluated to determine whether their discrepancy could be due to an unreasonably long network delay or is actually caused by clock discrepancy. To assist us with this evaluation, a table of expected network delays between the data base managers could be precomputed. New estimates of the clock differences can be obtained by assuming that the network delay (on either the outward or return trip) might have been the expected one. If such an adjustment could put the desired difference inside the small window, one or two additional "what time is it" messages could be sent and step 5 repeated. (For example, suppose that S is 3 seconds, D is 4 seconds, the expected one-way w delay is 1 second, and the difference between the stated clock times is 5.5 seconds, the remote site being behind. The synchronizer com- putes an estimated time difference of 5.5-2=3.5 seconds, which is outside the window. But from the expected delay (1 second) , it notes that the round trip was too slow and there might have been an extra delay in the return. 
It therefore tries once or twice more to see if D can be brought down to a more reasonable figure. A more sophisti- cated, statistical approach might be desirable.) If the evaluation of step 6 shows that there is no doubt that there exists a significant asynchrony, or after several tries of the "what time is it?" message we still find that the small window criterion is violated, then an asynchrony is declared. An "adjust" message is then sent to the violating clock(s) . The "adjust" message includes a signed value which must be added to the clock to bring it to the median value. A data base manager might have to send an adjust message to itself. (See figure V.l-c,iii.) C 3 S c \ ■i 1 102 After the above steps are concluded, the synchronizing manager becomes an ordinary, working, data base manager and starts acting as such. In step 4 the median was used instead of the mean value because this measure is less sensitive to errors in the settings of a few clocks. For example, if out of 60 clocks all but one have the same correct time and the remaining one is wrong by 20 hours, the mean value would indicate that the estimated time should be about 20 minutes away from the right time. Moreover, all applications of step 6 that successfully make the estimated value of the clock lie inside the small window after a second or third try would surely affect the mean value, but probably would not modify the median (unless the network delays have a large variance, a possibility that requires some study but seems unlikely) . More than one data base manager might decide to resynchronize simul- taneously. If we let them all succeed, a double (or multiple) adjust message could be sent to a given data base manager in order to correct the same dis- crepancy. This would have obvious undesired consequences. To solve this problem, we could allow an additional valid answer to the "what time is it?" message. If manager DBM- , after undertaking the resynchronizing process, gets a "what time is it?" message (from manager DBM ) while in steps 1 to 3, it will decide whether manager DBM or DBM~ has priority according to a pre-established rule of precedence. If DBM has precedence it will respond with a "cool it" message instead of the normal (for working data base managers) "my time is xxx" message. Manager DBM could also take note of DBM ' s i.d. and notify him when he is done with a "your turn" message. (This is clearly optional.) If, on the other hand, DBM 9 has precedence, then the answer should be a "waiting for you" message and all synchronizing activity should be suspended until a "your turn" message arrives (or possibly a given time-out expires) . At this moment the resynchronization procedure will be restarted in step 1. 103 If DBM. is already past step 3 (steps 4, 5 or 6 could be used Instead, but using step 3 guarantees less time lost), then it has precedence and should send a "cool it" message. Note that only one manager can be in this stage, since step 2 involves a delay until all answers to the "what time is it?" message return, and a "cool it" message inhibits any further activity. Step 7 includes the notion of an "adjust" message. Some care should be given to its implementation in light of the problem, discussed before, of setting back clocks. Section V.1.3 will discuss this matter further. V.1.3 General remarks The overall behavior of our scheme is then as follows. Whenever a data base manager enters the distributed data base system it will call the resynchronization procedure to set its clock. 
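A minimal sketch of the estimation performed in steps 3 through 5 of that procedure may help fix the ideas. The message exchange itself is abstracted away: we assume the synchronizer already holds, for every other manager, the local send time of the "what time is it?" message, the clock value returned, and the local arrival time of the answer. All names and numbers are hypothetical, and the retry logic of step 6 is omitted.

    # Sketch of steps 3-5 of the resynchronization procedure: estimate every
    # other manager's clock difference from a round trip, take the median, and
    # flag the clocks that fall outside the small window of that median.
    # Hypothetical layout: one (send_time, remote_clock, recv_time) per manager.

    def clock_differences(round_trips):
        """Estimate (other clock - our clock), assuming the round-trip delay D
        was spent equally in both directions."""
        diffs = {}
        for manager, (sent, remote_clock, received) in round_trips.items():
            d = received - sent                        # total round-trip delay D
            estimated_remote = remote_clock + d / 2.0  # remote clock at 'received'
            diffs[manager] = estimated_remote - received
        return diffs

    def resynchronize(round_trips, small_window, own_id="synchronizer"):
        diffs = clock_differences(round_trips)
        diffs[own_id] = 0.0                            # our own difference is zero
        ordered = sorted(diffs.values())
        median = ordered[(len(ordered) - 1) // 2]      # smaller middle value if even
        adjustments = {}
        for manager, diff in diffs.items():
            if abs(diff - median) > small_window:
                # step 7: an "adjust" message carrying the signed correction
                adjustments[manager] = median - diff
        return median, adjustments

    # Example: three managers answer; clock values are given in seconds.
    round_trips = {
        "DBM1": (100.0, 100.9, 102.0),   # roughly in step with us
        "DBM2": (100.0, 106.2, 102.2),   # about 5 seconds fast
        "DBM3": (100.0, 100.2, 101.8),
    }
    median, adjustments = resynchronize(round_trips, small_window=3.0)
    print(median, adjustments)           # only DBM2 receives an adjust message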
Once its clock is set it begins normal operating activity. If during normal activity a data base manager detects a violation of its big window it will reject the corresponding update. This rejection will cause the violating manager to perform a resynchronization. As an outcome of a resynchronization, various adjust messages could be issued. A data base receiving an adjust message is expected to modify its clock by the quantity indicated. If the adjustment is in the positive direction, then the manager should do it without any hesitation. However, if the adjust- ment is negative much care should be taken not to violate the local update ordering. As we mentioned before, clocks would probably only be set back by small amounts of time. In this case the managers could enter an alternative temporary clocking mechanism. This mechanism consists of setting the clock to *- if ■- 2 m 9 "■■ Later in this section we will introduce a condition to determine when a data base manager enters the distributed system. 104 the value of the time-stamp of the last update generated and then stopping the clock. The clock is not advanced until the adjustment has been honored (i.e., the time as determined by the synchronizer catches up with the setting) or a new local update arrives. In the latter case the clock will be advanced one time-instant to provide a new value for the update' s time-stamp. Care should 2 be taken that the update arrival rate is less than one per time-instant. This ensures that we will eventually end this temporary adjustment status. Furthermore, the manager should apply these updates locally but should not broadcast them until the adjustment is completely honored. This is to prevent the generation of multiple reject messages. If the adjust message indicates that a data base manager should set his clock back by a large amount, the chances are that no other manager has accepted any of its recent updates. In this case the manager could reassign all the local tags very easily (without violating local ordering) . By so doing it will actually synchronize them with respect to the updates of the other managers It is also possible that a large setback is needed and more than one manager needs such a setback. In this circumstance, we must either recall the updates that were sent to the other out-of-step managers or wait a long period, as suggested for small setbacks. (There may be other equally unpleasant solu- tions.) Neither of these alternatives seems attractive. The likelihood of having a large setback for multiple managers is small. (We will show how to make it even smaller in the next paragraph.) We will then not feel guilty in suggesting that the long waiting period should be adopted. Note that this is very likely the time-stamp that the update would eventually get if we stopped activity until the clock catches up. 2 Note that if updates arrive faster than the clock ticks, we are in trouble anyway, since there is not a unique time-stamp to put on each update. 105 As mentioned before, it is not very likely that there will be several clocks with a similar wrong setting. However, one single wrong setting could possibly propagate to the full distributed system. For example, suppose that we start with a data base manager with a wrong clock and add managers in such a way that the wrong clock setting always prevails in the resynchronization procedure. Under these circumstances it might be advisable to ask for a quorum before distributed operations are initiated. 
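The temporary clocking mechanism just described can be sketched as follows (hypothetical names; the rule that updates issued during the freeze are applied locally but not broadcast is left out). The clock freezes at the last issued time-stamp when a negative adjustment would otherwise violate the local update order, advances one "time instant" per new local update, and resumes once the adjusted real time has caught up.

    # Sketch of the temporary clocking mechanism used to honor a small negative
    # adjustment without violating the local update order.  'real_clock' stands
    # for the (hypothetical) hardware clock reading; all names are illustrative.

    class AdjustableClock:
        def __init__(self):
            self.offset = 0.0            # accumulated adjustments
            self.last_stamp = None       # time-stamp of the last local update
            self.frozen = False

        def adjusted(self, real_clock):
            return real_clock + self.offset

        def adjust(self, delta, real_clock):
            self.offset += delta
            if delta < 0 and self.last_stamp is not None \
                    and self.adjusted(real_clock) < self.last_stamp:
                self.frozen = True       # stop the clock at the last time-stamp

        def stamp(self, real_clock):
            """Produce a time-stamp for a new local update."""
            now = self.adjusted(real_clock)
            if self.frozen:
                if now > self.last_stamp:
                    self.frozen = False  # the adjustment has been honored
                else:
                    now = self.last_stamp + 1e-3   # advance one "time instant"
            if self.last_stamp is not None:
                now = max(now, self.last_stamp + 1e-3)
            self.last_stamp = now
            return now

    clock = AdjustableClock()
    print(clock.stamp(100.0))            # 100.0
    clock.adjust(-5.0, 100.5)            # told to set the clock back by 5 seconds
    print(clock.stamp(101.0))            # 100.001: frozen, advanced one instant
    print(clock.stamp(106.0))            # 101.0: the adjusted time has caught up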
Thus, a data base manager that comes up and finds that this quorum is not met will simply start working for itself, without broadcasting any updates. As soon as the quorum is met, it could take part in resynchronization and enter into normal operation. V.1.4 Summary In this section we have tried to present a solution to the clock syncrhonization problem in Johnson's model (sections IV. 2 and IV. 7). Unfor- tunately, trying to synchronize the clocks gets rid of the property of complete clock independence which caused the nice behavior of Johnson's model during partitions (sections IV. 5 and IV. 7). Clock synchronization could be obtained by Lamport's scheme [LI]. And in environments where partitioning is not a problem this is an appealing solution, One of the disadvantages of Lamport's scheme is that it is susceptible to clock drifting in the positive direction. This susceptibility makes it unattractive if partitioning can occur. We introduced a second scheme. This scheme should solve the drifting problem, but the problems of setting a clock back had to be addressed. We had to pay the price of a considerably longer protocol. ; at * ■ i " At this moment, detection of a single wrong clock setting would be very easy. Recovery could be accomplished by reassigning time-stamps as described earlier. 106 In general we feel that the clock synchronization problem is solvable. The two schemes presented are good examples of how it can be solved. V.2 Delay pipeline In various places in chapter IV we mentioned a delay pipeline. By a delay pipeline, we mean that as updates arrive they are introduced at one side of a pipeline. After a pre-established delay D they go out through the other side and are ready to be applied. When an update u is ready to be applied, the delay pipeline is checked to see if there exist other updates with tags smaller than that of u. If there are, we take those updates out of the pipeline and apply them in the appropriate order, following them by the application of u. We are concerned with the probability of having to undo an update (for different values of D) , if a precise application order is required. This measure is very relevant for Johnson's and the Reservation Center models for a system such as S6 (section IV-6) , for which there is a potential need for undoing. We will start by analyzing delay pipelines for Johnson's model. In Johnson's model there are two sources that could cause the updates to get out of order: clock asynchrony and network delays. For our discussion we will assume that the clock-time has a normal distribution (which seems reason- able) with mean c and standard deviation a . However, we do not know enough abou c network delay. For the ARPA network Nay lor et al. [Nl] and other researchers (Kleinrock [Kl]) give delays only as a network mean value or as mean values for different numbers of hops. The means are around 50 milliseconds for 1 hop and 800 for thirteen. In order to simplify the approach (since we are only concerned with gaining some insight into the problem) , we will assume (obviously wrongly) that the network delay behaves like a normal distribution with mean n and stan- dard deviation a,. The parameter n, will be given values 10, 100 and 1000 milli- d a seconds, with the hope that the actual behavior lies somewhere near the results 107 obtained. There is a problem with estimating a,, since there is no data. Our d solution will be to assume a large spread. 
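Before carrying this estimate through, it may help to make the delay-pipeline mechanism itself concrete. The sketch below (hypothetical names; a heap-ordered buffer is only one possible realization) holds every arriving update for D seconds and, when an update's delay expires, first releases any buffered update with a smaller tag, as described at the beginning of this section.

    import heapq

    # Sketch of a delay pipeline: every incoming update is held for D seconds;
    # when an update becomes ready, any still-buffered update with a smaller
    # tag is released first, so that application proceeds in tag order.

    class DelayPipeline:
        def __init__(self, delay):
            self.delay = delay
            self.heap = []                      # (tag, arrival_time, update)

        def insert(self, tag, arrival_time, update):
            heapq.heappush(self.heap, (tag, arrival_time, update))

        def ready(self, now):
            """Return, in tag order, every update whose delay has expired,
            together with any smaller-tagged update still in the pipeline."""
            expired = [e for e in self.heap if now - e[1] >= self.delay]
            if not expired:
                return []
            cutoff = max(tag for tag, _, _ in expired)
            out = []
            while self.heap and self.heap[0][0] <= cutoff:
                out.append(heapq.heappop(self.heap))
            return out

    pipe = DelayPipeline(delay=6.0)
    pipe.insert(tag=12, arrival_time=0.0, update="u12")
    pipe.insert(tag=10, arrival_time=1.0, update="u10")   # older tag, arrived later
    print([u for _, _, u in pipe.ready(now=6.0)])          # ['u10', 'u12']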
The large spread will be chosen with care, since we do not want the results to be invalidated by the effect of including meaningless negative delays. We will then make n_d - 5σ_d correspond to zero delay.

Let's now take a data base manager and assume that his clock is set to time τ_1. We want to evaluate the probability that at time τ_1+D all updates carrying a tag smaller than τ_1 have arrived. This is the same as the probability P(τ_1,D) that all updates generated up through local time τ_1 have arrived at the given data base manager by time τ_1+D. The distribution of arrival times of updates with tags ≤ τ_1 is normal with mean n_d + τ_1 and standard deviation sqrt(σ_d² + σ_c²) [H1].

In figure V.2-a we show the upper half (P ≥ .5) of the P(τ_1,D) curve for various network delay parameters. As expected, the probability of our data base manager's having received all updates with tag smaller than a given reference time a_1 increases with time. As shown in the graph, the probability of getting all the updates with tag smaller than a_1 is .5 at time a_1 + n_d and practically 1 at time a_1 + 6 seconds (for the 3 graphs of figure V.2-a). This implies that the inclusion of a 6-second delay pipeline will very nearly solve the problem. Equivalent statements with smaller D could be obtained from the graph.

Figure V.2-a  Probability of getting all updates in order when a delay pipeline with delay D is used, for network delay distributed N(a, b) with (a, b) = (.01, .002), (.1, .02) and (1, .2); a and b are given in seconds, and N(a, b) denotes a normal distribution with mean a and standard deviation b.

Obtaining results for the Reservation Center model is a very complex problem. In the description of this model (section III.4), we pointed out that the sets of tickets distributed by the reservation center could be any collection of instructions or numbers. Actually, we also pointed out that in some sense Johnson's model is a particular case of the Reservation Center model (with m = ∞). It is thus very hard, if not impossible, to say anything about such an undefined situation.

If we restrict ourselves to situations in which we allow only numbers in the sets of tickets issued by the reservation center, then the situation is
maintains this advantage, and the reservation center keeps on sending sets of tickets every time period of length m (a very unlikely situation because the "exhaust" mes- sages should speed up its activity), then it should take on the order of (n+3)m seconds (including a few network delays which were assumed relatively small) for DBM. to be sure that it won't get any relatively old update (according to its tag). In summary, a value D on the order of (n+3)m will seem to sufficient to ensure that we will rarely have to undo an update. It has been our intention to discuss the feasibility of using delay pipelines to minimize the probability of having to undo updates in such systems as S6 (section IV. 6). Throughout our discussion we have assumed that all sites are operational all the time (i.e. no failures). Under these circumstances it seems reasonable that a delay pipeline will prove a significant advantage. However, this advantage is at the expense of giving up the A rank in table IV. 7-1 ■ 3 ■■• * : 9 5 ) 110 for the local and non-local application delay. Furthermore, delay pipelines with reasonable D's cannot offer protection for situations in which failures occur (i.e., reality). A computer system that fails while it is the sole holder of an update and then comes up later and tries to broadcast this update will be a real headache because, even if this seldom happens, it would require a tre- mendous amount of undoing. A similar but more severe situation would arise when network partition occurs. V.3 Resiliency V.3.1 Introduction As we pointed out at the beginning, our major goal was to obtain some guidelines for the design of a workable distributed data base system. We have studied the three available models and come out with the conclusion that non- primary models lack power (section IV. 6). Increasing their power is rather difficult if not prohibitively expensive. On the other hand the relative independence of the data base managers (especially Johnson's, see section III. 2, IV. 2 and IV. 5) makes the reliability of these models easily obtainable. (Basi- cally it suffices to provide a straightforward protocol to prevent lost mes- sages.) In general we would expect a majority of applications to be either willing or forced to use the much more powerful approach of a primary model, such as Bunch's. We are thus clearly concerned with upgrading this model to a workable state by eliminating the problems presented in section IV. 3. In summary, we still have to present the guidelines of a workable model. In order to satisfy our goal, we will do so by presenting some models based on the primary-backups ideas introduced by Bunch. Since the system is the sole owner of the update, reassigning the tag might be the right thing to do. Ill In an effort parallel to this thesis we have participated in work on multi-copy resiliency techniques. Most of the concepts that we are going to present here were already presented in a previous report [AA]. In any network it is always possible to have a very high number of simultaneous failures. Once we accept this simple fact, it is clear that we are incapable of designing an error-free system. Any detection mechanism that we know of could be overwhelmed by the proper sequence of errors. Thus, given that perfection is unattainable, we generally settle for minimizing the proba- bility of an undesired failure within the constraints imposed by our budget and other requirements. We thus introduce the concept of n-host resiliency . 
By n-host resiliency we mean that at least n-hosts must be aware of an update before it can be either acknowledged to the user or applied at any local data base. Thus, in order for an update loss to threaten the distributed system's reliability, it is necessary that n or more hosts fail "almost simul- taneously." It should be noticed that after one or more (but less than n) host failures occur, the distributed system will keep on working (if n or more hosts are still available) and continue the transmission of any given update until it reaches all available hosts. After a single or multiple host failure occurs, we could have a situation in which fewer than n hosts have a message. However, continued transmission of the update to further backups will eventually cause n or more working hosts to have the update. At this moment the "almost simul- taneous" period ends, and the system would be capable (if enough hosts are available) of recuperating from another set of less than n "almost simultaneous" host failures. The n-host resilient models that we are going to present are able to solve most of the problems discussed for Bunch's model (e.g., missing updates; see sections III. 3 and IV. 3); but the network partitioning problem is still with us. i ' 'I • 3 <« 5 ■■■ i,i* " 112 Network partitioning is mainly a function of the network topology. Requiring all pairs of sites to have a communication channel between them (i.e., a fully connected network) would increase the reliability to such a point that we could probably ignore the partitioning problem. This is an extreme and probably very expensive solution. In general we cannot allow the subnetworks to maintain their normal activity, because the data base copies can easily become irreconcilable. For a majority of applications we could probably get along by providing a degraded service. An example of such service is our palliative solution of section IV. 4 in which only the subnetwork with a majority (possibly weighted) of the sites keeps on working and all others restrict themselves only to queries. In the following discussion we will assume that the partitioning problem has been solved (e.g., by full connection, our palliative solution, or some other technique) and restrict ourselves to the other aspects of the resiliency problem. Two models will be discussed next. We will first (section V.3.2) present a broadcast model, which we have developed, in some detail. Then (section V.3.3) we will briefly present the main features of a chained model, developed by Alsberg and Day. Both are discussed in [A4] but some minor changes have been made in the broadcasting model. Finally, in section V.3.4 we will discuss the advantages, with respect to reliability, of such models. The main difference between the two models that we are going to describe is the way the updates are sent to the backup copies. In the broadcast model the primary will send the updates to all the backups simultaneously, while in the chained model the primary sends its updates to only one backup, which in In such a network at least m-1 (where m is the number of sites) simultaneous failures are required for a partition to occur. a 113 turn sends it to the next one, and so on. This one-at-a-time transmission always follows a pre-established order. V.3.2 The broadcast model To simplify our discussion we will describe 2-host resiliency unless otherwise noted. We should point out that generalizing to a higher-order resiliency scheme is expected to be a straightforward matter. 
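The criterion itself can be stated in a few lines of code (the bookkeeping below is purely illustrative and independent of how the acknowledgments actually travel): an update may be acknowledged to the user, or applied to any local copy, only once at least n server hosts are known to have journalized it.

    # Sketch of the n-host resiliency criterion: an update may be acknowledged
    # to the user, or applied to any local copy, only after at least n server
    # hosts are known to hold it.  The class and names are hypothetical.

    class ResilientUpdate:
        def __init__(self, update_id, n):
            self.update_id = update_id
            self.n = n
            self.holders = set()

        def journalized_at(self, host):
            """Record that 'host' has seen (journalized) the update."""
            self.holders.add(host)

        def may_apply(self):
            return len(self.holders) >= self.n

    u = ResilientUpdate("update-17", n=2)
    u.journalized_at("primary")
    print(u.may_apply())        # False: only one host has the update so far
    u.journalized_at("backup-1")
    print(u.may_apply())        # True: 2-host resiliency has been met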
In the broadcast model we assume that we divide the network hosts into three groups: users, primary and backups. For simplicity we will assume that these groups are disjoint. Relaxing this assumption and allowing the primary or the backups to be users is a straightforward matter, which in some ways allows some simplifications. The primary and backups together form the set of "server hosts." A user host is a site which is capable of receiving queries and updates from the external world. The primary and backups are the only hosts with a copy of the data base. An explicit linear ordering of these hosts is assumed, with the primary as the first node and the backups following any arbitrary pre-established order. Communication with the user host takes place as follows. The user host sends its updates or queries to any server host. The receiving host will try to answer all queries itself. However, all updates will be sent to the primary. In some cases, the site receiving the request from the user is the primary. In any case, the primary ends up with the update. Then a second stage of the update is initiated: its actual broadcasting to all data base copies. Update broadcasting starts with the primary assigning a sequence number to the update and journalizing it locally. The primary then will broad- cast the update (with the sequence number attached) to all existing backups. w I II* ■ J ■■) ■■■• m ■a .1 ■' 114 (See figure V.3-a-i.) When a backup receives an update it will verify that the update 's number is sequentially correct before considering it for local applica- tion. That is, the backup checks for missing updates. Upon reception of an update, the first backup in the linear order is expected to send an acknowledgment (which includes the update' s number) to the host that follows it in the given linear order. (See figure V.3-a-ii.) All other backups wait passively until the acknowledgment gets to them. The acknow- ledgment includes a counter, the resiliency counter. This counter is going to reflect the number of hosts which are known to have seen the update. Thus, the initial value of the resiliency counter is always 2; i.e., the primary and the first backup are known to have seen the update. At the time that any backup gets an update acknowledgment it will simply pass the acknowledgment to the next backup in the linear order after adding 1 to the resiliency counter. The last backup sends an acknowledgment to the primary. (See figure V.3-a-iii.) This last acknowledgment indicates that the update has been received at all available backups and the second stage of the update application is complete. The third and last stage of the update is its actual application at all data base copies. Given the requirements of 2-host resiliency, no host should apply any update until at least 2 hosts are aware of the update. We thus have to wait until the first 2 hosts in the linear list have the update before actual application can be started. When a backup finds that the resiliency counter has a value of 2 after it has set or incremented it as required by the second stage, it will acknowledge to the user (user ACK) and simultaneously to all the preceding hosts (application ACK) the satisfaction of the 2-host resil- iency criterion. (See figure V.3-a-ii.) The primary or any backup host which finds itself with an insufficiently small resiliency counter will wait until the application ACK arrives before starting local application. (For 2-host 115 update 1: The primary broadcasts the update to all the backups. 
update i backup x J Jbackup 2 J applic \ J V y fbackupj ii: The first backup starts propagating the acknowledgment down the list of backups, and also sends the application and user ACK. ( user ] V host J iii: The last backup then acknowledges to the primary that everyone has received the update. Figure V.3-a The Broadcast Model for 2-Host Resiliency •■• m I 1 m .1 \ „•* ■i '.« ■-i 9 , i 0* ' .*> •' 116 resiliency, only the primary can find itself in this situation.) All other hosts will start local application as they receive the second stage's acknowledgment . Up to now we have considered normal (untroubled) operation. Real life requires that we allow for abnormal situations. Complications arise when (i) a message gets lost, or (ii) a host is unavailable. i) Lost messages . There are various points in our protocol where a lost message can be detected, such as: When an acknowledgment for an unknown message arrives or when message h+1 arrives while message h hasn't. The following protocol should handle these anomalies. Upon arrival of message h at a backup manager B, B will first check whether all prior messages have successfully been received. If this is not the case, then B may assume that a message has been lost and ask for the retrans- mission of missing messages (probably by the primary) . In the meantime update h is stored in a "waitlist" until the problem can be cleared up. If, on the other hand, all prior messages have been received, then we proceed with the applica- tion of update h (as described before) and we are ready for the acknowledgment propagation. If B is the first backup in the linear list, then it initiates this process by sending the acknowledgment shown in figure V.3-a-ii. Otherwise a waiting period (controlled by a timeout (TIME0UT1)) is started. When acknowledgment h reaches B, B first checks to see if it corre- sponds to a known, processed message. If so, B simply adds its own acknowledg- ment by adding one to the resiliency counter and propagating the acknowledgment to the next host down in the linear list. However, if it represents an acknow- ledgment for an unknown message, B can assume that message h is lost and ask for the retransmission of that message (from the primary or probably from any of the preceding backups in the linear list if the protocol is such that they have it !» 117 journalized) . In this case the propagation of the acknowledgment should he suspended, but a flag (called the f irst-to-acknowledge flag) should be set to indicate to B that it should restart the propagation of the acknolwedgment as soon as message h arrives. A similar action is taken when an acknowledgment for a known but unprocessed message arrives (i.e. one waiting in the waitlist for the intermediate updates) . In this case we will also set the f irst-to- acknowledge flag and suspend the propagation of the acknowledgment until the update can be extracted from the waitlist. Multiple failures could occur; for example, a message asking for the retransmission of a lost message could be lost. Therefore, a pair of timeouts are also included. Whenever the retransmission of an update is solicited, TIME0UT2 is set. Most likely the action that should be taken when TIME0UT2 expires is to repeat the appeal for retransmission. As mentioned before, TIME0UT1 is set to wait for the acknowledgment of an accepted (correct sequence) message. All but the first backup in the linear order will have a similar TIME0UT1 mechanism. 
However, the length of the timout might change, increasing as we move down the list. When a TIME0UT1 expires, a search for the acknowledgment will be started by sending an "ACK SEARCH" message to the preceding backup in the linear order. Reception of an "ACK SEARCH" message for an unknown update will also help to detect lost messages. Reception of an ACK SEARCH for the known message will produce one of the following responses: 1) the retransmission of the acknowledgment (if the acknowledment was sent and lost in transmission) , 2) propagation of the "ACK SEARCH" message in the direction contrary to that of the propagation of acknowledgments, or 3) no action because the corresponding message is either in the waitlist or (was lost and) is already going to be retransmitted. a I , it* ■A ■ -r. 118 A similar time out mechanism (TIME0UT3) will be useful to detect lost application acknowledgments. ii) Unavailable hosts . Partitioning problems were discussed in section V.3.1; they will not be considered here. We will concentrate first on how the system recovers from a single backup failure. Primary failure and multiple failures will be discussed later. When a backup copy fails, the primary should be notified. In many cases the primary itself will be able to detect such a failure, when it is unable to communicate an update to that backup. (See figure V.3-b-i.) Once the primary has such information it will send a special update, a failure request, using the same protocol used for any other update. When each backup gets this update, it will modify its linearly ordered list. This will cause the two neighbors of the failed backup to bypass it in any future communication. (See figures V.3-b-ii and V.3-b-iii.) As soon as the primary gets the acknowledgment of the failure request message the recovery from the failure will be complete. When the primary fails, the first available backup in the linear list will be elected to replace it. The new primary must first find out whether it can initiate normal operation. For example, it might have to check whether a majority of the backups is up, whether there are n backups up, etc. If every- thing is ready for normal operation, the new primary then must find out what the previous primary was doing. If n-host resiliency has been met, there is always a backup that has seen any udpate received by the primary and acknowledged to the user. Thus, either the backup will retransmit the update to the new primary, or the user will get tired (timeout) and retransmit it again. In any case, consistency is maintained and the problem solved. Multiple failures are treated similarly to single failures unless n or more failures occur in an n-host resilient model and the primary is among the update ACK 119 i. A backup Host fails failure ACK ACK ii. A failure message is transmitted. update ACK ACK iii. Communication continues after all communication to failed host is eliminated. i :■• ■a -■£ Figure V.3-b. Steps taken to recover from backup host failure 120 systems that failed. It is possible to detect such a massive failure, and a most likely solution is that everybody should stop any update activity until a sufficient number of the failed systems or the failed primary comes up again. V.3.3 Other models Many other variations of Bunch's model (or the Broadcast model) are possible. However, all major points have been illustrated in the previous section (V.3.2). Next we will briefly present the chained model [AA] to show one of such possible variations. 
In the chained model the interaction with the user is similar to the action in the broadcast model. The difference is that instead of broadcasting the update it is sent only to the first backup in the linearly ordered list. Any backup that gets an update will acknowledge its reception to the sender (backup ACK) and then retransmit it to the following backup (if there exists one) . Upon reception of the acknowledgment (backup ACK) from a backup (or if no following backup exists) the backup will transmit a second acknowledgment (Backup Forward ACK (BF ACK)) to its predecessor. A summary of this message flow is presented in figure V.3-c (similar to figure 3 in [A4]). V.3.4 Conclusion With the introduction of n-host resiliency we have clarified the con- cept of resiliency during multiple failures. We have exchanged the impossible goal of perfect resiliency for the much more reasonable goal of n-host resiliency. However, by requiring that n hosts must cooperate on any action, we open up the possibility that the service may be down for long periods of time, simply because fewer than n server hosts are available. We will next look at this problem for the case n=2. The analysis is largely taken from [A4] . 121 •W X 0) 4J . 0) Q) -C E 4-J 0) XI c CJ -H n T3 T3 0) 0> C c o •H «H CO 4-) 4= C U 0> E d) A U> 4-> C_> < u U 1 o -a CO U-i > n O CO 01 -H S >-l <4-l >-l 3 o M , u en co l-i E oi E u-i 3 QJ C/J l-l « u <3 Pn PQ c ! 122 Suppose there are N server hosts altogether. Let A be the avail- ability of any one data base manager. (See section II. 3.) Furthermore, assume that failures are independent. The probability that none of the N data base managers is up is P = (1 - A) N . o The probability that only one is up is N-l P 1 = NA(1 - A) . Therefore the probability (P) that we can't meet 2-host resiliency is P = P q + P = (1 - A) N_1 (1 - A + NA). We can now define the service down-time per day as P times the proper time constant (i.e., 24 hrs., 1440 min. , etc.). The availability A has a more intuitive meaning when presented as down-time per day; i.e., A = (24 - host down-time) /24 where host down-time is given in hours. Using these more intu- itive terms we obtain the results plotted in figure V.3-d (figure 7 in [A4]). More specific results for an average down time of two hours (A = .91667) and one hour (A = .95833) are shown in tables V.3-1 (table 1 in [A4]) and V.3-2 (table 2 in [A4]) respectively. From these results it is clear that, if N is somewhat larger than n, we can safely use an n-host resiliency criterion. After the presentation of the resiliency models it seems that a study similar to the one in sections IV. 5 and IV. 6 is in order. If we begin such a study for the broadcasting model we would realize that, with respect to the topics studied, it is almost equivalent to Bunch's model. The main difference between the broadcasting and Bunch's models is in the area of acknowledgment As in section II. 3 availability is defined as A = up-time/total-time = up- time/ (up-time + down-time) . 123 8 hrs 1 hr >» o o u 8 min © a a> E h c o 3 o fc 1 min c u J) 1) 8 sec 1 sec t4J_2__ *^ ^x / / 7 / 12 3 4 5 Average Down Time per Host (hours/day) Figure V.3-d. Service Down-Time vs. Host Down Time and Number of Hosts - 1 •' ' * 1 m ' ' 3" ■» 3 128 A given site j (which is a source of updates) will contribute positively to AC if Z d' > Z d' and negatively if Z d' < Z d' . kel-{l} Jk kel-{l} ik kel-{l} Jk kel-{l} lk Both cases are possible. 
In table VI-1 we present C_1 and C_2 for the five-node example that Casey introduced in [C1]. (See section II.2 for the data and some discussion about Casey's example.) In the table we have five alternatives for C_2, according to which site is chosen to be the primary. The optimum for three of the five primary alternatives is more expensive than the optimum for C_1.

In appendix D we actually prove that the theorems that Casey presents [C1] are valid for the primary model presented here. The only nuisance is that the primary location must be preselected. Future work could be directed toward finding ways to reduce the number of primary candidates. However, given that the problem grows exponentially (like 2^n), the extra factor of n needed to try each site in turn as the primary is not precisely the first factor that we should try to reduce.

Table VI-1  Comparison between a primary and a non-primary model for the 5-node example in [C1]. Entries are costs: C_1(I) for the non-primary model and C_2(I) for the primary model. The column headings "p=i" indicate that the primary is at site i. An asterisk marks the optimum for each model.

            Non-primary               Primary Model (Bunch)
  I         Model (Johnson)     p=1     p=2     p=3     p=4     p=5
  1              960            960
  2              972                    972
  3             1030                           1030
  4              918                                    918
  5              915                                            915
  12             852            810     822
  13             774            876             882
  14             726            807                     765
  15             867            882                             837
  23             856                    822     816
  24             730                    888             834
  25             735                    819                     762
  34             804                            816     768
  35             729                            882             831
  45             753                                    768     765
  123            810            870     744*    876
  124            762            801     882             897
  125            759            732*    813                     756
  134            756            939             876     759
  135            753            870            1014             825
  145            705*           801                     759     687*
  234            760                    882     738*    828
  235            765                    813     876             894
  245            717                    951             828     756
  345            711                            876     680*    825
  1234           792            933     876     870     891
  1235           789            864     807    1008             888
  1245           741            795     945             891     750
  1345           735            933            1008     753     819
  2345           747                    945     870     822     888
  12345          771            927     939     933     885     882

  * optimum for this model

Chapter VII. CONCLUSIONS

First of all, we have shown that distributed data base systems are useful and desirable for many environments. A distributed data base system can improve availability and response time. In addition, even when network transmission costs are included, a distributed system can be cost-effective (with respect to operational cost). Many researchers, having realized these facts, have made respectable contributions to the relevant literature [A1, A4, B4, C1, C2, C3, C4, D2, L2, etc.].

A great majority of the researchers in related areas assume that all data base updates are sent directly from the site generating the update to all sites which have a local copy of the data base (a non-primary model). We have shown that such an approach is feasible only in very restricted situations. In particular, an important condition seems to be that no update modify the fields used for the selection of records. For example, "if salary = x, then change salary to y" is not an acceptable update.

Our principal goal was to present the characteristics of a workable model. Clearly a non-primary model meets this goal in a very restricted set of situations. However, in those environments in which the power (as defined in IV.6) of this model is sufficient and the inherent restrictions (e.g., memory) are acceptable, we could be better off (i.e., obtain better throughput) by choosing a non-primary model. A Johnson-type non-primary model has the added advantage that it is expected to be easy to implement.

A major consideration for the application manager intending to use a Johnson-type model will probably be the synchronization of the clocks used by the different data base managers. Clock synchrony seems to be, in general, a desired feature.
However, tolerance to clock asynchrony is very dependent on the particular application. It is easy to imagine an environment in which some sites should have higher priority than others. This could be achieved by setting the clocks purposely wrong, giving a higher value to data base managers with higher priority.

If clock synchrony is desired, we could make use of one of the following options: a hardware solution, the Lamport scheme (section V.1), or our synchronization protocol of section V.1. Alternatively, we could avoid the need for synchronized clocks by using the Reservation Center model.

A hardware solution could be obtained by using redundant clocks. (Care must be taken to protect against wrong clock settings by operators.) Any of the other three alternatives is more sensitive to failures, such as partitioning. Each of them has some advantages and disadvantages which could help an application manager make his choice. Lamport's scheme is easy to implement but is susceptible to clock drifting in the positive direction. Our synchronization protocol of section V.1 is more complex to implement, and has some unappealing ways of correcting situations in which multiple clocks are incorrectly set back, but it seems to be capable of maintaining the clocks in reasonable synchronization with respect to the real time. The Reservation Center model RC(m,n) also has a relatively complex protocol, but it obtains a synchronization on the order of mn by eliminating clocks and replacing them by a centralized synchronizer.

In short, we have presented several versions of Johnson's model in the hope that a vast majority of the potential users of that model will find one of the given alternatives suited to their needs.

1  Note that maintaining this pre-established clock asynchrony then becomes an important factor. It can be done by a scheme almost like that for maintaining perfect synchronization.

2  By "order of mn" we mean that an update entered mn time instants after another one is, with very high probability, going to get a bigger ticket number.
For those applications which require more power, we have presented a primary model (introduced by Bunch [B5]). Under normal circumstances, all the data base managers in a primary model will be able to order the updates correctly before application and hence to apply them automatically in the correct order. This uniform application order allows any kind of operation to be supported. Thus, primary models are as powerful as we can get.

The transition from a non-primary to a primary model, for environments that could support either, is not cost free. Overall throughput and other factors are slightly degraded (sections IV.5 and IV.7) to obtain the additional power. In addition to more power, some other advantages, such as better memory utilization, are gained in the shift (section IV.7).

After convincing ourselves that, in general, a primary approach is the right one, we undertook the job of making it workable. Bunch's model [B5] is very sensitive to multiple failures (e.g., the missing update problem). Making a system resilient grows harder as the number of failures that we want to be able to overcome successfully increases. However, as failures grow more complex, involving more components, the probability of their occurrence is drastically reduced. In order to make a distributed data base system workable we have presented a compromise.

Rather than the impossible goal of complete resiliency, we presented an n-host resilient model (section V.3), n being the number of simultaneous failures which would be needed to disrupt the service. A disruption of the service will not affect the consistency of the system; it will only degrade the service (possibly to a query-only state). Given that the parameter n in the n-host resilient model is arbitrary, the application manager could decide on its value in accordance with his own needs.

In summary, we have shown that distributed data base systems are useful. We have presented and evaluated the available models capable of supporting such a system. We have studied the environments appropriate to each one of these models. Finally, we have presented workable versions of each one of the models. Thus, our original goal has been satisfied.

With regard to future work, the obvious extension is the actual implementation of a resilient primary model (for example, our broadcasting model). The complete and extensive specification of all the required details of such a model seems a likely step for the near future. It is not expected that this implementation will be a simple matter. Many small problems will have to be solved. We have tried to present all the substantial problems that might arise in such an implementation. However, our limited human nature is capable of overlooking some of them. Some other problems will probably not appear until the actual implementation starts to be operational. A similar statement could be made for the implementation of the clock synchronization schemes of section V.1.

Once we succeed with the implementation of a distributed data base system, it would be of great interest to study the interaction between files and programs in such a system. Once the file is in multiple locations, it seems very plausible that transferring complete jobs to another computer system will sometimes be economical. Savings could be due to a better utilization of the specific characteristics of each computer in a heterogeneous network [A1, A2] or due to load sharing [B1]. Network multiprocessing is a possible outcome of such a study. It is possible that it could prove to be worthwhile to transmit not only complete jobs but incomplete ones as well. We could actually be watching the beginning of a network multiprocessing operating system.

Many interesting problems remain to be studied in the general area of distributed data base systems. For example, strategies for improving or optimizing response need to be developed. In general, the very difficult topic of response time in a distributed data base system has not been extensively addressed. Chu [C5] is the only one to attempt a response-time analysis, but he makes a lot of simplifying assumptions of questionable validity.

In the area of file allocation there are many open problems. As we point out in appendix A, Levin's solution of the program-file allocation problem does not yield a true optimal allocation for the programs. Obtaining some insight into this problem could easily be of some practical value.

Many extensions are foreseen in the near future for the various ideas that we have introduced in the file allocation discussion of appendix A. Our suboptimal search algorithm has given some optimistic results, but it needs more thorough testing. The same is true for our a priori conditions.
They have performed very well for the examples presented in the available literature, but a broader spectrum of results is desired.

Appendix A. FILE ALLOCATION

Given the high degree of interaction between the file allocation problem and distributed data base systems, we find it advisable to include this appendix in order to properly present the state of the art of file allocation. In so doing we will also include some contributions that we have personally made.

A.1 The Problem

Once our availability and response studies indicate that we should take advantage of duplicate files, the immediate question is: Where? Where should the copies of a file be allocated in order to minimize some given cost function? The following lines are intended to assist us in answering this question.

In general a file allocation model has two types of transactions: updates and queries. The traffic required for such transactions through a given network depends (among other things) on the number of copies of the files. Since a given update must be seen by (i.e. sent to) all the copies of a file, the more copies we have the higher the update traffic will be. This traffic will be minimized if there is only one copy of each file. On the other hand, the addition of new copies of a file tends to reduce the network query traffic, up to the point where every site has its own copy of the data base and responds to its queries locally, so that there is no network query traffic. Clearly there is a tradeoff. We need to introduce a cost criterion and optimize a given system with respect to that cost.

Before we get into the details of how to objectively establish an optimal allocation, we should point out that the optimum is not always the best from all points of view. In real life we should be ready for some human antiefficient subjectivity, such as: I want a copy HERE! I don't want him to have a copy, he is not my friend! Either John or Tony might have a copy, but not both! These and other factors force upon us the consideration of restricted environments (section A.7), if the constraints imposed allow more than one feasible alternative.

A.2 Casey's model and theorems

The most common cost criterion used in the literature is an operational cost, in which update and query traffic as well as storage costs are considered. Casey [C1] gives us a typical example of such a cost function:

    C(I) = \sum_{j=1}^{n} \left[ \sum_{k \in I} \psi_j d'_{jk} + \lambda_j \min_{k \in I} d_{jk} \right] + \sum_{k \in I} \sigma_k

where:

    I = index set of sites with a copy of the file
    n = number of sites in the network
    ψ_j = update load originating at site j
    λ_j = query load originating at site j
    d_jk = cost of communication of one query unit from site j to site k
    d'_jk = cost of communication of one update unit from site j to site k
    σ_k = storage cost of the file at site k
    λ_j min_{k∈I} d_jk = cost of sending a query to the "closest" copy - or the one that is cheapest to transmit to
    ψ_j d'_jk = cost of sending an update to the k-th copy
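As a concrete illustration of the cost function just reproduced, the following sketch evaluates C(I) for an allocation given as a membership vector. It is not part of Casey's work or of the thesis programs; all names and the data layout are illustrative assumptions.

#include <float.h>

#define N 5                                   /* number of sites (illustrative) */

double casey_cost(const int in_I[N],          /* in_I[k] != 0 if site k holds a copy  */
                  const double psi[N],        /* update load originating at each site */
                  const double lambda[N],     /* query load originating at each site  */
                  const double d[N][N],       /* query transmission cost  d[j][k]     */
                  const double dp[N][N],      /* update transmission cost d'[j][k]    */
                  const double sigma[N])      /* storage cost at each site            */
{
    double cost = 0.0;

    for (int j = 0; j < N; j++) {
        double min_query = DBL_MAX;           /* stays "infinite" for the empty set   */
        for (int k = 0; k < N; k++) {
            if (!in_I[k])
                continue;
            cost += psi[j] * dp[j][k];        /* every copy receives the update       */
            if (d[j][k] < min_query)
                min_query = d[j][k];          /* queries go to the cheapest copy      */
        }
        cost += lambda[j] * min_query;
    }
    for (int k = 0; k < N; k++)               /* storage cost of the chosen copies    */
        if (in_I[k])
            cost += sigma[k];

    return cost;
}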
Once this cost function has been established, the optimal allocation (i.e. the set I which minimizes C(I)) could be obtained by several different approaches. As Casey mentions, this problem is analogous to the factory location problem found in the transportation (operations research) literature.

The classical solution of this kind of problem makes use of mixed integer-linear programming. Chu [C5] has presented a zero-one linear programming formulation of this problem.

Footnote. Actually Chu [C5] solves a slightly different problem, but one which is very easily modifiable to the current one. Chu assumes that all transactions (updates or queries) return some information to the user and that, if we are dealing with an update, a modified version of the original data is sent back to the data base.

In general, this kind of technique turns out to be computationally very expensive. On many occasions heuristics are used, sacrificing optimality for computational cost. For example, Levin [L2] points out that Chu's approach would produce about 9000 zero-one variables with 18000 constraints for an ARPA-like network with 30 sites and at least 10 files. There are inherent difficulties for large systems, but it should be pointed out that this approach is mathematically straightforward. Many restrictions can be added just by appending the right constraints.

Casey's alternative to the integer-programming approach is a heuristic search based upon the following two theorems [C1].

Theorem 1. Let d_jk = d'_jk for all j, k. If for some integer r, ψ_j ≥ λ_j / (r - 1) for all j, then any r-site file assignment is more costly than the optimal one-site assignment.

Theorem 2. Suppose assignment I is optimal. Then along any path in the cost graph from the null node to the node corresponding to I, cost is monotonically non-increasing from one node to the next.

Actually Casey's Theorem 2 is somewhat more general and is not stated in terms of the notions of "paths" and "nodes". These notions belong to the "cost graph" visualization of the optimal allocation problem. (See figure A.2-a.) At level i in the graph we have all the allocations of i copies (each node representing an allocation), with a branch from a node at level i to one at level i+1 if and only if the latter allocation can be obtained by adding a single copy to the former.

Figure A.2-a. Graph of all possible allocations among four sites.

Theorem 1 effectively gives us an upper bound (r) beyond which there is no need to search; i.e. a gross stopping criterion. Theorem 2 gives us a more precise stopping criterion. If while following a path we find that the cost increases, then no further search along this path will be of any use.

In Casey's words, "A computer algorithm can be implemented in several different ways to select file nodes [sites] one at a time up to the optimum configuration. One approach is to follow all paths in parallel through the cost graph, stepping one level per iteration. This method is computationally efficient, but may require a great deal of storage in a large problem. Alternatively, a program can be written to trace systematically one path at a time. Such a technique uses less storage, but may require redundant calculations since many different paths intersect each vertex."

In general we agree with this statement, but we dispute the last sentence. We believe that no node must be visited more than once in a "path at a time" approach. In a later section (A.6) we will defend our belief.

It should be noticed that Casey's solution is actually an intelligent and ordered enumeration that will in some cases result in a complete enumeration. (For example, if the minimum is obtained by putting a copy everywhere, the application of theorems 1 and 2 will not reduce our search.)
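Assuming the reading of Theorem 1 given above, the stopping level r that it supplies can be computed directly from the loads. The sketch below is illustrative only and is not taken from [C1]; it presumes every ψ_j is strictly positive.

#include <math.h>

#define N 5

/* Smallest r such that psi[j] >= lambda[j] / (r - 1) for every site j;
 * by Theorem 1 (as reconstructed above) the search of the cost graph
 * never needs to reach level r.
 */
int theorem1_bound(const double psi[N], const double lambda[N])
{
    double worst = 0.0;                 /* max over j of lambda_j / psi_j */
    for (int j = 0; j < N; j++) {
        double ratio = lambda[j] / psi[j];
        if (ratio > worst)
            worst = ratio;
    }
    return 1 + (int)ceil(worst);
}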
A.3 Other related work

Levin's Ph.D. thesis [L2] is presented as an extension of Casey's work. In Levin's work, files and programs are separated and a staged minimization approach is tried. The staging may be described by applying, to the cost function,

    (1) a minimization over file locations,
    (2) a minimization over program locations (given the file locations), and
    (3) a minimization over routings (given the file and program locations).

When we carry out the minimization (3), we get the expected answer: send all queries to the closest copy, and send all updates everywhere they are needed. To solve minimization problem (2), Levin assumes that storage costs for programs are zero. This implies that we store programs everywhere we are allowed to (and can find any slight advantage in doing so). Finally, minimization (1) follows Casey's ideas.

It is our belief that the idea of separating files and programs could be useful in the light of Alsberg's [A1] discussion of the disparity of processing costs in the ARPA network. Levin goes on to use similar ideas to describe other environments where the transmission rates are not known or where they change dynamically with time. These seem to be valuable considerations.

Going into the details of Levin's procedure, we find that minimization (2) is a little tricky. It actually solves the problem by avoiding it. The assumption that program storage has zero cost (regardless of the rationale behind such a decision) implies that programs are basically stored everywhere. In this circumstance, program locations are chosen a priori by the designer. This contradicts Levin's argument that it is necessary to minimize the number of copies of a program because of the problems of maintaining updated versions of a program in a heterogeneous network. Furthermore, this type of program allocation could be obtained immediately from Casey's model if some additional precomputing is done to reflect a change in the routing technique. Each transmission from one node to another will have to touch at least one of the preestablished program sites. Thus, if the original d_ij does not touch any such site, then it should be modified to d_ij = d_ik + d_kj, where k is the program site which minimizes d_ij.

Urano et al. [U1] establish a concept of proximity. Suppose that two sites A and B have the following property: given that we have a copy at one of them (let's say B), the cost of referring to B every transaction that arrives at A is smaller than the cost of having a second copy at A. In this case, no optimal solution will include both copies. To state this result formally, first define P_x as

    P_x = \frac{\sum_{i=1}^{n} \psi_i d'_{ix} + \sigma_x}{Q} - \delta_x = \frac{Z_x}{Q} - \delta_x ,   with   Q = \sum_{i=1}^{n} \lambda_i ,

where δ_x is the relay cost at x (the cost experienced by site x if the only thing it does to a message is to send it somewhere else), Z_x is the storage and update cost of having a copy at x, and Q is the total query load. Then:

[Urano's Property 2]. In an optimal allocation there is no pair of sites (A,B) such that P_AB > d_AB. That is, if P_AB ≡ min(P_A, P_B) > max(d_AB, d_BA), then A and B will never simultaneously be in an optimal allocation.

The above result follows from the fact that if Q(d_xy + δ_x) is less than Z_x, then it is cheaper to send everything to y rather than have a second copy at x.

Footnote. Some additional assumptions are made in [U1]. We omit them because no clarity is gained by including them.
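Under the reconstruction of P_x given above, Urano's Property 2 reduces to a simple pairwise test. The following sketch is not from [U1] or from the thesis; the data layout and names are assumptions made only for illustration.

#define N 5

/* P_x = Z_x / Q - delta_x, with Z_x the storage-plus-update cost of a copy
 * at x, Q the total query load, and delta_x the relay cost at x.
 */
double p_value(int x, const double psi[N], const double lambda[N],
               const double dp[N][N], const double sigma[N],
               const double delta[N])
{
    double zx = sigma[x], q = 0.0;
    for (int i = 0; i < N; i++) {
        zx += psi[i] * dp[i][x];     /* update traffic into a copy at x */
        q  += lambda[i];             /* total query load Q              */
    }
    return zx / q - delta[x];
}

/* Property 2 test: returns 1 if sites a and b can never both appear in an
 * optimal allocation, i.e. if min(P_a, P_b) > max(d_ab, d_ba).
 */
int mutually_exclusive(int a, int b, const double psi[N],
                       const double lambda[N], const double dp[N][N],
                       const double d[N][N], const double sigma[N],
                       const double delta[N])
{
    double pa = p_value(a, psi, lambda, dp, sigma, delta);
    double pb = p_value(b, psi, lambda, dp, sigma, delta);
    double pmin = (pa < pb) ? pa : pb;
    double dmax = (d[a][b] > d[b][a]) ? d[a][b] : d[b][a];
    return pmin > dmax;
}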
This approach seems useful when we have a large number of sites. The exhaustive search of an n-site network would require the evaluation of up to 2^n costs. But suppose the sites are partitioned into proximity classes with P_i elements each.

Condition 1. Unquestionable inclusion. A site A should be included in any optimal allocation if:

    \lambda_A \min_{B \neq A} d_{AB} > Z_A                                  (A.5-i)

That is, site A should unquestionably be included in any optimal allocation if the cost Z_A of having a local copy is smaller than the minimum cost of having to send the queries elsewhere. A formal proof of this condition is given in appendix B.

This very simple condition would have actually reduced Casey's 5-site example [C1] to a 3-site one. In Casey's example, site 4 and site 5 should be unquestionably included. For site 4 the left side of A.5-i is 144 as compared to 126 for the right side. Similarly, for site 5 the left side is 144 as compared to 123 for the right side.

Given that we know that sites 4 and 5 must be included, it is only a question of deciding whether to include sites 1, 2 and 3. Thus, a program similar to the one presented in appendix C will have to evaluate costs for only 7 nodes (2^3 - 1), as compared to 31 (2^5 - 1) if this condition is not used.

Condition 2. Unquestionable exclusion. A site A will not be included in any optimal solution containing more than one site if:

    Z_A > \sum_{B} \lambda_B \left( \max_{C} d_{BC} - d_{BA} \right)        (A.5-ii)

That is, a site A should not be included in an optimal allocation if the cost of having a copy at A is bigger than the greatest possible savings that we could obtain. A formal proof is provided in appendix B.

Conditions 1 and 2 can be summarized as follows. Let

    m_i = \lambda_i \min_{k \neq i} d_{ik}   and   M_i = \sum_{j=1}^{n} \lambda_j \left( \max_{k} d_{jk} - d_{ji} \right) .

Then for each i the real line is partitioned by m_i and M_i into three regions, as shown in figure A.5-a. If Z_i falls in region 1, then site i should unquestionably be included in any optimal allocation. If it falls in region 3, it will generally be excluded. An exceptional case is when all sites fall in region 3, i.e., satisfy inequality A.5-ii. In this case, all optimal allocations will be single-copy ones, and we need only choose the best of these. Sites whose costs Z_i fall in region 2 (including the boundary points) must be considered further.

         Region 1         Region 2         Region 3
    ----------------|----------------|----------------
                   m_i              M_i

Figure A.5-a. Partitioning of the real line into three regions using m_i and M_i as delimiters.

One other comment should be made. If network transmission costs are independent of the sites involved, so that d_ij = d for all i ≠ j, then m_i = M_i = λ_i d, and region 2 collapses to a single boundary point. Unless the cost Z_i of some site lies on the boundary, every site is a priori either included or excluded. Immediately we see that then the set of sites in region 1 forms the optimal allocation, unless the exceptional case noted above obtains. Computation is therefore reduced to a minimum (essentially n cost evaluations) for many real network environments. (For example, most commercial packet-switched networks charge per packet, irrespective of the distance the packet is to be sent.) There are, of course, real cost differentials incurred in the longer lines between sites, but from the user's point of view it is the network charging policy that matters.

Condition 3. Neighborly excluded. If we have two or more sites that are relatively close, this condition could guarantee that one will always be excluded from an optimum. Suppose that for two sites A and B all possible savings that could be obtained by using B instead of A are offset by B's being more expensive than A. That is,

    \sum_{j \in J'} \lambda_j \left( d_{jA} - d_{jB} \right) \leq Z_B - Z_A ,   where   J' = \{ j \mid d_{jB} < d_{jA} \} .

Then site B will never be included in an optimal allocation; it is neighborly excluded by A.
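A minimal sketch of the screening that conditions 1 and 2 provide is given below. It is not one of the thesis programs; it simply computes m_i and M_i as defined above and reports the region of figure A.5-a into which Z_i falls. Names and data layout are illustrative.

#define N 5

enum region { REGION1_INCLUDE, REGION2_UNDECIDED, REGION3_EXCLUDE };

enum region classify_site(int i, const double lambda[N],
                          const double d[N][N], const double z[N])
{
    double m = -1.0, M = 0.0;

    for (int k = 0; k < N; k++) {             /* m_i = lambda_i * min_{k != i} d_ik */
        if (k == i)
            continue;
        if (m < 0.0 || d[i][k] < m)
            m = d[i][k];
    }
    m *= lambda[i];

    for (int j = 0; j < N; j++) {             /* M_i = sum_j lambda_j (max_k d_jk - d_ji) */
        double dmax = d[j][0];
        for (int k = 1; k < N; k++)
            if (d[j][k] > dmax)
                dmax = d[j][k];
        M += lambda[j] * (dmax - d[j][i]);
    }

    if (z[i] < m) return REGION1_INCLUDE;     /* condition 1: unquestionable inclusion */
    if (z[i] > M) return REGION3_EXCLUDE;     /* condition 2: generally excluded       */
    return REGION2_UNDECIDED;                 /* region 2: must be searched explicitly */
}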
At each level of the stack, the program keeps a vector containing the query cost currently paid on behalf of each site; from it one can quickly evaluate the query transmission savings that a new allocation (adding or changing a site) will produce, by comparing the values of this vector with the cost that must be paid if the queries are sent to the new copy. These vectors increase our memory requirement to (n^2 + n), or simply to O(n^2), but the run-time could be substantially reduced.

A.7 Restricted environments

In many real life situations external conditions impose restrictions on our models. For example: to improve availability we could be required to have more than a certain number of copies regardless of cost. Security considerations could forbid one site from holding a copy while requiring another to have one. The average response time for queries might be required to be lower than some upper bound. For our discussion we will divide the possible constraints into 3 categories: (i) level constraints, (ii) site constraints, and (iii) other constraints.

(i) Level Constraints. If the levels (specifying the number of copies) for which the optimization is to be carried out are restricted (for example, if availability considerations impose a lower bound on the number of copies), then the techniques described before are not always useful.

If we could establish that the overall (global) optimum has not been eliminated by such a restriction, then all the results given in previous sections are valid here. Actually, some minor modifications could make the program in appendix C work more efficiently in this case by starting the search at a higher level. When we know that the restricted optimum is not a global optimum, then Casey's theorem 2 is useless. However, conditions 1, 2 and 3 could still be used if they do not drive us out of the feasible space (i.e., do not unquestionably exclude more than the number of nodes that should be excluded). But even if they drive us out of the feasible space, this fact could give us a great deal of information. (That is, if the number of sites "unquestionably excluded" is too big, we know that all the others should be included and we should choose the remaining ones from the unquestionably excluded ones.)

In general we do not know whether the global optimum is included or excluded in a restricted problem. In these circumstances, an exhaustive search of all feasible solutions might be needed, especially if the feasible space is relatively small. Alternatively, we could try one of the heuristic techniques in (iii), below.

(ii) Site Constraints. If we are given this type of constraint, we should in general be thankful. These restrictions tend to reduce the size of our problem. Actually, conditions 1, 2 and 3 and Urano's property 2 impose this kind of restriction, and by so doing reduce the problem.

(iii) Other Constraints. Casey's model is not suited to all kinds of restrictions. If response time is a concern, it is not immediately clear how to incorporate it. Furthermore, if we follow Chang's idea [C3] of having a nonlinear cost function or Levin's suggestion [L2] of a dynamic query and update load, Casey's model starts to look inadequate. Rather than destroying Casey's model, we will describe a heuristic algorithm that could be used in a few of these cases to locate sub-optimal solutions. Other cases will be briefly discussed in our conclusion section.
We take this approach because we believe that Casey's model is adequate for most purposes.

In many circumstances, Casey's approach could be unreasonably expensive even with the assistance of the theorems and conditions given above. In many other cases, due to imposed restrictions, it might be impractical to try Casey's approach. In all these cases, we could use the following algorithm, which will lead us (in general) to a sub-optimal local minimum.

The algorithm is based on the idea of letting the cost "roll down" along a path in the cost graph, up to the point where no more rolling down is possible. At that moment we are at a local minimum. To implement the algorithm we start with any feasible node and start testing the neighboring feasible nodes until we find one with a lower cost. At that point we roll down to that node and start all over again, up to the point where no more rolling down is possible; i.e., all feasible neighbors are more costly. Then we are at a local minimum. It is possible that the cost of a node will be evaluated more than once. However, a roll down to a node that has been evaluated more than once cannot occur. Thus there is no danger of an eternal loop. For example, consider the 3-site network of figure A.7-a.

Figure A.7-a. A piece of a three-site cost graph (nodes 111, 110, 101 and 100). Costs are such that C(A) > C(B) > C(C) > C(D); D is a local minimum. Allocation A corresponds to a file at sites 1 and 3 (101), B to a file at the three sites (111), etc.

If in figure A.7-a we start with node A (copies at sites one and three) and follow the search through B, C and D, we would probably require that node A be reevaluated when we are at node D in order to establish the local optimality of D.

The reader could then argue that this method could be very costly, since nodes might be evaluated more than once, and he is right. To avoid this we could have a flag for each node, and turn it on when an evaluation is made. This approach could be costly in memory. (2^n flag bits might be required.) Alternatively, we might try to reduce the probability of duplicate node evaluation by choosing the nodes in such a fashion that we always try to move away from recently visited nodes. The following algorithm implements such a search.

A.8 Suboptimal search algorithm

The search will be performed by changing the allocation by one site at a time. (If the site has a copy, we eliminate it; otherwise we allocate a copy there.) In order to reduce the probability of multiple evaluations, we will try to make this change at every other site before again trying to change the situation of a given site. Thus, when we arrive at a new node, we will try to get as far as possible (up to n nodes away) by modifying all the specific allocations that led us to that node. If we do not succeed in getting very far (so that the probability of a reevaluation is bigger), it is probable that we are close to the local minimum, and the number of such reevaluations would hopefully be small.

In order to do the search in an organized way we will make use of a circular list (CLIST) with all the site numbers contained once and only once in it. We will then go around the circular list trying to alter the allocation state (i.e. whether or not the site has a copy) of each site until we reach a local minimum.
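The roll-down idea, with the sites scanned in circular order as with CLIST, can be sketched as follows. This is not the thesis algorithm itself: it ignores the feasibility restrictions of section A.7 and the refinements described above, and casey_cost() is assumed to be an evaluation routine such as the one sketched in section A.2.

#define N 5

extern double casey_cost(const int in_I[N], const double psi[N],
                         const double lambda[N], const double d[N][N],
                         const double dp[N][N], const double sigma[N]);

void roll_down(int in_I[N], const double psi[N], const double lambda[N],
               const double d[N][N], const double dp[N][N],
               const double sigma[N])
{
    double best = casey_cost(in_I, psi, lambda, d, dp, sigma);
    int next = 0, failures = 0;

    while (failures < N) {                   /* N consecutive failures: local minimum */
        int site = next;                     /* next site on the circular list        */
        next = (next + 1) % N;

        in_I[site] = !in_I[site];            /* toggle: add or drop this copy         */
        double trial = casey_cost(in_I, psi, lambda, d, dp, sigma);

        if (trial < best) {                  /* roll down to the cheaper neighbor     */
            best = trial;
            failures = 0;
        } else {                             /* costlier: undo the change             */
            in_I[site] = !in_I[site];
            failures++;
        }
    }
}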
Given that Z_A < λ_A min_{B≠A} d_AB, it follows that C(I') < C(I). Contradiction: A must be included.

Note. If a small nonzero local query cost is assumed, so that instead of d_AA = 0 we have d_AA ≤ d_AB for all B ≠ A, then condition 1 becomes

    \lambda_A \left( \min_{B \neq A} d_{AB} - d_{AA} \right) > Z_A ,

and the proof above goes through with only minor changes.

Proof of Condition 2. Given an allocation I with cost C(I), let I' = I ∪ {A}. Consequently

    C(I') = C(I) + Z_A + \sum_{j=1}^{n} \lambda_j \left( \min_{x \in I'} d_{jx} - \min_{y \in I} d_{jy} \right) .

Then C(I') will always be bigger than C(I) if

    Z_A > \sum_{j=1}^{n} \lambda_j \left( \min_{y \in I} d_{jy} - \min_{x \in I'} d_{jx} \right) .

Now, making a term-by-term comparison, we find that

    \sum_{j} \lambda_j \left( \max_{C} d_{jC} - d_{jA} \right) \geq \sum_{j} \lambda_j \left( \min_{y \in I} d_{jy} - \min_{x \in I'} d_{jx} \right) .

Thus, condition 2 is sufficient to imply that C(I') > C(I). If I = ∅ the proof clearly will not go through. That is why we included the phrase "containing more than one site" in the statement of condition 2.

Proof of Condition 3. The proof requires two preliminary lemmas.

Lemma 1. If

    \sum_{j'} \lambda_{j'} \left( d_{j'A} - d_{j'B} \right) \leq Z_B ,                                   (B-a)

then C(I ∪ {A} ∪ {B}) > C(I ∪ {A}).

Proof: As in the preceding proofs,

    C(I ∪ {B} ∪ {A}) - C(I ∪ {A}) = Z_B + \sum_{j=1}^{n} \lambda_j \left( \min_{x \in I \cup \{A\} \cup \{B\}} d_{jx} - \min_{y \in I \cup \{A\}} d_{jy} \right)

and

    C(I ∪ {B} ∪ {A}) > C(I ∪ {A})  \iff  Z_B > \sum_{j} \lambda_j \left( \min_{y \in I \cup \{A\}} d_{jy} - \min_{x \in I \cup \{A\} \cup \{B\}} d_{jx} \right) .        (B-b)

Condition (B-a) guarantees (B-b) if and only if

    \sum_{j'} \lambda_{j'} \left( d_{j'A} - d_{j'B} \right) \geq \sum_{j=1}^{n} \lambda_j \left( \min_{y \in I \cup \{A\}} d_{jy} - \min_{x \in I \cup \{A\} \cup \{B\}} d_{jx} \right) .        (B-c)

Looking at (B-c) term by term, we have 3 cases:

1) min_{x∈I∪{A}∪{B}} d_jx ≠ d_jA, d_jB. In this case the right side of (B-c) is zero and the left is nonnegative.

2) min_{x∈I∪{A}∪{B}} d_jx = d_jA. In this case both sides are zero. (j does not satisfy the condition on j'.)

3) min_{x∈I∪{A}∪{B}} d_jx = d_jB. In this case the left side is no smaller than the right, since d_jA ≥ min_{y∈I∪{A}} d_jy and the second terms (-λ_j d_jB) are the same.

We conclude that (B-c) always holds, and thus (B-a) is enough to imply that C(I ∪ {A} ∪ {B}) > C(I ∪ {A}).

Lemma 2. If

    \sum_{j'} \lambda_{j'} \left( d_{j'A} - d_{j'B} \right) \leq Z_B - Z_A ,                              (B-d)

then C(I ∪ {A}) < C(I ∪ {B}).

Proof:

    C(I ∪ {B}) - C(I ∪ {A}) = Z_B - Z_A + \sum_{j=1}^{n} \lambda_j \left( \min_{x \in I \cup \{B\}} d_{jx} - \min_{y \in I \cup \{A\}} d_{jy} \right) .

Therefore

    C(I ∪ {B}) > C(I ∪ {A})  \iff  Z_B - Z_A > \sum_{j=1}^{n} \lambda_j \left( \min_{y \in I \cup \{A\}} d_{jy} - \min_{x \in I \cup \{B\}} d_{jx} \right) .

Now if (B-d) holds, then it is only necessary to prove that

    \sum_{j'} \lambda_{j'} \left( d_{j'A} - d_{j'B} \right) \geq \sum_{j=1}^{n} \lambda_j \left( \min_{y \in I \cup \{A\}} d_{jy} - \min_{x \in I \cup \{B\}} d_{jx} \right) .

As in the proof of Lemma 1, we have 3 cases:

1) min_{x∈I∪{A}∪{B}} d_jx ≠ d_jA, d_jB. Then the right side is zero and the left side is nonnegative.

2) min_{x∈I∪{A}∪{B}} d_jx = d_jA. Then the left side is zero and the right side is negative (or zero, if min_{x∈I∪{B}} d_jx = d_jA).

3) min_{x∈I∪{A}∪{B}} d_jx = d_jB. Then the left side is at least as big as the right, by the same argument as for case 3) in Lemma 1.

Then C(I ∪ {A}) < C(I ∪ {B}) if (B-d) holds.

Finally, we apply these lemmas to prove condition 3. Since Z_A ≥ 0, if Σ_{j'} λ_{j'} (d_{j'A} - d_{j'B}) ≤ Z_B - Z_A, then Lemma 1 and Lemma 2 hold. Given two sites A, B, the optimal allocation may include:

    i) neither of them,
    ii) only one of them,
    iii) both of them.

In case i) site B is not included. In case ii) the inclusion of site B is nonoptimal, since Lemma 2 tells us that if we substitute A for B we get a cheaper allocation. In case iii) the inclusion of site B is nonoptimal, because Lemma 1 guarantees that if we exclude it we get something cheaper. Therefore site B should never be included; thus it is neighborly excluded by A.

Appendix C. PROGRAM TO SEARCH THE COST TREE
/* The following program was written in the algol-like C language of
   Bell Labs' Unix (PDP-11) system.  It is designed to implement the
   preorder search of the modified cost tree (see fig. A.6-a) discussed
   in appendix A. */

#define n 5            /* test number */
#define nplus1 6

/* global variables */
double  up[n][n],               /* update cost matrix */
        qu[n][n],               /* query cost matrix */
        sigma[n],               /* storage cost vector */
        z[n],                   /* cost of storing a copy.  See appendix A. */
        stkcost[nplus1],        /* stack entry for allocation cost */
        stkqucst[nplus1][n];    /* stack entry to store the query cost paid in this
                                   allocation (first index) for the query load of
                                   each host (second index) */
int     stkvalue[nplus1],       /* stack entry to store the number of the host added
                                   in this allocation */
        level,                  /* pointer to the stack.  Equivalent to tree level */
        count,                  /* number of allocations tried */
        yesno,                  /* 0 if a detailed listing is desired */
        alloca[n];              /* alloca[i] is 1 if host i+1 has a copy of the file
                                   in the current allocation */

/* main procedure */
main()
{
    int minalloc[n];            /* equivalent to alloca for the minimal allocation */
    int j;                      /* miscellanea */
    extern double up[n][n], qu[n][n], sigma[n], z[n], stkcost[nplus1],
                  stkqucst[nplus1][n];
    extern int stkvalue[nplus1], level, alloca[n], count, yesno;
    double minimum;             /* contains the minimal current cost */
    int pop();

    getdata();                  /* gets data for up, qu and sigma.  Computes z */
    printf("do you want detail? yes(0) or no(1)\n");
    scanf("%d", &yesno);
    /* initialization */
    stkcost[0] = 999999;        /* cost of the null allocation.  We use 999999 as infinity */
    minimum = 999999;
    level = 0;                  /* we start at the tree's root */
    push(1);                    /* evaluates allocation {1} and fills stk...[1] */

    /* main loop follows */
    while (level > 0)
    {
        if (stkcost[level] < stkcost[level - 1])    /* then we roll down, i.e. it's better */
        {
            if (stkcost[level] < minimum)           /* then it's a new minimum */
            {
                minimum = stkcost[level];
                for (j = 0; j < n; j = j + 1) minalloc[j] = alloca[j];
            }
            if (stkvalue[level] < n)                /* then we can grow one level */
                push(stkvalue[level] + 1);
            else                    /* can't grow more thru this branch, try the next one */
            {
                j = pop();
                if (j == n) j = pop();
                push(j + 1);        /* we're in the next branch */
            }
        }
        else        /* no optimum this way.  We trim the tree and go to the next branch */
        {
            pop();
            j = pop();
            push(j + 1);            /* we're in the next branch */
        }
    }   /* end of while and main loop */

    printf("minimum cost - %f for allocation - ", minimum);
    for (j = 0; j < n; j = j + 1)
        if (minalloc[j] == 1) printf("%d/", j + 1);
    printf("\nvisited %d nodes", count);
}   /* end of main procedure */
/* procedure push */
push(site)
int site;
{
    /* We are given a site to be included in the allocation by adding it at
       location level+1 of the stack.  The proper cost is computed and level
       is incremented by 1. */
    extern double qu[n][n], stkqucst[nplus1][n], z[n], stkcost[nplus1];
    extern int level, stkvalue[nplus1], alloca[n], count, yesno;
    double cost;
    int j;

    if (level < 0) return;              /* this happens only at the end */
    if (level == 0) cost = 0.0;         /* no previous allocation */
    else cost = stkcost[level];

    /* we now obtain the effects of query transmission savings */
    for (j = 0; j < n; j = j + 1)
    {
        if (level == 0)                 /* no previous allocation */
        {
            cost = cost + qu[j][site - 1];
            stkqucst[level + 1][j] = qu[j][site - 1];
        }
        else                            /* we find out if we get some savings */
        {
            if (stkqucst[level][j] > qu[j][site - 1])   /* we do save */
            {
                cost = cost - stkqucst[level][j] + qu[j][site - 1];
                stkqucst[level + 1][j] = qu[j][site - 1];
            }
            else                        /* no savings, we keep the previous valid cost */
                stkqucst[level + 1][j] = stkqucst[level][j];
        }
    }
    /* we now know the query savings.  We finish our job next */
    level = level + 1;
    stkcost[level] = cost + z[site - 1];        /* we add update and storage cost */
    stkvalue[level] = site;
    alloca[site - 1] = 1;
    count = count + 1;
    if (yesno == 0)                     /* we print detail */
        printf("we add site %d at level %d and obtain cost %f\n",
               site, level, stkcost[level]);
}

/* procedure pop */
int pop()
{
    extern int stkvalue[nplus1], level, alloca[n], yesno;

    if (yesno == 0) printf("pop\n");
    alloca[stkvalue[level] - 1] = 0;    /* we deallocate */
    level = level - 1;
    return (stkvalue[level + 1]);
}

/* procedure get data */
getdata()
{
    extern double up[n][n], qu[n][n], sigma[n], z[n];
    /* Here we load the data for up, sigma, and qu.  Then we compute z.  */
    /* The details of this procedure are of no particular interest      */
    /* and are left out of this listing.                                 */
}

Appendix D. PROOF OF CASEY'S THEOREM FOR A PRIMARY SCHEME

We now prove the following lemma, analogous to Casey's lemma [C1, p. 620] for a non-primary model. Just as in Casey's model, this lemma leads straightforwardly to the fact that costs are monotonically non-increasing along paths to the minimum. (See chapter IV.)

Without loss of generality we will assume that the primary is at site 1. Sites 2 and 3 are any other two sites. We have to prove that if allocation I = {1, 2, ..., r} is cheaper than allocations I - {2} and I - {3}, then it is cheaper than I - {2,3}. The notation I - {k} has the same meaning as in chapter VI, i.e., allocation I excluding site k.

Lemma. If C(I) ≤ C(I - {k}), k = 2, 3, then C(I - {k}) ≤ C(I - {2,3}), k = 2, 3.

Proof. Let X_k = C(I - {k}) - C(I) and Y_k = C(I - {2,3}) - C(I - {k}). We wish to show that if X_2 ≥ 0 and X_3 ≥ 0, then Y_2 ≥ 0 and Y_3 ≥ 0. This will be true if Y_3 - X_2 ≥ 0 and Y_2 - X_3 ≥ 0. Now

    X_2 = -\sigma_2 - \Psi d'_{12} + \sum_{j} \lambda_j \left( \min_{k \in I - \{2\}} d_{jk} - \min_{k \in I} d_{jk} \right)

    Y_3 = -\sigma_2 - \Psi d'_{12} + \sum_{j} \lambda_j \left( \min_{k \in I - \{2,3\}} d_{jk} - \min_{k \in I - \{3\}} d_{jk} \right) ,

where Ψ denotes the total update load. So

    Y_3 - X_2 = \sum_{j} \lambda_j \left( \min_{k \in I - \{2,3\}} d_{jk} - \min_{k \in I - \{3\}} d_{jk} - \min_{k \in I - \{2\}} d_{jk} + \min_{k \in I} d_{jk} \right) .

The remainder of the proof follows exactly Casey's proof; we will not copy it here. Essentially, one sees easily that the quantity in parentheses is positive for all j, and the inequality Y_2 - X_3 ≥ 0 follows by permuting indices.

The path-searching program in appendix C will therefore still work. The primary is fixed a priori, and allocations not including the primary are assigned infinite cost. To find the best (on the basis of cost) site for the primary, we may do the optimization n times, each time assuming that a different site is the primary. Then we choose the optimum site as the one giving the minimum of all minima. This approach expands our problem by a factor of n/2; i.e., we have n problems, each of size roughly half as big as before.

LIST OF REFERENCES
Alsberg, P. A. "Distributed Processing on the ARPA Network - Measurements of the Cost and Performance Tradeoffs for Numerical Tasks," Proceedings of the Eight International Conf. on System Science, pp. 19-24, 1975, A2. Alsberg, P. A. "Space and Time Savings Through Large Data Base Compression and Dynamic Restructuring," Proceedings of the IEEE, Vol. 63, No. 8, August 1975. ■ I! IF"" ■ IE •i ir It Igl A3. Alsberg, P. A., Belford, G.G., Bunch, S.R., Day, J.D., Grapa, E., Healy, D.C., McCauley, E.J., and Willcox, D.A. "Synchronization and Deadlock," CAC Doc. 185 (CCTC-WAD Doc. 6503), Center for Advanced Computation, University of Illinois at Urbana- Champaign, March 1976. A4. Alsberg, P. A., Belford, G.G.^ Day, J.D. and Grapa, E. "Multi-copy Resiliency Techniques," CAC Doc. 202 (CCTC-WAD Doc. 6505), Center for Advanced Computation, University of Illinois at Urbana-Champaign, May 1976. A5. Alsberg, P. A. and Day, J.D. "A Principle for Resilient Sharing of Distributed Resources," To be presented in the 2nd International Conf. on Software Engineering, October 1976. Bl. Belford, G.G., Day, J.D., Sluizer, S., and Wilcox, D.A. "Initial Mathematical Model Report," CAC Doc. 169 (JTSA Doc. 5511) Center for Advanced Computation, University of Illinois at Urbana- Champaign, August 1975. B2. Belford, G.G., Schwartz, P.M., and Sluizer, S. "The Effect of Backup Strategy on Data Base Availability," CAC Doc. 181 (CCTC-WAD Doc. 6501), Center for Advanced Computation, University of Illinois at Urbana-Champaign, February 1976. B3. Belford, G.G. "Optimization Problems in Distributed Data Managment , " CAC Doc. 197 (CCTC-WAD Doc. 6504), Center for Advanced Computation, University of Illinois at Urbana-Champaign, May 1976. B4. Belford, G.G., Day, J.D., Grapa, E. , and Schwartz, P.M. "Network File Allocation," CAC Doc. 203 (CCTC-WAD Doc. 6506), Center for Advanced Computation, University of Illinois at Urbana-Champaign, August 1976. B5. Bunch, S.R. "Automated Backup," in Preliminary Research Study Report, CAC Doc, 162 (JTSA Doc. 5509), Center for Advanced Computation, University of Illinois at Urbana-Champaign, May 1975. 169 CI. Casey, R.G. "Allocation of Copies of a File in an Information Network," AFIPS Conf. Proceedings, Vol. 40, pp. 617-625, 1972. C2. Chandy, K.M. and Hewes, J.E. "File Allocation in Distributed Systems," Proceedings International Symp. on Comp. Performance Modeling, Measurement and Evaluation, pp. 10-13, March 1976. C3. Chang, S. "Data Base Decomposition in a Hierarchical Computer System," International Conf. on Management Data, ACM SIGMOD, pp. 48-53, May 1975. CA. Chu, W.W. "Optimal File Allocation in a Multi-computer Information System," IEEE Transactions on Computers, Vol. C-18, No. 10, pp. 885-889, October 1969. C5. Chu, W.W. "Optimal File Allocation in a Computer Network," in Computer- Communications Networks, N. Abramson and F. Kuo (Eds.), Prentice- Hall, Englewood Cliffs, N.J., 1973. Dl. Day, J.D. "Resilient Protocols for Computer Networks," in Preliminary Research Study Report, CAC Doc. 162 (JTSA Doc. 5509), Center for Advanced Computation, University of Illinois at Urbana-Champaign, May 1975. D2. Day, J.D. and Belford, G.G. "A Cost Model for Data Distribution," CAC Doc. 179 (JTSA Doc. 5514), Center for Advanced Computation, University of Illinois at Urbana- Champaign, November 1975. D3. Dijkstra, E.W. "Co-operating Sequential Processes," Programming Languages, F. Genuys, (ed.), New York, Academic Press, pp. 43-112, 1968. ■ ' i ■1 ■■:. m -: ■-■ El. Eswaran, K.P. 
"Placement of Records in a File and File Allocation in a Computer Network," IFIP 74, Amsterdam, pp. 304-307, 1974. Fl. Frank, H. , Kahn, R.E., and Kleinrock, L. "Computer Communication Network Design - Experience with Theory and Practice," Spring Joint Computer Conf., AFIPS Conf. Proceedings, Vol. 40, pp. 255-270, 1972. Gl. Grapa, E. and Belford, G.G. "Some Theorems to Aid in Solving the File Allocation Problem," Submitted to CACM. HI. Harris, B. "Theory of Probability," Addison-Wesley, 1966. I ■ * ■ Ml' la Ml) Ml ... Ill •i ' ' iu Jl. Johnson, P.R. and Beeler, M. "Notes on Distributed Data Bases," Draft Report, available from the authors (Bolt, Beranek, and Newman, Inc., Cambridge, Mass.), 1974. J2. Johnson, P.R. and Thomas, R.H. "The Maintenance of Duplicate Databases," RFC #677, NIC #31507, Jan. 1975. (Available from ARPA Network Information Center, Stanford Research Institute, Augmentation Research Center, Menlo Park, CA. ) Kl. Kleinrock, L. "Analytic and Simulation Methods in Computer Network Design," Spring Joint Computer Conf., Atlantic City, N.J., AFIPS Conf. Proceedings, 36, pp. 569-578, 1970. K2. Kleinrock, L. "Scheduling, Queueing, and Delays in Time-Shared Systems and Computer Networks," in Computer Communication Networks, N. Abramson and F.F. Kuo (Eds.), Prentice-Hall Englewood Cliffs, N.J., pp. 95- 141, 1973. LI. Lamport, L. "Time, Clocks and the Ordering of Events in a Distributed System," Massachusetts Computer Associates, Inc., March 1976. L2. Levin, K.D. "Organizing Distributed Data Bases in Computer Networks," Ph.D. Dissertation, University of Pennsylvania, 1974. Ml. Martin, J. "Security, Accuracy and Privacy in Computer Systems," Prentice- Hall, Inc., 1973. Nl. Naylor, W. and Opderbeck, H. "Mean Round-Trip Times in the ARPANET," RFC #619, NIC #21990, Network Measurement Group Note #19, NIC #21791, March 1974. SI. Spinetto, R.D. "A Facility Location Problem," SIAM Review 18, pp. 294-295, 1976. Ul. Urano, Y. , Ono, K. , and Inoue, S. "Optimal Design of Distributed Networks," ICCC Stockholm, pp. 413-420, 1974. LIOGRAPHIC DATA ET 1. Report No. UIUCDCS-R-76-831 ]de and Subtitle CHARACTERIZATION OF A DISTRIBUTED DATA BASE SYSTEM 3. Recipient's Accession No. 5- Report Date October 1976 6. uthor(s) Enrique Grapa 8. Performing Organization Rept. No - UIUCDCS-R-76-831 rforming Organization Name and Address Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. DCA100-76-C-0088 Sponsoring Organization Name and Address Command and Control Technical Center WWMCCS - ADP - Directorate 11440 Isaak Newton Square, North Reston, Virginia 22090 13. Type of Report & Period Covered Ph.D. Thesis 14. Supplementary Notes Abstracts A distributed data base system forms a very attractive solution to various of the ual data base problems in a computer network. Availability, response time, and •rational cost savings make multiple copy (distributed) data base systems not only ractive but economically appealing. The majority of the researchers involved in related work have adopted a decentral- :d inter-copy synchronization scheme which turns out to be very restricted. They ume that whenever a data user wants to perform an update he will send it to every uputer host which possesses a copy of the data base. Given that the order in which ; ates are applied does matter, some means of synchronizing the update application at hosts must be provided. 
It turns out that the available models which provide de- itralized synchronization can efficiently manipulate only a very restricted set of ]ate operations, mtinued on next page) •ey Words and Document Analysis. 17a. Descriptors :tributed Data Bases, File Allocation Distributed Networks, Computer Networks k demifiers/Open-Ended Te '^OSATI Field/Group Mailability Statement Release Unlimited ^TIS-38 (10-70) 19. Security Class (This Report) UNCLASSIFIED 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 179 22. Price I 3 it - USCOMM-DC 40329-P7 1 * it !•*■■■■ IBfJ s :: Hill (continuation of No. 16 from the preceding page) Three models for update synchronization are presented and extensively studied. The first is Johnson's model, which basically assumes that up- dates are handled in the decentralized fashion of our previous description. The second is Bunch's model, which introduces the concept of a centralized scheme. In this scheme, all updates are sent to a primary host which in turn broadcasts them to all backup hosts that hold a data base copy. Finally, we introduce the Reservation Center model developed by the author. The Reservation Center model combines various centralized and decentralized concepts. The major flaws of the models are discussed and extensions are pre- sented to cover them. The broadcasting model, an extension of Bunch's model, is presented as a prototype of a workable generalized distributed data base system. After our discussion, it seems likely that a major reorientation should be made in the related literature to cover the centralized synchronization schemes. We have done so with Casey's file allocation model, with minimal consequences to the general applicability of his work. During our study of the file allocation problem we have made some contributions in the form of a priori conditions for the inclusion or exclusion of a host in an optimal file allocation. These results are presented in an appendix. J'*r ftt \ vai I I ■A :i I :'■ 9 ■•'J « :: ;> ;: :» n .» > m :: (3 :: t I' •1 :: 13 i ■ \ :> i i : ■ JftN x 9 A97& «f£; UNIVERSITY OF ILLINOIS-URBANA 510 84 IL6R no COO? no 830 -835(1976 Implementation ol the language CLEOPATRA I 1 Hi i'i J ,; ■ ■ ' : &8 ! i '/' t ' i *» iW