CAC Document Number 170
JTSA Document Number 5512

Research in Network Data Management and Resource Sharing
Preliminary Experimental System Design Report

by
Steve R. Bunch
David C. Healy
Edwin J. McCauley

Prepared for the Joint Technical Support Activity of the Defense Communications Agency, Washington, D.C. under contract DCA100-75-C-0021

Center for Advanced Computation
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

August 20, 1975

Approved for release: Peter A. Alsberg, Principal Investigator

Table of Contents

Executive Summary
  Background and Overview
  Related Work
Areas of Experimentation
  Introduction
  Reliability Improvements Through Automated Backup
  Data Sharing and Concurrent Use
  Parallelism
  Load Balancing
  Conformal Data Structures
  Data Compression
  Clustering
  Integrity
Required Attributes of the Experimental System
  Introduction
  Transparency of Storage Structure and Data Location to Applications Programs
  Freedom of Structure Realizations
  Flexibility in Indexing of the Data Base
  Simple and Modular Data Model and System
  Exploitation of Parallelism
  Simple and Modular User Interface
Data Model and Machine Choice
  Data Model Selection
  Machine Selection
  Application to PWIN/WIN
Experimental System Design
  System Overview
  Network Data Manager
  Local Data Manager
  Network DBA Support Functions
Research Affecting the Experimental System
  Previously Described Topics
  Relational Operators Suitable for Network Use
  Views and Inverse Views
  Dynamic Optimization of Retrieval Strategies
Implementation Considerations
  Manpower
  Schedule

Executive Summary

Background and Overview

This document presents the preliminary design of an experimental distributed data management system. The purpose of the system is to provide a vehicle for research in the areas of network data management and resource sharing for application to the World-Wide Military Command and Control System (WWMCCS) Intercomputer Network (WIN). The research is supported by the Joint Technical Support Activity of the Defense Communications Agency.

The design and implementation of the experimental system will serve as the focal point for the more pragmatic research topics in distributed data management, e.g. synchronization. While some of these topics may be theoretically understood, their implementation in a practical system poses major research challenges. When implemented, the system will provide a controlled environment for experiments. Presently, no data is available for many important parameters describing the operational characteristics of distributed data management systems.
The experimental system must include a comprehensive and selective measure- ment capability so that these parameters may be ascertained. To be used in the investigation of a range of topics, the system must be designed to be flexible and modular. From a discussion of the topics to be investigated via the experimental system and their relevance to the WWMCCS community, some of the necessary features of the experimental system are derived. Broadly speaking, the system design must not impose any artifical constraints on the experiments to be performed. Nowhere is the requirement for flexibility more important than in the selection of the data model. The relational model captures the essence of other competing data models while permitting much greater freedom for experimentation. In addition, the relational model appears to be easier to implement and easier to explain to a user. The choice of the machine on which to implement the system is most strongly influenced by the ease of creating experimental software. Multics represents perhaps the best system in existence for such an effort. Many of the most difficult aspects of system implementation, such as memory limitations and file management, are simply not problems on Multics. Since there are three Multics systems connected to the ARPA network, it will be possible to perform multi-machine experiments. For the purposes of this document the system design has been broken down into three separate components: the network data manager, the local data manager and the network data base administrator support functions. The network data manager is responsible for coordinating the activities of the local data managers to answer a user query. The local data manager is responsible for retrieving, qualifying and sorting the data at his site and returning it to the network data manager. The network data base administrator support functions are the "staff" acti- vities of the system (as contrasted with the "line" activities of the network and local data managers) . To have an operational distributed data management system, the network data base administrator must be provided with a data collection and analysis capability. In the longer term many of the control loops which now include the data base admini- strator would be closed by having the system itself make some of the more routine decisions about restructuring, clustering, indexing, etc. It should be emphasized at the outset that the experimental system must incorporate the results of a variety of different research topics. Because many of these topics represent areas of continuing research, no attempt is made to present definitive solutions here. The major topics for continued research are briefly sketched in this docu- ment. For a more detailed presentation of the topics and their inter- action please see the Research Plan (CAC Doc. No. 164, JTSA Doc. No. 5510). It is expected that approximately 50 man-months will be devoted to the design and initial implementation of the experimental system over the next 13.3 months of the follow-on contract. However, difficulties or breakthroughs in the supporting research could sub- stantially alter these estimates. Related Work The project has produced ten other documents in the course of this research. Table 1 summarizes these other documents. 
CAC Doc.   JTSA Doc.
Number     Number      Title                                                        Date
149        5502        An Annotated Bibliography to Network Data Management
                       and Related Literature                                       April 1975
150        5503        A State-of-the-Art Report on Network Data Management
                       and Related Technology                                       April 1975
151        5504        WWMCCS System Summaries                                      April 1975
152        5505        Survey Report                                                April 1975
159        5506        Scenario Report                                              May 1975
160        5507        Application Summary                                          May 1975
161        5508        Technology Summary                                           May 1975
162        5509        Preliminary Research Study Report                            May 1975
164        5510        Research Plan                                                June 1975
169        5511        Initial Mathematical Model Report                            August 1975

Table 1
Supporting Documents

Although the reader would be well served by having some familiarity with these other documents, this familiarity will not be heavily relied upon here. The first two documents, the Annotated Bibliography and the State-of-the-Art Report, are relatively comprehensive treatments of the existing literature and on-going activity in the field. The reader interested in a more thorough treatment of the concepts presented here is referred to those documents.

Areas of Experimentation

Introduction

The WWMCCS Intercomputer Network (WIN) is a natural candidate for distributed data management. By providing a distributed data management capability, it appears possible to achieve gains in system responsiveness, throughput and reliability. Before proceeding with a WIN distributed data management system, many concepts must be evaluated and experiments must be performed to provide the technical base for the design of a production system. In this section we discuss the concepts to be evaluated and the experiments that may be performed with the experimental system. Many of the experimental areas addressed have impact on both WIN and conventional, single-site data management systems. Where appropriate, this impact on both multi-site and single-site problems is discussed.

Reliability Improvements Through Automated Backup

A distributed data management system can offer improved reliability by duplication of critical data bases at multiple hosts. When the primary copy is unavailable (host goes down, disc drive fails, etc.) user operations can be automatically re-directed to a backup copy. When the problem is remedied, the original primary copy can be reinstated. These transitions (primary-to-backup and backup-to-primary) must be carefully designed and analyzed, lest problems such as critical races, deadlocks, or lost updates occur. In the Preliminary Research Study Report (CAC Doc. No. 162, JTSA Doc. No. 5509) one possible scheme was presented in detail. The Scenario Report (CAC Doc. No. 159, JTSA Doc. No. 5506) presents a less detailed treatment of some other basic alternatives for reliability enhancement in a distributed data base. These are analyzed in more detail in the Initial Mathematical Model Report (CAC Doc. No. 169, JTSA Doc. No. 5511).

In a 35-node WIN, a host's going down is likely to be a frequent occurrence. Thus a reliability improvement scheme using automated backup and recovery is essential. End-users should not be concerned with the mechanical aspects of backup and recovery. They should not be burdened with the necessity to program around failures of the primary copy or even of backup copies. Extensive operator intervention is similarly undesirable. In a 35-node WIN, with only one failure per host per day, every 40 minutes or so a host will fail somewhere. If operators throughout the network must manually switch to backup copies of data bases, they may wind up doing little else but this switching.
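The essential idea of automatic redirection can be pictured with a small Python sketch. The sketch is illustrative only and is not part of the proposed design; the host names, the copy table, the is_host_up probe, and the submit routine are all invented for the example. It shows the data manager consulting an ordered list of copies and silently routing the user's operation to the first copy whose host is reachable, so neither the end-user nor an operator must intervene.

    # Illustrative sketch only; hosts, probe, and submit are hypothetical.
    COPIES = {
        # data base name -> copies in order of preference (primary first)
        "FORSTAT": ["host-a", "host-f", "host-k"],
    }

    def is_host_up(host):
        # Placeholder for a real liveness probe (e.g., a network echo).
        return host != "host-a"          # pretend the primary is down

    def submit(host, operation):
        # Stand-in for shipping the request across the network.
        return "executed %r at %s" % (operation, host)

    def route_operation(data_base, operation):
        """Send the operation to the first available copy of the data base."""
        for host in COPIES[data_base]:
            if is_host_up(host):
                return submit(host, operation)
        raise RuntimeError("no copy of %s is currently reachable" % data_base)

    print(route_operation("FORSTAT", "retrieve unit status"))
    # -> executed 'retrieve unit status' at host-f  (automatic fallback)

The real transitions are of course far more delicate than this sketch suggests, since updates in flight at the failed primary must not be lost or duplicated.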
Such reliability improvements are not without cost. For example, all updates must be performed once for each copy of the data base. This can add significantly to processing loads. Additional problems exist due to the necessity of synchronization to insure data base consistency. The experimental system should be capable of examining various backup and recovery strategies. It must also be capable of controlled simulation of a host failure, since the actual system can hardly be expected to fail at a convenient time. Data Sharing and Concurrent Use A distributed data base provides an inherent data sharing capability. Data maintained at one host can be easily accessible to suitably authorized users at other hosts. Sharing can be enhanced by a single user interface to the distributed data base system. The cumbersome aspects of TELNET-ing to another host, logging in there, and dealing with a foreign data base would then be eliminated. The WIN has a headstart in this area over research and commercial networks because of the standard systems used at each site and the commonality of interest and mission on the WIN. Obviously, data sharing can be enhanced through concurrent use of the data base. The problem is to guarantee that the various con- current users of a given data base do not interfere with one another. A sufficient (but not necessary) condition for non-interference is that only one user at a time may modify the data base. While this approach produces correct results, response time for queries is severely degraded as the number of updates increases. In the WIN environment many large data bases have small sections in which most of the activity is concen- trated. For example, in FORSTAT less than 5% of the data is changed per day and only about 5% of the data changes over a multi-day period. If it were necessary to lock the entire data base every update, intolerable response degradation would result. Thus, a refinement of sharing through concurrent use would be to break the data base up into smaller, more manageable portions, along the conceptual lines of the DBTG* AREA con- cept discussed later. Then many concurrent users could operate on some data base sections while other sections were being modified. However this technique has the problem of increased cost and complexity since more objects must be managed. Deadlock is also possible. * DBTG is the Data Base Task Group of CODASYL, The experimental system should be capable of providing facil- ities for concurrent use of the data base to enhance data sharing. This necessitates an effective synchronization mechanism. Research into synchronization techniques for distributed data management systems is a top priority item. Parallelism The multiple host processors in a distributed data base system offer the possibility of parallelism. Parallelism can greatly improve the responsiveness of the system. It is in this area that a distributed data base system offers the most startling contrasts with existing, single-site systems. The basic idea is to perform activities in par- allel on different hosts. For example, a commonly used component of many JOPS exercises is a large (300 Megabyte) airfield characteristics file. Currently it is a tedious and time-consuming task to process the score of tapes containing the information. If the file were suitably partitioned and located at different hosts, the searching could be accomplished in parallel in a fraction of the time it now takes. 
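To make the partitioned-file idea concrete, the following sketch (illustrative only; the partition contents, host names, and search predicate are invented, and ordinary threads stand in for remote hosts) searches several partitions in parallel and merges the hits.

    # Illustrative sketch: threads stand in for hosts holding file partitions.
    from concurrent.futures import ThreadPoolExecutor

    PARTITIONS = {                       # hypothetical partitioned airfield file
        "host-a": [{"airfield": "AAF1", "runway_ft": 9000},
                   {"airfield": "AAF2", "runway_ft": 4500}],
        "host-b": [{"airfield": "BAF1", "runway_ft": 11000}],
        "host-c": [{"airfield": "CAF1", "runway_ft": 7000},
                   {"airfield": "CAF2", "runway_ft": 12000}],
    }

    def search_partition(host, predicate):
        """Search one host's partition; in practice this runs at that host."""
        return [rec for rec in PARTITIONS[host] if predicate(rec)]

    def parallel_search(predicate):
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(search_partition, h, predicate)
                       for h in PARTITIONS]
            hits = []
            for f in futures:            # gather the partial results
                hits.extend(f.result())
            return hits

    # Needle-in-a-haystack query: long runways only.
    print(parallel_search(lambda rec: rec["runway_ft"] >= 10000))

The gather step is where network bandwidth enters: if nearly every record qualifies, the merge traffic, not the searching, dominates.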
It may also be possible to exploit the network's parallelism when processing a query which accesses several files. A different host could be chosen to process part of the query for each different file in the query. Then the separate files could possibly all be processed in parallel. The communication delay to ship large amounts of data across a network may well erase any advantages of parallel searching. Con- siderable study must go into the design of systems to take maximum advantage of the multiple hosts available. The obvious ideal situations are when large amounts of data must be searched for a few hits, i.e., needle in a haystack problems. Real world situations are not likely to be this ideal. In fact, we may not know in advance how many needles there are in the stack. If there are too many hits, the limiting factor becomes network bandwidth rather than host searching speed, and paral- lelism does not help much. The ability to study the effects of alternative strategies for exploiting network parallelism is a central concept in the experimental system design. It is likely to be non-trivial to implement even a simple-minded scheme. A user request must be analyzed for the possi- bility of parallel operation. The advisability of actually performing the operations in parallel must be assessed. Finally, a complex multi- machine task must be coordinated and controlled in a hostile environment where hosts and communication links may fail at the most inappropriate times . Load Balancing By attempting to even out the load imposed on the hosts, a distributed data base system can give far better equipment utilization for the network as a whole. Load balancing is related to parallelism. Both concepts utilize a computer network as a network of cooperating hosts, not simply a disjoint collection of hosts with cheap phone lines to the users. The eventual exploitation of the WIN resource sharing capabilities is a primary goal of the entire research effort. In the present WWMCCS environment, some sites are already incapable of handling their peak loads. A data base user at such a site may be given poor response or even denied service. A distributed data base system would allow the requests from such users to be handled at other sites if there were unused capacity. The user gets his job done faster, and excess capacity at under-utilized sites is better employed. Ideally, the distributed data management system would deter- mine a strategy for responding to a user's request in a minimum time. This would entail trading off the improved response of a remote host against the communication delays to send the request and receive the reply. An important area of research is the estimation of load-related factors like throughput and response time from observable indicators. Initial experiments on the ARPA network have shown that this estimation is a difficult task. Conformal Data Structures A distributed data base system should allow different data structures for different copies of the same data base located at dif- ferent hosts (i.e. conformal structures). Copies could be structured to be optimized for a particular class of operations. For example, con- sider a file used in different ways by several groups. If multiple copies of the file (or portions of the file) exist, each copy can be physically structured in a different way. The distributed data base system would decide which copy has the most appropriate physical struc- ture for each request received. 
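How the system might "decide which copy has the most appropriate physical structure" can be suggested by a small sketch. It is illustrative only: the catalog entries, the load figures, and the cost weights are invented, and a real cost function would come from the modeling effort.

    # Illustrative sketch only; catalog entries and cost weights are invented.
    CATALOG = [
        # copies of the same relation, each structured (and loaded) differently
        {"host": "host-a", "sorted_on": "UNIT", "indexed_on": [],         "load": 0.80},
        {"host": "host-b", "sorted_on": None,   "indexed_on": ["STATUS"], "load": 0.30},
        {"host": "host-c", "sorted_on": None,   "indexed_on": [],         "load": 0.10},
    ]

    def estimated_cost(copy, query_domain):
        """Crude guess: a structure matching the query is cheap,
        and a heavily loaded host is penalized."""
        if copy["sorted_on"] == query_domain or query_domain in copy["indexed_on"]:
            access = 1.0              # sort order or index avoids a full scan
        else:
            access = 10.0             # full scan of this copy
        return access * (1.0 + copy["load"])

    def choose_copy(query_domain):
        return min(CATALOG, key=lambda c: estimated_cost(c, query_domain))

    print(choose_copy("STATUS")["host"])   # -> host-b: its index fits the query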
The intent of multiple structures is to minimize the I/O and processing required to reply to a query. However, structures which are optimal for one collection of queries may be decidedly sub-optimal for a different class. As long as each of the copies continues to represent the same information, they remain suitable as backups, so reliability enhancement is still obtained. The question of how each copy should be structured is, of course, of major interest to the network data base administrator. If the secondary, derived data structures such as indices and saved hit files may also vary from site to site, additional dimensions are added to the options available to the data base administrator. 10 Data Compression Compressing stored data is a technique to reduce the I/O required to service a query. The goal is to reduce the redundant or recurring information in the data base. For example, consider a social security number (SSN). If the SSN were stored as characters, at least nine and possibly eleven (there are two dashes) bytes would be required. If however the SSN's are stored as integers only 35 bits are required, a savings of nearly 3 to 1. Data compression can be applied in other ways. It is often possible to drastically reduce the amounts of storage used by compressing the data using statistical techniques. Essentially, we establish a compact encoding for the data to be compressed. In such coding schemes the most frequently occurring data values are represented by only a few bits. Statistical encoding schemes do suffer from three problems. First, values are encoded with variable length bit strings, potentially increasing the complexity of manipulating the data. Second, an auxiliary data structure is needed to decompress the data. Third, the natural ordering of the data may be lost. If data can be maintained and manipulated in its compressed form, additional benefits accrue. In such a scheme, the user's query would be compressed also. All comparisons will be made between the compressed values, speeding up the comparisons. The data would be shipped over the network in its compressed form, which would increase the effective bandwidth of the network. By manipulating the data in its compressed form, overall effective throughput can be increased by a factor approaching the compression ratio. Data compression is a well established technology, and we are not proposing to advance the state of the art in it. The experimental 11 system must be designed to allow compression techniques to be employed throughout. We need to be able to run controlled experiments on the impact and effects of compression. It is also important that the tech- niques for and impact of data compression be disseminated to the WWMCCS community . Clustering Clustering refers to the technique of reducing I/O require- ments by making the physical structure of the data base conform to user access patterns. For example, we could store all the data which satis- fied a commonly occurring query in consecutive physical blocks. In many existing data base systems only a small percentage of the data brought in from secondary storage actually satisfies the user request. This increases the number of I/O requests needed to satisfy the query, degrading response time. The experimental system must be able to support a variety of clustering schemes. In particular it must allow a comprehensive internal data collection capability so that new clus- tering techniques may be developed, implemented, and analyzed. 
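The clustering effect described above can be quantified with a small sketch. It is illustrative only; the block size and the records are invented. It counts the physical blocks that must be read to answer a common query before and after the qualifying tuples are clustered together.

    # Illustrative sketch only; the block size and records are invented.
    BLOCK_SIZE = 4                        # tuples per physical block

    def blocks_touched(records, predicate):
        """Count the blocks that must be read to find all qualifying tuples."""
        touched = set()
        for position, rec in enumerate(records):
            if predicate(rec):
                touched.add(position // BLOCK_SIZE)
        return len(touched)

    records = [{"unit": i, "region": "PACIFIC" if i % 5 == 0 else "OTHER"}
               for i in range(40)]
    wants_pacific = lambda r: r["region"] == "PACIFIC"

    unclustered = blocks_touched(records, wants_pacific)
    clustered = blocks_touched(
        sorted(records, key=lambda r: r["region"] != "PACIFIC"),
        wants_pacific)
    print(unclustered, clustered)         # 8 blocks before, 2 after clustering

Counts of this kind are exactly the internal statistics the experimental system must be able to collect.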
Integrity In a distributed data management system a single mechanism is responsible for the maintenance of the data base, although that mecha- nism may be implemented as multiple processes running on multiple hosts. All maintenance activity is coordinated in the same way. In contem- porary environments data is captured in different ways, is subjected to varying degrees of scrutiny and authentication, and is inserted into disparate data management systems. It is not surprising that severe integrity (consistency) problems exist when data from several different systems is merged. A distributed data management system can help solve 12 some of the integrity problems outlined above. However, a distributed system does introduce new problems. Because copies of data bases are being maintained at multiple sites, coordination of updates is essential. The major research need in this area is to provide effective synchronization mechanisms. The experimental system should provide a framework for further research in integrity. As a longer term issue, the system can be used to investigate those aspects of integrity which are more closely related to the user's view of the world, i.e. the data capture and validation problems . 13 Required Attributes of the Experimental System Introduction Many system features are considered necessary to provide a flexible experimental tool. Some of the most important ones and a brief discussion of factors motivating them are presented here. Transparency of Storage Structure and Data Location to Application Programs Description . A user application program must not contain implicit or explicit assumptions about the structure of the underlying data manage- ment system (DMS). All interfaces between the DMS and programs must take take place via well-defined, appropriately-chosen primitives which do not make such assumptions necessary. Motivation . Since the system is experimental in nature, it is expected that major changes in approach will be made frequently. The changes will occur both during preliminary design and during concept testing and experimentation. Because of the expense of designing and constructing suitable test programs and the desirability of uniform benchmarking during experimentation, it is highly desirable to be able to use a single set of test programs, essentially without modification, for all testing. Also, since the DMS and test programs are essentially independent, their design and construction can proceed independently without unnecessary delay caused by design changes. Another very important advantage to be gained by the independence of programs and the DMS is the possibility of implementing versions of the DMS on different host systems, each one being optimized internally for operation on its respective host, but presenting a uniform appearance to application programs. 14 Freedom of Structure Realizations Description . A rigid structure must not be imposed on the data model for artificial reasons, e.g. to fit some computer manu- facturer's de facto standard. Within the data model chosen, all pos- sible flexibility will be retained to allow unanticipated changes to be made as easily as possible. Motivation . The ideal structures to represent a data base are not obvious. Tradeoffs must be made between factors such as overhead for expressing structure, structure creation cost, updating cost, and retrieval time. 
If some of these factors are rigidly controlled by the data model or the cost of changing the structure of the experimental data bases, fruitful research paths may be bypassed. Flexibility in Indexing of the Data Base Description . When indexing of a data base on some variable is deemed necessary or desirable, independent indexing structures can be generated. These indexing structures need not be defined when the data base is created, and are visible to the user only by their effects on query response time. Their presence is solely an efficiency considera- tion, and is not required for correct operation. Motivation . When executing query operations, an index struc- ture can often reduce or eliminate the need for searching a data base. However, a good choice of variables to index depends heavily on the query patterns, and may in fact change on a short-term basis. If the index structure is part of the data base, a restructuring of the entire data base is usually necessary to change it. However, if the index structure is separate, it can be created and destroyed without affecting the data base. This capability opens new areas for experimentation, 15 particularly in the area of resource utilization vs. response time tradeoff optimization. Simple and Modular Data Model and System Description . The data model must be conceptually simple and easy to experiment with. The DMS itself must be easily understood and easily modified. Motivation . Because of the experimental support role the system is intended to play, the majority of time spent on the system should be in areas which actively contribute to the research goals. Time spent understanding complex aspects of the system, unless they bear directly on some experimental goal, can be considered wasted. Such wasted time must be minimized. Exploitation of Parallelism Description . Often, many of the operations required to per- form a query can logically be performed simultaneously. The data model used must not limit or restrict the use of parallel evaluation more than absolutely necessary for integrity/synchronization purposes. Motivation . In a network environment, the opportunities for optimization of query response by the exploitation of implicit paral- lelism are greatly increased. The unique possibilities opened by the availability of several computers, each with possibly unique capabil- ities, must not be prematurely or unnecessarily restricted. In order to experiment in this area, a non-restrictive theoretical model must be used. Restrictions can be imposed upon such a structure without dif- ficulty, whereas if a more restrictive theoretical base were chosen, some experiments would be impossible or too contrived to give useful results. 16 Simple and Modular User Interface Description . The system, as viewed by the user, must be simple enough for new users to learn with very little instruction and practice. Complex and/or non-intuitive features must be avoided as much as possible. The interface must be of modular design so it can easily be replaced by a different one. Motivation . Since the system is not intended for production use, only those features necessary to carry out the desired experiments need be implemented. More importantly, the users of the system will generally be concerned primarily with the operations the system performs in getting results and not so much with the results themselves. In this context, learning effort and complex command entry represent wasted research time and hence should be minimized. 
Additionally, if end-user comment, such as from potential WIN users, is desired on some point, the familiarization time for a novice user must be reasonably short to minimize wasted personnel time. In the event that such comment suggests major changes to the user interface or suggests more extensive user testing, the interface must be easily modified or replaced to increase its usability.

Data Model and Machine Choice

Data Model Selection

Summary. Three data models are considered: hierarchical, plex (network) and relational. The major differences among the models are the kinds of structures which the user must define and utilize in addition to the actual data. The relational data model has been selected for the experimental system because of its greater flexibility and conceptual simplicity. An analysis and brief tutorial follows for each model.

Hierarchical model. The hierarchical model allows only simple, tree-like structures (Figure 1).

[Figure 1. Hierarchical Structure: two department nodes (DEPT. 1 and DEPT. 2), each owning EMPLOYEE nodes.]

An important feature of this data model is that information is contained in the structure as well as in the data stored in that structure. A part of the tree cannot be separated and used independently without also transferring information about where that part was in the original tree. For example, in Figure 1, the information about the department where an employee works is contained in the structure of the hierarchy, i.e. the "owner" department for an "owned" employee. This data model has the chief advantage of being closely analogous to the physical storage representation of the data base. This can lead to improved processing efficiency. However, this feature can also be a major disadvantage. Suppose it is desired to access the data base in a different way, such as finding out whether any employees work in two departments. Because the department-to-employee relationship is manifest in the structure of the tree, it is not easy to extract the information in a different form, i.e. employee-to-department. Considerable work has gone into the translation of data bases, in order to be able to produce a new data base with, for example, an employee-to-department structure.

In a distributed environment, additional disadvantages are associated with the hierarchical model. Due to structural considerations, the "units of distribution" are complete subtrees. That is, it is not reasonable to distribute an arbitrary collection of data segments. If this were done with no restrictions, it might become necessary to "bounce" between two hosts in traversing the data base. The problems of information content in the structure continue to be bothersome. Experiments in restructuring and maintaining several copies of the same data in different physical structures would be difficult. Finally, it is potentially quite difficult to exploit the parallelism of a network. Going into a hierarchical structure is a sequential process because the path taken conveys information. Thus, a hierarchical structure would not appear to be well-suited for the experimental system.

Plex Model. The plex model is an extension of the hierarchical model which allows a much wider variety of linkages between the data segments. Thus, hierarchies may be created as special cases. The CODASYL Data Base Task Group (DBTG) has presented one of the more complete plex models. Their model allows this additional structure to be expressed as SET membership.
Such membership is manifest, conceptually at least, as chains of data segments (which DBTG refers to as record occurrences ) . Each record occurrence may be linked to a predecessor, a successor and to a so-called set-owner record. The other concept from the DBTG work which we shall discuss is that of an AREA. Basically, an AREA is a mechanism by which the data base administrator (DBA) can partition the data base into more manageable portions. The DBA can control the placement of the AREA on physical storage. The AREA may also be employed as a logical construct to be used for clustering, such as the grouping together of records to be used by a procedure. The CODASYL DBTG schema (data base structure definition) may be restricted so as to minimize linkages from one area into another. A plex system would be potentially an attractive data model were it not for the lack of flexibility explicit structure definitions impose. Because the user is aware of these structures, they cannot be changed at different copies of, say, an AREA. Each copy must preserve the logical structure defined in the schema. Another practical limiting factor is the fundamental complexity of plex structures and their defini- tion. Considerable manpower would have to be invested in schema analyzers 20 etc. This manpower could be better utilized investigating questions which are truly distributed data base issues. Finally, there is the question of performing controllable experiments. In a plex model, the structure may induce inherent inefficiencies. For example, some queries simply cannot be answered efficiently due to the structure of the data base. It would be difficult to separate out the effects of the data base structure in an experiment done with such a system. Relational model . The relational model was proposed as an attempt to improve the theoretical foundations of data base management and to provide a very simple user structure. In the short time since its initial suggestion, the relational model has generated tremendous interest within the academic and research communities. We have selected a relational data model because of its simplicity and flexibility. A relation, R, may be thought of as a table whose columns are labeled. Since the columns (which are called domains ) are always referred to by name, their order is unimportant. The number of domains is called the degree of R. The rows of the table (which are called tuples ) have the following properties: 1. each tuple is distinct, 2. the ordering of the tuples is insignificant, and 3. the domain values in the tuple are atomic, e.g. character strings and integers but not structures or repeating groups. Why has such a simple concept generated such interest? First, because the model is so simple, even the most computer-naive user can understand it. Second, because the relational model has no user-supplied structure, the system designer and data base administrator are free to provide any structure they wish "underneath" the user interface. The DBA is free to 21 index, sort, or cluster a relation in any way he wishes. Further, such structures may be changed at will, with the only impact being on system response times. The user interface is fundamentally an associative view of the data. He supplies criteria and the system gives back the tuples meeting those criteria. This is a good place to define the operators for the relational model, even though such definitions do not impact upon the selection of a data model. 
These definitions will be very brief, being sufficient only to understand the examples and terminology used throughout the sequel. Previous work on the relational model has led to the definition of two different sets of operators: the relational calculus and the relational algebra. In the calculus the user gives a first-order predicate calculus description of what he wants his output tuples to be like. In the algebra the user describes explicitly the operations necessary to achieve a desired output relation. We have chosen an algebra context for three major reasons:

1. The algebra deals with relations as objects rather than tuples. It is thus at a somewhat higher conceptual level.
2. The algebra is a close approximation to the way in which the system would actually be implemented.
3. The calculus requires additional transformations to determine what operations need to be performed to satisfy a query; the algebra is more explicit.

Since it can be shown that the two languages are comparable in their expressiveness, there is no loss of generality by our choice of the algebra for examples.

Only three operators will be used in subsequent examples: projection, restriction and join. Each takes one or more relations as operands and produces a relation as output. To aid in the definition, examples using the following relations will be presented.

SUPPLIER:  SPLR.NAME  SPLR.#  STATUS  CITY
           Ajax       10      9       Tuscola
           Acme       20      5       Arcola
           Widget     30      6       Mattoon
           Bomad      40      8       Urbana

PARTS:     PART.#  PART.NAME  COLOR   WEIGHT
           1234    widget     blue    100
           1235    widget     green   100
           1236    bolt       blue    1
           1237    bolt       green   1
           1238    nut        purple  1
           1239    nut        blue    1

SP:        SPLR.#  PART.#
           10      1234
           10      1235
           10      1238
           20      1239
           20      1237
           30      1234
           30      1235
           40      1236
           40      1238

Projection takes as input a relation and an ordered domain list and produces a relation containing only the domains specified in the domain list. For example, to produce a list of the suppliers and their locations we write:

SUPPLIER [SPLR.NAME, CITY]

which produces the following relation:

SPLR.NAME  CITY
Ajax       Tuscola
Acme       Arcola
Widget     Mattoon
Bomad      Urbana

The domain list is enclosed in square brackets immediately after the relation. Projection may also be used to simply re-order the domains of a relation by including all of them in the projection domain list.

Restriction takes as inputs a relation and a Boolean expression of the domains of that relation. It produces a relation with the same domains as the original input relation. The tuples of the output relation are exactly the tuples of the input relation which satisfy the input Boolean expression. Strictly speaking, restriction is redundant, since its effects can be achieved through combinations of the other operators. However, it represents an easily understood and concise way of stating a commonly used operation. To find the suppliers whose status is greater than 6 we would write:

(SUPPLIER where STATUS > 6)

which would produce the following relation:

SPLR.NAME  SPLR.#  STATUS  CITY
Ajax       10      9       Tuscola
Bomad      40      8       Urbana

Restriction may be combined with projection, both by projection on the input relation or on the output. For example, to find the supplier names and status of suppliers whose status is greater than 6 we write:

(SUPPLIER where STATUS > 6) [SPLR.NAME, STATUS]

or

(SUPPLIER [SPLR.NAME, STATUS] where STATUS > 6)

Either one produces the following output relation:

SPLR.NAME  STATUS
Ajax       9
Bomad      8

As was shown in this last example, many alternative statements exist to do the same thing.
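The two operators just defined can be made concrete with a small Python sketch. The sketch is purely illustrative and is not part of the proposed design; it represents a relation as a list of dictionaries, which is simply an assumption of the example.

    # Illustrative sketch of projection and restriction; a relation is modeled
    # here as a list of dictionaries (an assumption of this example).
    SUPPLIER = [
        {"SPLR.NAME": "Ajax",   "SPLR.#": 10, "STATUS": 9, "CITY": "Tuscola"},
        {"SPLR.NAME": "Acme",   "SPLR.#": 20, "STATUS": 5, "CITY": "Arcola"},
        {"SPLR.NAME": "Widget", "SPLR.#": 30, "STATUS": 6, "CITY": "Mattoon"},
        {"SPLR.NAME": "Bomad",  "SPLR.#": 40, "STATUS": 8, "CITY": "Urbana"},
    ]

    def project(relation, domain_list):
        """SUPPLIER [SPLR.NAME, CITY] -- keep only the named domains,
        removing any duplicate tuples that result."""
        seen, result = set(), []
        for tup in relation:
            reduced = tuple((d, tup[d]) for d in domain_list)
            if reduced not in seen:
                seen.add(reduced)
                result.append(dict(reduced))
        return result

    def restrict(relation, predicate):
        """(SUPPLIER where STATUS > 6) -- keep the tuples satisfying the predicate."""
        return [tup for tup in relation if predicate(tup)]

    print(project(SUPPLIER, ["SPLR.NAME", "CITY"]))
    print(project(restrict(SUPPLIER, lambda t: t["STATUS"] > 6),
                  ["SPLR.NAME", "STATUS"]))

The last line reproduces the combined restriction-and-projection example above.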
The existence of alternative but equivalent formulations allows a variety of optimization techniques to be applied, as will be discussed later.

The simplest type of join takes as input two relations R1 and R2 that have a common domain D. It produces a new relation whose tuples are formed by concatenating a tuple from R1 to one from R2 with the same value for D. For notational convenience we will assume that D is the last domain of R1 and the first domain of R2. To find the supplier numbers of suppliers who supply green parts we write:

((PARTS where COLOR = green) [PART.#] * SP [PART.#, SPLR.#]) [SPLR.#]

The join is denoted by the *. This could also have been written:

((PARTS [PART.#, COLOR] * {green}) [PART.#] * SP [PART.#, SPLR.#]) [SPLR.#]

Thus, join may be substituted for certain cases of restrict. Both these examples produce this output relation:

SPLR.#
10
20
30

The following example may help to show what join does. We will go through join on a tuple by tuple basis.

SP [SPLR.#, PART.#] * PARTS [PART.#, PART.NAME, COLOR, WEIGHT]

Taking the first tuple from SP, <10, 1234>, we find the tuple in PARTS that has the same value for PART.#, <1234, widget, blue, 100>. The concatenated tuple <10, 1234, widget, blue, 100> is now output; it is in the output join relation. This process is repeated for each tuple in SP. Of particular interest is the SP tuple <30, 1234>. Semantically this means that both supplier 10 (Ajax) and supplier 30 (Widget) supply part number 1234, the blue, 100 pound widget. Thus, both the following tuples are in the join: <10, 1234, widget, blue, 100> and <30, 1234, widget, blue, 100>. This example eventually produces the following output relation:

SPLR.#  PART.#  PART.NAME  COLOR   WEIGHT
10      1234    widget     blue    100
10      1235    widget     green   100
10      1238    nut        purple  1
20      1239    nut        blue    1
20      1237    bolt       green   1
30      1234    widget     blue    100
30      1235    widget     green   100
40      1236    bolt       blue    1
40      1238    nut        purple  1

The structure of the three relations in this example results from a process called normalization. Discussion of normalization is beyond the scope of this document. However, its practical significance is that the large conceptual relation that describes parts and their suppliers must be broken down into SUPPLIER, PARTS and SP. This breakup avoids certain types of problems associated with updates and deletions. Unfortunately, this breaking up of the conceptual relation means that the user must remember much more obscuring detail. For example, to find the name of the supplier of some part, two joins are necessary. In an attempt to alleviate the problem the notion of a view has been introduced.

A view is the relation that the user deals with. A view is made from operations applied to the basic relations of the system. For example, suppose we have a user who only works with widgets. There is no need for him to be aware of the existence of other types of parts. Further suppose that all he does is write orders to the widget suppliers, Ajax and Widget. Our user has no need for the supplier number and should not be burdened with projections to omit it. This user would have a view defined by the following expression:

VIEW1 = (SUPPLIER * SP * (PARTS where PART.NAME = widget)) [SPLR.NAME, STATUS, CITY, PART.#, PART.NAME, COLOR, WEIGHT]

(The projections needed for the joins are omitted for clarity.) To this user the data base looks like the single relation VIEW1:

VIEW1:  SPLR.NAME  STATUS  CITY     PART.#  PART.NAME  COLOR  WEIGHT
        Ajax       9       Tuscola  1234    widget     blue   100
        Widget     6       Mattoon  1234    widget     blue   100
        Ajax       9       Tuscola  1235    widget     green  100
        Widget     6       Mattoon  1235    widget     green  100

All of his queries may be directed towards VIEW1, rather than the underlying relations.

In a distributed DMS a relation may actually be made up from pieces stored at several network hosts. We shall call this concept inverse views. An inverse view would be useful if most of the use of the inverse view is local, but occasionally the entire relation is needed. For example, the FORSTAT data bases now form an inverse view. Each command operates on a portion of a large conceptual data base. This conceptual data base is the FORSTAT for every unit in every command.

A distributed data base requires a great deal of flexibility from the underlying data model. Duplicate copies of a portion of the data base should be free to be organized in any way. This will allow for experimentation with the effects of a heterogeneous data base even when done on a homogeneous network. A given relation may be replicated at different hosts. If each copy has a structure which is optimized for particular types of retrieval, the distributed DMS will be free to select the most suitable copy. A user request which requires access to multiple relations may be able to exploit the parallelism of the network by utilizing copies of relations which are stored on different hosts. Finally, by choosing a simple model like the relational model, controlled experiments can be performed. These experiments will not be clouded by the effects of the data structure.

Machine Selection

Summary. Multics at MIT will be used as the primary software development site. Multi-site experiments can be supported using additional Multics sites on the ARPA Network. The decision to use the ARPA Network and Multics is based primarily on the availability of those resources, significant local expertise and the clearly superior support Multics offers to the software developer.

Criteria. The purpose of the experimental distributed DMS is to support experimentation. Consequently, the target computer system selection must be based on the effectiveness of the target system as a program development tool. This factor outweighs conventional criteria such as cost and locality, since programmer time is limited and must be used effectively. Other factors must be considered, including:

1. cost,
2. accessibility,
3. short-term reliability and availability,
4. availability for the duration of the research period,
5. response time for interactive use,
6. local expertise,
7. possibility of transporting the distributed DMS to other systems,
8. possibility of use of pieces of an existing DMS,
9. availability of suitable language(s) on the system,
10. availability of suitable network access, and
11. availability of at least two systems on the network for testing purposes.

A detailed comparison of the available systems would be expensive and would probably be of little value, since weightings for the various factors would be largely subjective and hence could be chosen to favor any desired system. A subjective choice, provided it is reasonable, might as well be made and identified as such. The designers of any system draw on their own past experience when making otherwise non-obvious choices. Some conclusions based on past experiences of this design team will be described here so the reader can better understand the background of choices described later.
The operating system support must be extensive. The con- struction of large, complex programs almost invariably uncovers obscure operating system problems. These problems may be circumventable, but often are not. Most manufacturer-provided operating systems or contractor-written support software cannot be fixed in a matter of months, much less the hours that users would prefer. The physical operation of the facility must be professional and provide adequate services. When a user who is using a system from a 29 terminal requests that an operator mount an offline storage volume (e.g., tape or disk), every second of delay costs the user time, money, and inconvenience. Some systems require literally hours, for example, to retrieve a file from an archive tape. This is unacceptable. Opera- tor mistakes are also unacceptable. Proper operator interface design and good operator training can essentially eliminate them. Operating hours must be reasonable. Programmers should not need to be overly concerned with pro- gram size and file storage space. While programmers should clearly be concerned with storage efficiency and wasted file storage space, these must not be allowed to become major concerns. The authors know of cases where man-months have been spent setting up overlay structures for large programs to pare as little as 30% from the memory requirement. Similar anecdotes probably exist for every non-virtual-memory machine in exis- tence. This group considers a non-limiting memory size an absolute necessity to avoid totally fruitless, but time-consuming effort. Similarly, file-storage-limited machines can cause tremendous amounts of effort to be wasted in shuffling files on and off the system. In one case in the authors' experience, over 20% of all programmer time spent over a several-month period by a team of five programmers was spent in file maintenance due to problems caused by limited file space. Appropriate high-level programming languages and debugging and measurement tools must be provided. The disadvantages of assembly languages are well-documented in the literature. The originally per- ceived advantages, namely reduced size and speed of the resulting code, have been consistently disproven for moderate and large programs. For debugging purposes, an interactive symbolic debugging tool is essential. 30 Learning a machine in sufficient detail to use memory dumps is costly and unnecessary. Alternatives . Many alternatives are available for the choice of a host system for the distributed DMS. The availability of the ARPA network provides a variety of machines and operating systems to choose from. Other possibilities are local facilities at the University of Illinois and GCOS time obtained via the PWIN. Local facilities at the University of Illinois are not networked and thus cannot support net- working experiments. Use of the PWIN will require unclassified opera- tion. The amount of time available for this mode of operation is very limited and would severely slow the research program. The ARPA net- work is available and adequate to the research needs. The CAC has over four years of experience using virtually all of the hosts on the ARPA network. Furthermore, we already make exten- sive use of the three major service hosts: UCLA's 360/91, BBN's TENEX and MIT's Multics. (A fourth host on which we have considerable experi- ence, UCSD's B6700, left the net this summer.) All three sites provide a production-oriented site operation adequate for our purposes. 
Of the three, Multics is the only one which meets all other criteria. UCLA does not provide a virtual memory system and TENEX does not provide an appropriate high-level language. Furthermore, costs at Multics are favorable and our past experience has shown that the software develop- ment tools available in Multics are advanced well beyond its nearest competitor. There are currently three Multics machines on the ARPA network. The two Multics sites other than MIT do not have an appro- priate production orientation for continued software dvelopment, but they are adequate for multi-site experiments. In view of the commanding 31 advantages of Multics for this particular research activity, it is clearly the machine of choice. Application to PWIN/WIN This section discusses how the choice of a relational data model to be implemented on Multics relates to the PWIN/WIN environment. The technical grounds for these choices are clear. The relational model captures the essence of the other competing data models while allowing unparalleled freedom in experimentation. Multics represents perhaps the best system in existence for support of experimental software development . Obviously, it would be an error to use a data model which, although elegant, could not be related to the realities of the PWIN/WIN environment. The transfer to the WWMCCS environment of results obtained from a system based upon a relational model may not be obvious. The relational model, in some ways, represents a "least common denominator" for data base systems because all the data associated with an entity is explicitly specified. A tuple is an atomic unit with no need for any context. Thus, controlled experiments can be performed on the effects of different organizations. These experiments can lead to better understanding of the general principles underlying any data management system. The Data Translation Project at the University of Michigan has demonstrated that typical WWMCCS data bases may be translated into and out of relational form. Indeed, a central concept in their data trans- lation system is the use of a relational form as a common intermediate structure when translating between two dissimilar data base structures. Thus, the experimental system could use a typical or representative 32 WWMCCS data base translated into relational form. Experiments performed upon the relational form might suggest reorganizations of the original data base to improve efficiency. In a distributed environment, we plan, as a longer range project, to investigate the impact of having different structures for the copies of a data base. In particular, we want to assess the effect of limitations on the structure resulting from restric- tions on the schema to sub-schema mapping. This knowledge can be directly applied in the design of functional specifications for an operational distributed data base system. The choice of Multics is also relevant to the WWMCCS com- munity. Multics hardware is quite similar to the WWMCCS standard hardware, particularly the peripherals. It may be possible to infer timings on GCOS machines from experiments performed on Multics. 33 Experimental System Design System Overview This section discusses the design of the experimental distri- buted data management system. Because this is a first cut at the design, we have tried to focus mainly upon the underlying issues. 
For the purposes of this document the system has been broken down into three components: the network data manager, the local data manager, and the network data base administrator support functions. In the actual imple- mentation of the system, it is likely that the distinction between these components will become blurred. To illustrate the relationship among the three components, we will begin by sketching the actions which occur in satisfying a query. In the rest of this section the three system components will be dis- cussed in more detail. The processing of the user query starts by parsing it into some internal form. This command language processing is best included in the discussions on the network data manager. The query (now in an internal form) then passes through two major stages. 1. Global optimization . The query is examined to determine the best way to satisfy it. This process may involve a trans- formation of the query into a more efficient, but equivalent form (e.g., change the order in which the operators are evaluated) . Included in the optimization is the selection of hosts on which to perform the required operations. 2. Query evaluation . This may be further broken down into two steps: setup and execution. In the setup phase, the processes which will perform the required operations at the various 34 hosts are initialized. These include local data managers which will actually retrieve data from stored relations, and the service functions of other network data managers. The network data manager service functions are the modules which perform operations like join on the outputs of local data managers. In the execution phase, the originating network data manager issues commands to the local and network data managers working on the query. As tuples which satisfy the query are processed, they are sent to the required destination, Occasionally a host may fail. The network data managers on surviving hosts must take the actions needed to switch over to backup copies of data bases. If, as is likely, the failed host was in the middle of processing queries, several possibilities exist. The network data managers who originated the queries may be able to salvage the partial results produced before the host failed. A backup copy can then be used to complete the query. Alternatively, the originating network data manager may decide to discard the intermediate results and restart the query at the optimization step. This new optimization would, of course, be performed based on the network configuration after the host failure. Throughout the system, statistics are being collected by the various network and local data managers for use by the network data base administrator. The network DBA may elect to move relations, add or delete indices, etc. based upon these statistics. Network Data Manager Overview . The functions of the network data manager are: 1. to accept user requests and process the commands into an internal form, 35 2. to perform a global optimization on the user's query by re- ordering operations and determining where (at which hosts) the operations needed to satisfy a request will occur, 3. to initiate and monitor these operations, and 4. to perform relational operations in response to requests from other hosts. The last function is not, strictly speaking, part of the network data manager, but this is the natural place to include the discussion. In the example which immediately preceded this section, a typical query was followed through these various functions. 
In this section each function will be discussed in somewhat more detail. The emphasis throughout this discussion on the experimental system design has been on the most basic considerations. It is somewhat premature to offer a detailed design of the system.

Command language processing. The user language must be as simple as possible while still providing the necessary power of expression. This is in keeping with the philosophy of simplicity described earlier. The language will be a command and query language with interpretive execution. In order to execute a query, the form of the query must be checked to verify its syntactic and semantic correctness, and then transformed into a form suitable for computer interpretation. This process is called parsing and translation, and resembles the first steps performed by a high-level-language compiler.

The computer-usable form of the query, i.e., the query after parsing and translation, will be an n-ary tree (probably binary). A tree structure has the advantage of clearly showing the flow of data necessary to satisfy a request. Since the limited network bandwidth can be expected to be an important consideration in optimizing the retrieval strategy, a tree is a good internal form. A typical tree is shown in Figure 2. The nodes labeled REL are basic relations. A basic relation is simply a relation that is stored on some storage medium or available via a network. The nodes labeled OP are operators. An operator performs some operation (e.g., a Boolean operation, a selection process, a sort, etc.) on one or more relations. The operators need not know whether the relations they operate upon are primitive relations or the results from previously applied (lower in the tree) operators.

[Figure 2. Query expressed as a tree.]

Global optimizer. It is often possible to greatly reduce the effort required to fulfill a query by rearranging the order in which operations are performed. For example, consider the tree in Figure 3. This tree might represent the query (in English): "Show me the amount of income tax paid by all managers making over $30,000.00 per year." If REL1 relates employee number to income and job position, and REL2 relates employee number to income tax paid, the join operator in Figure 3 produces a composite relation which relates employee number to income, job position, and income tax. The restriction operator then removes those employees who are not managers making over $30,000/year, leaving the result desired. In Figure 4 the restriction has been moved. In this tree, the restriction produces only those employees who are managers making over $30,000. The join then produces the desired result. The restriction operator was probably only a little cheaper to perform in this case because it was handling relations with fewer fields (domains). However, the join operation only had to process those employees from REL1 which met the criteria, which will usually be a much smaller number than the total employee relation. Hence, the latter tree will probably be at least as efficient as the first and usually more efficient unless all employees are managers making over $30,000.

[Figure 3. Query tree before optimization.]
[Figure 4. Query tree after optimization.]

The above example demonstrates one small aspect of high-level query optimization.
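The rearrangement shown in Figures 3 and 4 can be demonstrated with a small Python sketch. It is illustrative only; the sample relations and the helper functions are invented for the example and are not part of the proposed design.

    # Illustrative sketch of restriction push-down (Figures 3 and 4);
    # the sample data and helper functions are invented.
    EMPLOYEES = [  # REL1: employee number, income, job position
        {"EMP#": 1, "INCOME": 45000, "POSITION": "manager"},
        {"EMP#": 2, "INCOME": 25000, "POSITION": "manager"},
        {"EMP#": 3, "INCOME": 60000, "POSITION": "clerk"},
    ]
    TAXES = [      # REL2: employee number, income tax paid
        {"EMP#": 1, "TAX": 9000}, {"EMP#": 2, "TAX": 3000}, {"EMP#": 3, "TAX": 15000},
    ]

    def join(left, right, domain):
        return [dict(x, **y) for x in left for y in right if x[domain] == y[domain]]

    def restrict(relation, predicate):
        return [t for t in relation if predicate(t)]

    qualifies = lambda t: t["POSITION"] == "manager" and t["INCOME"] > 30000

    # Figure 3: join first, then restrict -- the join touches every employee.
    plan_a = restrict(join(EMPLOYEES, TAXES, "EMP#"), qualifies)

    # Figure 4: restrict first, then join -- the join sees only qualifying tuples.
    plan_b = join(restrict(EMPLOYEES, qualifies), TAXES, "EMP#")

    assert plan_a == plan_b           # same answer, usually far less work
    print(plan_b)                     # the lone highly paid manager, with tax

Both plans give the same relation; the second simply presents far fewer tuples to the join, which is the whole point of the rearrangement.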
There are several classes of optimization that can be used, and optimization can be performed with many different objectives, e.g., minimizing the cost, time, or resource utilization (such as temporary storage) needed to fulfill the query. A few of the techniques for optimization that can be used follow.

1. Query tuning. This generally refers to using knowledge or guesses about the nature of the data base to rearrange the order of operations. In the above example, if it were known that all managers made under $30,000, the query could be fulfilled (vacuously) by the optimizer.

2. Rearrangement of operators. The example above is a simple case of operator rearrangement. Frequently, the improvement is much more dramatic than in this simple example.

3. Optimizing according to availability of indices. By rearranging the tree, it is possible that evaluation can take advantage of sort order or indices on one or more relations. Either can be used to eliminate the need for searching an entire relation.

4. Optimizing by locations of relations. This is discussed in the next section. As a quick example, it might be possible to minimize network transmission by using two relations located at a single host.

5. Optimizing for maximum parallelism. This may involve performing parts of the query at different hosts or arranging the order of evaluation so that the expression tree is as "broad" as possible but no "deeper" than absolutely necessary. ("Broad" refers to the number of operators at each level in the tree; "deep" refers to the number of levels.) The former case is discussed in the next section.

The optimization problem can be considered to be that of minimizing the value of a cost function which takes all pertinent factors into account. Such a function would be developed by the Modeling effort.

Query evaluation. As described earlier, the query entered by a user is parsed and transformed into a tree (Figure 2). The query would be evaluated by traversing the tree and evaluating each node as it is visited. The order of visiting must be chosen so that the arguments to an operator are available when the operator is evaluated. (One possible order of evaluation of the tree in Figure 2 is reflected in the node numbering shown.) One problem with this simple method is that any potential parallelism of evaluation cannot be exploited. Hence, we must use a more sophisticated evaluation method if we are to give prompt responses to queries. The following description is intended only to capture the flavor of the proposed technique. It has been modified and simplified for ease of presentation.

Part of the function performed by the tree optimization step is to decide where (i.e., at which host in the network) the parts of a query can be most efficiently evaluated. In general, a tree may be broken apart before evaluation and spread through the network. To simplify the job of the query evaluator, all nodes of a tree are considered to be relations. This is true of actual relations stored on an external storage medium, intermediate results of operators, and relations being computed at other locations and returned via the network. Only the routines used to physically access the relations may have to distinguish between local, remote, and intermediate relations.

In order to take advantage of parallelism and provide dynamic control over the size of intermediate files, the tree will not be evaluated serially, as described earlier.
Instead, it will be evaluated, as much as possible, from the bottom to the top, in parallel. This allows adjustment of evaluation parameters (such as intermediate storage) to a greater degree than is possible with serial evaluation, and also permits any possible parallelism to be exploited.

In order to help exploit parallelism and reduce intermediate storage requirements, a technique referred to as "streaming" or "pipelining" can often be used. This scheme consists of implementing operators which do not require entire relations to be present as inputs before producing an output. The data flows into such an operator in a "stream" and flows out in a "stream." For example, projection and restriction can operate on a streamed basis. In general, joins and sorts must wait until all the needed tuples are available. This technique, combined with the parallel evaluation technique above, produces the first result tuple from a query as soon as possible, and produces further tuples as they become available. A normal evaluation technique, using intermediate storage ("hit files"), typically completes each operation before starting operations waiting for its results. The nature of the operators that can be streamed and the mechanics of the streaming operation itself are under study.

Service and utility functions. In the system described, there is no necessity that all the operations needed to fulfill a request be performed at the originating (user) site. Indeed, substantial inefficiencies could occur from transmitting large, intermediate relations over the network. So, a mechanism must be provided to perform service requests. For example, suppose a user on host A issues a request which can best be fulfilled by doing a join on two relations stored at host B. The network data manager at host A must inform the network data manager at host B of this decision. The server (host B) data managers cooperate to retrieve the tuples and send the results to the appropriate destination. Notice that the destination of the output might even be some third host. Considerable effort must go into the design of protocols to accomplish the desired communication. The user (host A) and server (host B) network data managers must also cooperate on the choice of the particular method of performing the desired operation. For example, the join should take advantage of any indices or sorts which can be used.

Many of the support functions required for the operation of the distributed DMS are not specifically DMS functions. These include communication packages, file dump/restore capability, sort packages, and others. Clearly, the system used should provide as much of this service as possible. Some other DMS support functions are not used during normal operation of the DMS, but are part of the DMS. These include file and index integrity checking routines to aid in recovery and debugging, crash recovery programs, report generating programs, system operation monitoring routines, and others. This type of support must be designed and built, and may require a substantial amount of effort.

Local Data Manager

Introduction. The functions of the local data manager are:

1. to retrieve tuples from a basic relation,
2. to insert tuples into a basic relation, and
3. to gather statistics for use by the network data base administrator.

(By basic relation we mean one which is not derived from any other relations.)
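Before taking these functions up in turn, the streaming notion described under query evaluation above can be illustrated briefly. The sketch below, in Python and with all names and data assumed solely for illustration, expresses restriction and projection as operators that consume and produce one tuple at a time, so that results begin to flow before an entire input relation has been read; a join or a sort would, in general, have to buffer its input instead.

    # Illustrative sketch of "streamed" operators: each consumes an iterator
    # of tuples and yields output tuples without waiting for the whole relation.
    def restrict(stream, predicate):
        for tuple_ in stream:
            if predicate(tuple_):
                yield tuple_

    def project(stream, domains):
        for tuple_ in stream:
            yield {d: tuple_[d] for d in domains}

    # Example: tuples arrive one at a time (e.g. from a local data manager or
    # over the network); qualifying results are produced immediately.
    employees = iter([
        {"EMP_NO": 1, "AGE": 23, "SALARY": 4800},
        {"EMP_NO": 2, "AGE": 41, "SALARY": 9200},
        {"EMP_NO": 3, "AGE": 24, "SALARY": 4500},
    ])
    qualified = restrict(employees, lambda t: t["AGE"] < 25 and t["SALARY"] < 5000)
    for result in project(qualified, ["EMP_NO"]):
        print(result)        # {'EMP_NO': 1} and then {'EMP_NO': 3}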
As was briefly mentioned in the system overview, the local data manager resident at a host operates only on the data local to that host. The precise division of labor between the network and local data managers is likely to be smeared in an implementation. For example, considerable efficiency can be gained by the local data manager's being aware of what use is intended for the retrieved tuples. In this section we will discuss each of these functions.

Retrieval. For retrieval, a network data manager sends to the local data manager a relation name R, a selection function F, which is either true or false for each tuple in R, and a sort criterion for the output tuples. The local data manager must extract all the tuples from R for which the selection function is true and supply them to the network data manager in the appropriate sorted order. Both the selection function and the sort criterion may be null. If the local data manager were to perform these tasks in a brute force manner, its performance would be totally unacceptable. Rather, the local data manager must be able to employ aids such as indices, clustering, and expression optimization to make the data retrieval as efficient as possible.

To discuss these concepts more completely, it is necessary to have some additional definitions. Each domain of a relation may have an index defined for it. This index is an additional data structure which relates the values of that domain to (sets of) tuples (possibly) containing those values. A domain is completely inverted if, given a value, we can always determine from the index exactly which tuples contain the value. A domain is partially inverted if, given a value, we can determine from the index a set of tuples guaranteed to include all the tuples containing the value, but possibly including some others. A relation may be clustered on a domain (or collection of domains). Clustering can improve the efficiency of accessing the relation because we can improve the hit rate (i.e., a greater percentage of the tuples accessed will satisfy the retrieval specifications). Thus, fewer I/O operations are necessary to retrieve all qualified tuples. Clustering may be combined with indexing.

The work necessary to reply to a retrieval request on relation R, satisfying qualification function F and sorted by sort criterion S, is:

    W(R,F,S) = work to process indices
             + work to retrieve tuples
             + work to qualify retrieved tuples
             + work to sort qualified tuples

What we seek is to minimize the total work to reply to the retrieval request. However, since the terms are related to each other, we cannot simply minimize each term. In the following paragraphs we will discuss each term and some of the interdependencies.

The starting point in optimizing local retrieval is the qualification function F. Let us consider functions which are Boolean expressions of domains and constants, such as the following:

    F1: AGE < 25 and SALARY < $5000
    F2: PART COLOR ≠ GREEN and SUPPLIER ≠ JONES
    F3: SALARY > (2 * STD DEVIATION(SALARY) + AVG(SALARY))

We can minimize the work to retrieve tuples by using indices where possible. For example, with F1 as the qualification function, an index (complete or partial) for either AGE or SALARY could be profitably used to cut down the number of tuples to be retrieved. We could process the index to determine in advance which tuples could not possibly satisfy F1. If both AGE and SALARY were indexed, we could perform an intersection on the two indices and retrieve even fewer tuples.
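The intersection step can be pictured with a small sketch. It assumes, purely for illustration, indices realized as simple value-to-tuple-identifier tables; the index structures actually used in the experimental system remain an open question.

    # Illustrative sketch: completely inverted indices on AGE and SALARY,
    # represented as value -> set of tuple identifiers. For F1 we take the
    # union of the entries satisfying each comparison, then intersect.
    age_index = {23: {1, 7}, 24: {3}, 31: {2, 5}, 45: {4, 6}}
    salary_index = {4500: {3, 5}, 4800: {1}, 7000: {7}, 9200: {2, 4, 6}}

    def candidates(index, test):
        """Tuple identifiers whose indexed value satisfies the comparison."""
        ids = set()
        for value, tuple_ids in index.items():
            if test(value):
                ids |= tuple_ids
        return ids

    # F1: AGE < 25 and SALARY < $5000
    to_retrieve = candidates(age_index, lambda v: v < 25) & \
                  candidates(salary_index, lambda v: v < 5000)
    print(sorted(to_retrieve))   # [1, 3] -- only these tuples need be read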
This does increase the work expended in index processing, and also may decrease the work in qualification. (If both AGE and SALARY are completely inverted, for example, every tuple retrieved satisfies the qualification expression.) Thus, indices should be used when they substantially reduce the number of tuples to be retrieved.

In the second example it is probably not advantageous to use indices. Suppose both PART COLOR and SUPPLIER are completely inverted. We might spend considerable time intersecting indices to exclude only a few tuples. This can result with any comparison operator and is not simply an effect of the ≠. In one intelligence document data base that the authors are familiar with, over 80% of the documents had SOURCE COUNTRY = USSR. The system designers simply prohibited the use of that comparison in retrieval qualification functions.

An index for SALARY would be of little value if F3 were the qualification function. There, the values of interest are not a constant but are derived from operations on the domains of the relation. In general, expressions where two domains are compared are not amenable to index processing. If queries like F3 are frequently processed, we may wish to calculate the AVG and STD DEVIATION and remember them for future use.

These examples are intended to illustrate the non-trivial nature of index processing. We are continuing to study the subject in an effort to determine the general principles involved. These particular examples are all relatively clear cut; in a practical system, decisions will often have to be made in far less obvious circumstances.

There is also a great deal of work to be done on effective index structures. Ideally, one would like the index to yield the desired tuple identifiers after a minimum of processing. In practice, the structure which is optimal for some queries may prove to be totally unsuited to others. Very little theoretical work has appeared on the subject. We plan to consider a wide variety of structures. By modularizing the index functions, we can experiment with different alternatives with a minimum of effort.

The actual mechanics of tuple retrieval are heavily dependent upon the physical realization of the relation. After index processing we have a collection of tuple identifiers (or identifiers of blocks of tuples, from a partially inverted domain index). Conceptually we use these identifiers as pointers into an array of tuples. Practically, there may be transformations applied to achieve the identifier-to-tuple mapping. At the tuple level, considerable freedom exists as to the physical realization of the domains. In one system (RDMS on Multics) which we have investigated, tuples are realized as sets of pointers into sorted arrays containing all the values for each domain. Such a technique has the advantage that every tuple in a relation is a fixed length. Comparison operations may be performed on these pointers with the same effects as on the actual values. The disadvantage of this particular technique is that such indirection may cause additional I/O operations to be performed to actually retrieve the tuples. Another very attractive scheme would be to utilize a sophisticated data compression technique to reduce the size of the relation. The compression may be viewed as a form of indirection, since the values may be stored in a decompression table. The brute force method may eventually prove to be an acceptable alternative.
Here, a tuple is stored as a self-defining collection of domains. No indirection or compression is used.

As briefly mentioned above, the qualification of retrieved tuples may be profitably altered if we are able to use indices. Certain sub-expressions can be replaced with constant true or false values. These constants should be propagated as high as possible in the evaluation tree. For example, consider the following retrieval specification:

    F4: AGE = 25 and SALARY > $20,000

If AGE is completely inverted, the qualification expression should become:

    F4': SALARY > $20,000

This occurs because we can use the AGE index to guarantee that every tuple retrieved has AGE = 25. The exact order of evaluation should be optimized by taking into account known (or observed) characteristics of the retrieved tuples. A tuple should be qualified (or disqualified) as soon as possible. Ideally, we should use the probability of a condition being true for a tuple to reorder the evaluation. The theory here exists, but the practical difficulties of getting the required data limit the effectiveness of the more elegant techniques.

After the tuples have been qualified, they may need to be sorted. The literature on sorting is very rich. There are several algorithms which would be suitable. If we are clever, however, it may be possible to avoid the actual sorting of the tuples. If the relation is already sorted on the appropriate domains, all we have to do is maintain that ordering through the subsequent processing of the tuples. This seemingly remote possibility may actually occur fairly often (e.g., when previously normalized relations are "joined back together"). There, we would like both relations sorted on the joining domain so that a streamed join would be possible.

Insertion. Conceptually, insertion of a tuple into a relation is a trivial operation. The new tuple is simply appended to the relation. This conceptual simplicity is one of the major attractions of the relational model. In practice, it may be desirable to increase the complexity of insertions in order to speed up retrievals. Devices such as indices and sorts can greatly improve retrieval performance. They complicate insertion by requiring more objects to be altered to complete the operation. Since an insertion requires exclusive use of the relations being inserted into, including indices, the insertions may seriously restrict the system's retrieval capacity.

In a distributed environment the problems are heightened. All the copies of a relation must be updated. The different copies are likely to be organized in different ways. Views or inverse views may be utilized. Potentially the most vexatious problem is synchronization. Consider the following situation. There are two relations R1 and R2 located at different hosts. Each relation has a queue of user requests pending for it. We have just received a user request which requires insertion of a tuple in both R1 and R2. (This could come about, for example, if R1 and R2 were normalized relations which resulted from a single relation.) We must wait until both R1 and R2 are free. We cannot insert the tuple in R1 at one time, then later insert the tuple into R2. If this were done, operations like join would not work correctly, or, worse, would give erroneous results. This synchronization scheme would leave us waiting for the last relation needed to come free. Meanwhile, the previously locked relations would be unavailable for use.
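A minimal sketch of this first scheme follows. It is written in Python purely for illustration; in particular, the discipline of acquiring the relations in a fixed order is an assumption added here to avoid deadlock between competing insertions and is not part of the design described above.

    # Illustrative sketch: an insertion spanning R1 and R2 waits until it
    # holds exclusive use of both relations, then applies both insertions.
    import threading

    locks = {"R1": threading.Lock(), "R2": threading.Lock()}
    relations = {"R1": [], "R2": []}

    def insert_into_both(tuple_r1, tuple_r2):
        # Acquire in a fixed (alphabetical) order so that two such requests
        # cannot each hold one relation while waiting for the other.
        with locks["R1"], locks["R2"]:
            relations["R1"].append(tuple_r1)
            relations["R2"].append(tuple_r2)
        # Note that both relations were unavailable to readers for the
        # whole interval, which is exactly the difficulty described above.

    insert_into_both({"EMP_NO": 9, "POSITION": "MANAGER"},
                     {"EMP_NO": 9, "TAX": 8100})
    print(relations)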
Another technique for minimizing this problem is to have a reader copy and a writer copy of each relation, and periodically switch roles, doing a batch update on the former reader copy before making it the new writer copy. The problem with this second scheme is that we cannot see recent updates. In an on-line data base system, these may be precisely the ones in which we are most interested. Any attempt to read from the writer copy will return us to the problems of the first scheme.

The problem of synchronization in a distributed data base is, thus, not at all easy to solve. It is an area of continuing research. These two examples were intended merely to point out some of the difficulties.

Beyond the mechanical aspects of insertion lie the larger issues of integrity and consistency. Insertion is the most logical time to focus our integrity interests. A given tuple may be processed and verified before it is inserted. Certain fundamental integrity constraints may be simply considered in a relational model. These fundamental constraints relate to non-duplication of values in distinguished domains called primary keys, and to the preservation of the uniqueness of key domain to non-key domain mappings. Other integrity constraints can often be expressed as queries which must always (or never) be satisfied by an inserted tuple. Thus, if each insertion is processed against these integrity constraints, we can guarantee that they are met for the entire data base at all times. The local data manager may not always be the best place to include these broader integrity checks. For example, they may span several relations.

Measurement. The local data manager is the logical place to collect many of the statistics which will be used by the network data base administrator. These statistics would be the input for decisions on data base restructuring, index creation or deletion, clustering, allocation of relations to hosts, etc. All of these are basically static decisions. The network DBA makes or reviews his decisions at relatively infrequent times. The kinds of data needed for such decisions include:

1. summaries of the tuples returned for the retrieval requests received,
2. summaries of the types of qualification expressions used,
3. summaries of the output sort criteria used, and
4. measurements of the work expended in index processing.

This list is by no means exhaustive. As the system design is refined, the exact nature and characteristics of the necessary system measurements in support of the network DBA will become clearer.

Another class of measurements is of a more dynamic character. These are used as input for the various strategy decisions made for each query by both the network and local data managers. In the preceding parts of this section and in the discussion of the network data manager, we have shown how some of these measurements may be used. These dynamic measurements include:

1. responsiveness of the host,
2. probabilities of particular logical qualification expressions being satisfied,
3. limited retention of recent sort information, and
4. retention of derived values such as averages, counts, standard deviations, etc.

Again, as the system design is refined, the exact character of these measurements will become clearer.

Network DBA Support Functions

Introduction. This section is concerned with those functions which the network DBA performs to keep the distributed DMS running efficiently and smoothly.
On existing single-site systems he takes two classes of actions to reduce costs and improve response times. The first class is best typified by the reorganization of files on disk to maximize channel utilization. Any alternative will require the same amount of storage, and the same amount of I/O will be used by queries. However, response times will be improved because of the superior overlap of I/O and CPU. The second class involves trading off the costs of two or more computer resources to optimize system efficiency. Such is the case when the DBA decides to add an index to reduce the cost and time required for a class of queries. However, this is at the expense of index generation and increased costs for updates and storage. This type of decision must also be based on a very good understanding of current and anticipated user needs.

This problem is greatly complicated in a distributed environment where users with divergent interests are accessing the data base from a large geographic area. The network DBA would have to understand the time-varying needs of the user communities and the total load placed on the system by the interaction of these needs. As this is a formidable task, the system itself must provide sufficient performance statistics for the DBA to control the system's behavior. The first step is to determine those parameters which strongly influence system performance.

Much of the above description is equally applicable to a single-site DBA. Indeed, a goal of a distributed DMS is transparency. The users need not be aware of the existence of the network or of the fact that not all the files are available locally. The major added problem for the network DBA is that remote files can only be accessed over relatively low bandwidth channels. The effective network bandwidth of 30 kbit/sec is a factor of over 200 lower than the transfer rates of 0.8 to 1.0 megabyte/sec from advanced technology disk systems (e.g., HIS DSS 190, IBM 3330). Thus, the network DBA must carefully consider the allocation of files to the network hosts. The network DBA does have the flexibility added by the existence of multiple copies of files. As long as the information content of the copies remains the same, each copy may be optimized for a particular class of operation. The reliability advantages of multiple copies would still exist, and conflicting user requirements could be met. In the remainder of this section we will discuss some of the functions the network DBA performs and the considerations that go into his decisions.

Definition of relations. When a new relation is defined, its domains, access control, and integrity constraints must be specified. As these decisions are based on real-world knowledge, the DMS can do little to help the DBA. However, the DBA must also specify where to locate the relation, how many backups it should have, and their placement. There are three factors which influence these decisions.

First, the political constraints must be considered. A relation may be critical to a command's mission. This can force the relation to be stored locally. Even if communication with the rest of the network is broken, the command will be able to maintain its operations. Security considerations might also constrain which hosts could store a relation.

Second, the DBA must consider the impact of the relation on the performance of the various hosts. To assess this he needs to know the loads processed by the hosts and how much this relation will add to those loads.
Techniques for measuring system resources and the impact which another job would have on resource utilization are not well understood. Until such tools are developed, the DBA can make little use of this parameter.

Finally, the impact on network traffic must be determined. Other relations will often be involved when evaluating a query. If they are not all on the same host, network traffic will be required. Also, if the computation is not done on the user's host, the results will have to be shipped to him. Because network traffic puts a load on hosts and is quite slow compared to disk-memory transfers, total "cost" is probably very sensitive to this parameter.

Dynamic allocation of relations. The original placement of a relation and its backups is based on political and security considerations and on estimates of the impact on the hosts' loads and network traffic. After experience is gained with the relation, it may become desirable to move it to another host. Essentially the same parameters as discussed above are involved in deciding where to move it. However, now the DMS can supply information on what other relations are used with the relation to evaluate a query and where results are placed. The DBA can now choose the best site for the relation based on actual usage patterns rather than estimates.

Availability of a data base. The availability of a data base is a function of the mean time between failures (MTBF) and mean time to repair (MTTR) of the host systems, the DMS, and the communication lines to the hosts. Besides adding extra communications lines, the only way the DBA can improve the availability of the data base is to add backup copies at other sites in the network. The DMS then has the added responsibility of maintaining the backups as images of the primary copy. The DMS has to ensure that the backups receive all updates. This increases the local load, network traffic, and the load on the backup sites. As the number of backups increases, the real time required to keep them synchronized increases. It is hoped that the overhead required to maintain the backups will be offset by performance improvements resulting from altering the structure of the backups. The backups could be structured to be very efficient for classes of queries handled poorly by the primary copy. As long as the backup copy contained the same information as the primary, the availability would still be enhanced.

Clustering. The idea here is to store the data so that consecutively read records are often in the same block. Take, for example, a personnel file sorted on social security numbers where five records make a disk block. If a listing of the personnel file sorted on social security number is requested, the minimum amount of I/O will be used. However, if the listing is to be sorted on name, up to five times as much I/O would be required. It is up to the DBA to decide on which domain, if any, to have the relation sorted. An alternative, discussed below, is to use an index to "remember" a sort. There are four parameters the DBA must evaluate when deciding how to order a relation.

1. Cost of a sort. If there are n tuples in a relation, the cost of sorting it will be O(n log n), and a substantial amount of temporary storage will be required for the duration of the sort. Also, no other users will be able to access the relation while it is being sorted.

2. Number and types of queries. Some queries will be much easier to evaluate when the relation is sorted, because the relation is clustered properly for that query.
However, other queries might run slower due to inefficient use of I/O channels. The DBA needs to understand the query load to be able to forecast the effects of a sort on the processing of queries.

3. Number and types of updates. It could be quite expensive to insert a tuple in the middle of a relation to keep it sorted. On the other hand, in the example above, given a social security number, it would be quite cheap to change the name field.

4. Cost of storage. The DBA should have the option of storing the relation at a backup site sorted on a different domain. Then a higher percentage of the queries will be cheap to answer, but the overall cost of updates will increase.

Index selection. The idea behind an index is to remember part or all of the tuples resulting from a query. In the example in the previous section, indices could be maintained for those domains on which the output is sometimes sorted. Considerable I/O will still be required for those queries, but much less than the cost of a sort. The DBA has to weigh this saving against the increased cost of updates and the cost of storage when deciding what indices to add.

Views. If a query were asked often enough, it would become reasonable to save the answer to the query in a view, reducing the cost of that query the next time it was asked. Unfortunately, reflecting an update to one of the relations on which the view is based could be quite expensive. The cost would probably be a function of the number of joins involved in the query. Experience has to be gained with this tool before it can be properly evaluated.

Research Affecting the Experimental System

Previously Described Topics

The purpose of this section is to give a concise summary of the research required to support the experimental system. It should be emphasized that the experimental system is not simply a software project. Many of the component areas of the system are major research topics in their own right. The Research Plan (CAC Doc. No. 164, JTSA Doc. No. 5510) discussed several of these topics in detail. For completeness, Table 2, a summary derived from the Research Plan, is included.

    Topic                        Research needed
    Synchronization              Application of operating system and data base
                                 synchronization methods.
    Name space management        Extension of name space to allow network-wide
                                 naming of logical entities.
    Network file allocation      Algorithms for determining suitable locations
                                 for the files.
    Multiple copy management     Design and analysis of the basic and backup
                                 algorithms.
    Data compression             Application of existing techniques.
    Data clustering              Design and analysis of fast and effective
                                 techniques.
    Deadlock                     Application of existing techniques.
    Automated structure design   Analysis of data required and algorithms for
                                 automated data structure design.

    Table 2. Research topics covered in the Research Plan

There is another group of topics that is more specific to the particular approach we have taken. A discussion of these topics follows.

Relational Operators Suitable for Network Use

Research needed: design and implementation of relational operators better suited to network environments. The current definitions of operators for the relational algebra rely upon the existence of two complete relations. In a network environment where effective communication rates are on the order of 20-40 kbit/sec, it could potentially take a very long time to send a complete relation to another host.
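As a rough illustration (the figures are assumed solely for this example): a relation of one million tuples averaging 100 characters each amounts to about 8 x 10^8 bits; at an effective rate of 30 kbit/sec, its transmission alone would occupy roughly 2.7 x 10^4 seconds, i.e., over seven hours, before a conventional join at the receiving host could even begin.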
The thrust of this area of research will be to define a relationally complete set of operators which will allow the same operations to proceed before a complete relation is transmitted. For example, our initial studies suggest that by sorting the relations in a join on the joining domain, the resulting join relation can be produced a tuple at a time. To do an unordered join requires O(m*n) work, where m and n are the numbers of tuples in the two joining relations. A sort on a relation of k tuples requires O(k log k) work. A join in which both relations are ordered requires, in the normal case, O(m + n) work, i.e., a simple pass through each. Imagine two relations A and B which are both sorted in ascending order on a domain. To join them on that domain, we choose one, say A, look at its top tuple, and compare the value of the joining domain with the corresponding domain in the top tuple of B. If they are equal, the concatenated tuple is in the join and should be output. If the value in A is less, that tuple of A cannot possibly be in the join, so we discard it, making the next tuple of A the top (i.e., pop one tuple of A). If the top of A is greater than the top of B, pop B. The join can thus be produced with O(m log m + n log n + m + n) work, with partial, intermediate results being available. Considerable analysis should be performed before we can be assured that such a "streamed" join is really suitable, but this is the direction the research should take.

Views and Inverse Views

Research needed: analysis of techniques for implementation. As discussed above, the concept of a view is an attempt to provide the user with a better interface to the system. By means of a view the user can have a data base that is very well suited to his own particular needs. Views may be used for access control by deleting from the user's view anything he is not permitted to see. The most important area of research on views is how they may be mechanized. In some research systems views are created by processing each query through what is essentially a generalized macro processor. In principle, the distributed DMS environment allows views to be physically created. This is because a distributed DMS is designed to handle multiple copies of relations. Physical creation of the view may be undesirable from a practical standpoint, since large amounts of storage may be required to support the views. There are also some theoretical problems regarding update anomalies. It is just not possible to perform certain classes of updates to a view. These anomalies occur when the underlying relations are such that no combination of changes to them will result in the desired change to the view.

Since the notion of an inverse view is a new one, it needs to be explored more completely. The issues are mainly in the area of management. If the inverse views are disjoint, the situation is simplified. However, there still may be problems in directing updates and queries to the proper host. When the inverse views are not disjoint, i.e., some tuples are represented in two or more inverse views, one faces the problem of update synchronization discussed throughout this document. It may well turn out that the notion of an inverse view is superfluous or inadequate. Research is needed to explore the limits of the concept.

Dynamic Optimization of Retrieval Strategies

Research needed: design of techniques for the optimization of both global (multiple relation) and local (within a single relation) retrieval strategies.
We have presented a very brief discussion of some of the global optimization techniques in the section on the network data manager. A similar, brief discussion of local optimization is found in the section on the local data manager. The research needed is to perform a more thorough analysis of the possible techniques for retrieval strategy optimization, and to apply the results to the experimental system. Since the optimization must be performed on a dynamic basis for each query, it must operate quickly and effectively, lest the optimization cost more than it saves.

Implementation Considerations

Manpower

The initial design and implementation of the system will require approximately 50 man-months of effort over the course of the 13.3-month follow-on contract. Four different skill levels are required.

The principal investigator should be a seasoned researcher who can direct the progress of the overall research program. The principal investigator will devote approximately half of his time to the experimental system effort and half of his time to the complementary modeling effort.

The second skill level required is a senior investigator. The senior investigator will be responsible for the coordination and management of the activities making up the experimental system effort. He should be capable of independent, creative research, as well as assuming management responsibilities for the experimental system design and implementation. A senior investigator will be devoted full time to the experimental system.

The third skill level is systems analyst. Systems analysts should have broad competence in the design and implementation of software systems. Strong expertise in at least one theoretical or empirical area is necessary. Between two and three full-time systems analysts will be required.

The fourth skill level is graduate research assistant. They should be Ph.D. or Masters candidates in computer science. One or two graduate research assistants will be utilized.

Schedule

The initial phases of the experimental system effort will be devoted to the design of the system. As portions of the design are finalized, implementation will begin. The design effort will commence immediately. A complete, though preliminary, system design should exist around the first of the year, i.e., four to five months into the contract. This design will be circulated for comments. An important component of the system design will be the specification and analysis of the basic algorithms for managing a multiple-copy, distributed DMS. Actual implementation of the system will begin fairly early, around two to three months into the contract. In these early phases, implementation will be at a low level of activity. Shortly after the completion of the design, implementation will become the dominant activity, and will remain so through the end of the contract.

In a previous section we discussed the impact of the other research activities on the experimental system. The estimated timings in this section could be substantially altered by major difficulties or breakthroughs on any of these research topics.