Bn MAlHEMATiC « N TRMC 1R C U lAT 1 ON S DBOOKSTACKS The Person borrowng this «* before be charged a minim each non-returned o ^ ^ , ^^l-*r^5«-»- o,-B,, f \llinoi. and ore pr law and Procedure. _. 333-8400. - TO BMW; ejuBiJ u*—*-* APR 25 below previous due date- Digitized by the Internet Archive in 2013 http://archive.org/details/relationaldataba913kimw ^^ y^rt No. fl/aTZ :&f UIUCDCS-R-78-913 UILU-ENG 78 1703 RELATIONAL DATABASE SYSTEM IMPLEMENTATIONS — A SURVEY by Won Kim .5? o January 1978 NSF-OCA-MCS73-07980-000031 m DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN URBANA, ILLINOIS University of Illinois ^na-ChamHUgfl Report No. UIUCDCS-R-78-913 RELATIONAL DATABASE SYSTEM IMPLEMENTATIONS — A SURVEY by Won Kim January 1978 Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 * This work was supported in part by the National Science Foundation under Grant No. US NSF-MCS73-07980. ACKNOWLEDGEMENT This author is deeply indebted to Professor David J. Kuck for the numerous helpful discussions and suggestions on the subject mat- ters of this paper. ABSTRACT This paper surveys the overall structure and design of some of the prototype relational database systems frequently referenced in the literature, and then examines in some detail some inter- esting approaches that have been taken in implementing various requisite features for a comprehensive relational database sys- tem. TABLE CF CONTENTS 1 . Introduction 2. System Architecture and Design 3. Storage Structures and Access Strategies 4. Query Processing and Optimization 5. Support of User Views 6. Integrity and Authorization Control 7. Concurrency Control 8. System Checkpoint and Recovery References 1. INTRODUCTION A series of papers by E.F. Codd between 1970 and 1971 (1, 2, 3. k) provided a great impetus to the development of relational database systems and associated research activities. A number of prototype systems were implemented to demonstrate the feasi- bility for supporting high level, non-procedural, set-oriented data sublanguages. At the same time a number of systems were developed to serve as database access and storage subsystems for implementing high level relational data sublanguages. More recently efforts have been directed toward implementing compre- hensive relational database systems that incorporate solutions to various specific problems that have been identified. Table 1.1 lists a number of prototype relational systems that are fre- quently referenced in the literature. A comprehensive relational database system must not only provide interfaces for high level, non-procedural data sublanguages for both the application programmers and remote terminal users, but also support various users' views of the stored database, enforce data integrity and authorization constraints, control concurrent updates of a shared database, and provide transaction and system recovery. Efficient data access strategies and storage struc- tures and query processing techniques must also be employed to meet the response-time requirements of on-line terminal users. Also a good report generator should be provided to display the results of a query in a highly stylized format. Readers wishing for a tutorial on relational database are refer- red to Date (6) and Chamberlin (7). Table 1.1 Some Prototype Relational Database Systems MADAM (MacAIMS Data Management System) — Project MAC, MIT RDM3 (Relational Data Management System) -- MIT DAM AS (Data Management System) — MIT PDMS (Relational Data Management System) — General Motors Research Lab., Warren, Mich. IS/l (Information System 1) — IBM UK Scientific Centre, Peterlee, England PRTV (Peterlee Relational Test Vehicle) -- IBM UK Scientific Centre, Peterlee, England RM (Relational Memory) — IBM Cambridge Scientific Center, Cambridge, Mass. XRM (Extended Relational Memory) — IBM Cambridge Scientific Center, Cambridge, Mass. QUERY -BY -EXAMPLE -- IBM Yorktown Heights Research Lab., N.Y. SEQUEL -- IBM San Jose Research Lab., Calif. GMI3 (Generalized Management Information System) — IBM Cambridge Scientific Center, Cambridge, Mass. MIT Sloan School of Management, Cambridge, Mass. 3QUIRAL (Smart Query Interface for Relational Algebra) — University of Utah RENDEZVOUS -- IBM San Jose Research Lab., Calif. ZETA/TORUS (Toronto Understanding System) -- University of Toronto INGRES (Interactive Graphics and Retrieval System) — University of California at Berkeley SYSTEM R -- IBM San Jose Research Lab., Calif. ^his paper first describes the overall architecture and design of some of the prototype systems, and then examines some of the approaches that have been taken in implementing the various re- quisite features for a comprehensive relational system. The illustrative examples used in this paper are based on the following database of employees and their departments i EMP (NAME, DNO, JOB, SAL, AGE, MGR) » primary key is NAME DEPT (DNO, LOC, NEMPS) ; primary key is DNO Each EMP tuple contains the employee name, department number, job name, salary, age, and manager's name. Each DEPT tuple contains the department number, location (floor), and the num- ber of employees in the department. Some of the relational database terms used in this paper are listed in Table 1.2 along with their equivalents found in the literature. Table 1.2 Some Relational Database Terms base relation -- schema, data model, table, stored relation view — subschema, external data model, derived relation, implied relation, virtual relation, defined relation column, field -- attribute, domain, component tuple -- row, entry range variable -- tuple variable, entry variable selection — restriction projection join -- semi join, natural join, composition statistical or aggregate function — libray function, built-in function 2. SYSTEM ARCHITECTURE AND DESIGN This section describes the overall architecture and design of a number of prototype relational systems. Systems treated in de- tail include MADAM, IS/l, PRTV, RM, XRM, ZETA, INGRES, and SYS- TEM R. Other systems that are mentioned in Table 1.1 are only briefly described below. General 'Motor's RDMS (8) supports a data sublanguage based on the relational algebra. It has been used in several applica- tions for engineering and commercial data analysis. MIT's RDMS (9) is an upgraded version of MADAM, and is widely in use with- in the administrative departments at MIT. MIT's DAMA3 (10, 11) and Utah's SQUIRAL (12) have made interesting contributions to the area of query optimization (see section k) . IBM's SEQUEL (13, I**-, 15) and QUERY-BY-EXAMPLE (16, 1?) implement two of the best developed high level, non-procedural data sublanguages, SEQUEL (18) and QUERY -BY -EXAMPLE (19), respectively. Both sys- tems use XRM as data access and storage subsystem. GMI3 (20), which is being used at MIT for modeling New England energy re- sources, was developed by extending the SEQUEL prototype to ac- commodate a multi-user environment. RENDEZVOUS (21) is a natu- ral language understanding system that supports a relational view of data. 2.1 MADAM (MacAIMS Data Management System) (22, 23) MADAM, developed as a part of Project MacAIMS at MIT's Project MAC, is generally acknowledged to be the first system that sup- ported an n-ary relational model of data and a set-oriented data sublanguage based on the relational algebra. MADAM was imple- mented on MULTIC3 in PL/l , the language in which MULTIC3 itself was written. Fig. 2.1.1 shows the overall structure of MADAM. The user-oriented subsystem consists of a query language inter- preter and display handler. The division of the system into a set of procedure and data segments concerned with sets of data elements (e.g., name, salary, job, department number) and another set of procedure and data segments concerned with sets of rela- tions (e.g., EMP, DEPT) is the consequence of the view taken by the designers of MADAM that information one may store in a data- base consists of sets of data elements and sets of relations among the data elements. Associated with each type of data elements are a data segment and a data element module. Associated with each possible data structure for storing a relation (e.g., tree, linked list, array) are a relation data segment and a relation strategy module. Fig. 2.1.1 is more fully explained in section 3,1. 2.2 Peterlee IS/l (24) IS/l was the first relational prototype implemented by IBM. This system, developed at the IBM UK Scientific Centre, Peterlee, England, supports a query language based on the relational alge- bra and was implemented in PL/l. Fig. 2.2.1 illustrates the architecture of the system. In the syntax section, a user query is translated into a string of intermediate codes. This string is a procedure call and asso- ciated arguments. The procedure section consists of a set of PL/l procedures that support the set of user commands. A simple interpreter is pro- vided to serve as the interface between the syntax section and the procedure section. The interpreter isolates the arguments from each string of intermediate codes, pushes them onto an arg- ument stack, and passes control to the procedure being called. The database facilities section consists of a set of PL/l pro- cedures that access and store views and base relations. IS/l was designed to allow the users to easily extend the func- tional capabilities of the base system by writing their own PL/l procedures and linking them to the system. The syntax section is extended as a side effect of the procedure section extension. IS/l supports user views on the base relations by either storing their definitions in strings of intermediate codes (implicit views) or by maintaining the materialized sets of tuples for the views as separate base relations (explicit views). This last point is elaborated upon in section 5.1. 2.3 PRTV (Peterlee Relational Test Vehicle) (25) PRTV was developed at the IBM UK Scientific Centre, Peterlee, England, based on their experience with IS/l. This system was implemented in two major sections called the top end and the bottom end. 8 The top end was implemented primarily in the string processing language MP/3, and partially in PL/l . This section of the sys- tem parses the user language, processes and maintains the sys- tem directories, performs the mapping between relations and storage structures, links user-provided PL/l procedures, and performs query optimization. The designers of I3/l and PRTV feel very strongly that although data sublanguages based on the relational calculus or relational algebra are "relational- ly complete (4)," they are by no means "functionally complete." Thus 13/1 and PRTV allow the users to extend the system to suit their application requirements that system designers may not foresee. The query optimization techniques implemented for PRTV are due to the works of Hall (26), and are described in some detail in section 4.1. The bottom end was implemented in a mixture of PL/l and IBM 370 assembler language. This section consists of a set of proce- dures that perform all storage and relational operations on the database. The user language PRTV supports is based on the relational alge- bra. However, this query language requires a sufficient know- ledge of algebra on the part of the users, and thus restricts the usability of PRTV as a stand-alone system. In fact, PRTV was used as a low level subsystem for the Urban Management Sys- tem which created a database of 60 megabytes with relations of more than 60,000 tuples and degrees as high as 200. This data- base is thought to be the largest relational database to have had regular use (one year). Fig. 2.1.1 Architecture of MADAM Relation Strategy Modules names jobs salaries procedure segment column names data segment relation EMP relation DEPT Fig. 2.2.1 Architecture of IS/l algebra-based user language syntax section interpreter procedure section database facilities syntax extension procedure extension views 10 2.4 RM (Relational Memory) XRM (Extended Relational Memory) (2?» 28, 15) Both RM and XRM were developed at the IBM Cambridge Scientific Center for use as low-level, procedural relational interfaces for host language systems, or as intermediate interfaces in im- plementing high-level, non-procedural relational languages. RM allows variable length byte strings to be stored and accessed by numerical identifiers in a set of pages called the "entity space." Binary relations whose data elements are either inte- gers or numerical identifiers for the byte strings stored in the entity space are stored and operated on in another set of pages called the "relation space." Note that RM supports only binary relations. Binary relations are logically sufficient, for relations of higher degrees can be straightforwardly decom- posed into sets of semantically equivalent binary relations* Binary relations, however, cannot adequately support large, formatted databases (chapt. 11 of 6). XRM was developed by extending RM to support n-ary relations, n-ary tuples are stored in the RM entity space, and inversions that provide efficient associative access to these n-ary tuples are stored in the RM relation space. A detailed discussion of XRM is given in section 3.3* 2.5 ZET A/TORUS (29, 30) ZETA is a relational prototype under implementation at the Uni- versity of Toronto. Three principal levels constitute this sys- tem, as shown in Fig. 2.5.1, 11 The lew-level, called MINIZ, implements elementary commands that directly eperate on single relations. These include commands te create and destroy relations, insert and delete tuples, retrieve and update columns within a tuple, and mark selected tuples of a relation. The most interesting command at this level is the mark command. A mark is a unary relation (i.e., a list) of tuple identifiers of those tuples of a relation satisfying a selection operation. This facility is employed in the intermediate level for the support of views (see section 5*2) and for the implemen- tation of queries involving two or more relations. The intermediate level, consisting primarily of the EXECUTOR module, performs the translation of the intermediate form of the user commands into a sequence of lew level commands. The EXECUTOR consists of an interpreter, a set of "schema" proce- dures, and a set of utilities. The interpreter breaks down the intermediate form of a user command into a sequence of utility operations. For each lew level command, a corresponding utility is provided at this level. The interpreter invokes the schema procedures whenever data pertaining to user views or base rela- tions are to be retrieved or accessed. The high level ef ZETA consists of a host-programming language system (HLS) and a self-contained language system (SLS). The HLS supports a SEQUEL-like data sublanguage embedded in Pl/l . The SLS provides a syntax- table-driven compiler-compiler to tailor a SEQUEL-like stand-alone query language for particular applications. 12 The low level was used as a back end database system for a nat- ural language understanding system called TORUS (Toronto Under- standing System). 2.6 INGRES (Interactive Graphics and Retrieval System) (31) INGRES, developed at the University of California at Berkeley, is implemented on top of the UNIX operating system in C, the language in which UNIX itself is written. INGRES supports QUEL (Query Language), a high level language based on the relational calculus for terminal users; EQUEL (Embedded QUEL), QUEL embed- ded in C for application programmers; and CUPID, a graphics- oriented casual user language. INGRES is implemented as a sequence of UNIX processes as shown in Fig. 2.6.1. Communication between a UNIX process and the tree of subprocesses it spawns is accomplished via one-direction communication links called "pipes." Each pipe is written into by one process and read by another. P rocess 1 is created when INGRES is invoked from the UNIX. It should be noted that INGRES is treated as a user process under the UNIX operating system environment. Process 1 is the inter- active terminal monitor which maintains a workspace for a user to formulate and edit a collection of INGRES commands. The con- tents of the user workspace are passed down pipe A as a string of ASCII characters for execution by later processes. Process 2 translates the input query passed down pipe A into a parse tree using a lexical analyzer and a parser. The qualifi- 13 cation portion of the query is converted to an equivalent Boole- an expression in conjunctive normal form. A conjunctive normal form expression is amenable to the query modification techniques that INGRE3 employs for the support of user views and authoriza- tion and integrity control (32, 22). Query modification is sim- ply the merging of the parse tree for a query with the stored parse trees for the view definitions, and authorization and inte- grity constraints. Section 6.2 gives an example of the query modification technique in action. Process 2 also contains rou- tines for a crude concurrency control. It is noted here that views and selective authorization control have not been imple- mented, and that only simple integrity assertions involving single range variables and no aggregate functions (e.g., average, sum, count) are currently enforced. Process 3 accepts from process 2 a string of tokens representing a user query, and performs the following two functions i 1. decomposition of a query involving more than one range variable (multi-variable query) into a sequence of inter- actions involving only one range variable (one-variable query). For details, see (3*0 and section 4.2. 2. processing of a one-variable query. Retrieve commands are completely processed in process 3» Update commands (Replace, Delete, and Append), however, are converted to appropriate Retrieve commands to simply isolate the qualify- ing tuples. These tuples are then spooled onto a temporary file which is passed to process 4 for the actual update operations. 14 The reasons for this deferred update are explained in section k.2. All utility commands (e.g., create and destroy base rel- ations and secondary indices) are simply passed down to process 4. Note in Fig. 2.6.1 that the results of a retrieve command are returned in a stylized format directly to the user termi- nal. Process k is a collection of overlays that support deferred up- date and all utility commands. Note in Fig. 2.6.1 that all commands are passed to the right via pipes A, B, and C, whereas all error messages are passed to the left via pipes D, E, and F. When INGRES is invoked from an EQUEL program, the executable module for the EQUEL program (produced by a pre-compiler and a compiler) replaces the interactive terminal and process 1 of Fig. 2,6.1. The size of each INGRES process largely explains the particu- lar process structure chosen by the designers of INGRES. Table 2.6.1 INGRES Space Requirements (in bytes) Access Methods (AM) — UK EQUEL (precompiler, compiler, AM) — 4lK Process 1 — 10K Process 2 and AM — 5^K Process 3 and AM — 56K Process 4 (AM, 8 overlays) — 171K 15 The size of a UNIX process is limited to 64K bytes on a PDP llAo, and to 128K bytes on a PDP llA5 or 11/70. The INGRES designers believe that INGRES should appear as a user job to UNIX so that INGRES would operate on a standard UNIX. Consequently, INGRES contains no facilities for mem- ory management and I/O? UNIX performs these functions. This design decision should be contrasted with that for SYSTEM R (see below). Readers wishing for a more detailed account of the UNIX operating system should consult (35) • 2.7 SYSTEM R (36) SYSTEM R, developed at the IBM San Jose Research Laboratory as a vehicle for research in relational database, is by far the most sophisticated and comprehensive prototype system implemented. This system provides extensive data control fa- cilities, including selective authorization, integrity assert- ions, transactions as units of consistency (see section 6.3), This system implements much of the results of the recent in- vestigations in concurrency control (37 • 38, 39) • including such notions as the levels of consistency, granularity of locks, and locking at the logical level as well as at the physical level (see section 7*2). The system also provides facilities for transaction backout, and transaction and sys- tem recovery (see section 8.2). SYSTEM R, rather than rely- ing on the automatic paging of virtual memory in VM/370, pro- vides its own memory management and I/O facilities. This de- 16 cision was made to satisfy the implementation requirements for a novel recovery scheme (see section 8.2). Fig. 2.7.1 shows the overall structure of SYSTEM R. The RDI (Relational Data Interface) is the principal user interface, consisting of a set of operators for data definition and mani- pulation and a set of operators for database locking and trans- action. A special RDI operator called SEQUEL is provided to support terminal users and PL/l host language system users. Users are allowed to write programs on top of the RDI for the support of other relational interfaces (e.g., QUERY BY EXAMPLE) or hierarchical or network interfaces. The RD3 (Relational Data System) is the subsystem that implements the RDI operators. The RD3 provides authorization and integrity control facilities, supports various users' views of the data- base, maintains the catalogs of external names, and performs an access path optimization. The RSI (Relational Storage Interface) is the internal interface which provides operators for tuple-at-a-time access to base rel- ations, and also operators for data definition, data recovery, and transaction management. The RSS (Relational Storage System) is a complete storage subsys- tem that supports the RSI operators. The RSS maintains access paths to the base relations, and secondary indices on selected columns of the base relations. The RSS maintains transaction consistency by performing locking, deadlock detection, and back- out. The RSS also performs memory management and transaction and system recovery. 17 Fig. 2.5.1 Architecture of ZETA HLS S v / 3LS 7ZZ. high level EXECUTOR interpreter schema procedures utilities intermediate level MINIZ basic operations on relations low level Fig. 2.6.1 Process Structure of INGRES A, B, C, D, E, and F are pipes. Fig. 2.7.1 Architecture of SYSTEM R SEQUEL SEQUEL embedded in PL/l stand-alone emulators for hierarchical, network, or other relational interfaces Relational Data Interface Relational Data System Relational Storage Interface Relational Storage System 18 3. STORAGE STRUCTURES AND ACCESS STRATEGIES This section examines the storage structures and access strate- gies used in MADAM, PRTV, RM, XRM, INGRES, and SYSTEM R. 3.1 MADAM The following discussion refers to Fig. 2.1.1. Each type of data elements (e.g., name, salary, job, column name of a relation) is associated with a different DEM (Data Element Module) and a different DES (Data Element Segment). Furthermore, each type of data structure for storing a relation (e.g., tree, list, array) is associated with a different R3M (Relation Strate- gy Module) and a different RDS (Relation Data Segment). A DEM converts appropriate type of data elements from their nat- ural form to their standard form in which all data elements are stored. For example, the standard form for data elements of type "date," "11/25/77" or "Nov. 25, 1977," may be "771125." Since most data elements are either character strings or inte- gers, a D3M (Data Strategy Module) for character strings and a D3M for integers are provided. A DSM generates a unique refer- ence number for each data element. The reference number for an integer is the integer itself, whereas the reference number for a string of characters is determined by an algorithm which pre- serves the lexical ordering between different character strings. The fixed-length (one word) reference number is used in all sub- sequent operations on the data element. Since the reference numbers preserve the lexical ordering between the data elements 19 they represent, the frequent comparison and ranking operations can be directly performed on them. Moreover, these operations can be performed much more quickly on the single-word long ref- erence numbers than on the variable-length data elements. Within a DES (Data Element Segment) all data elements of one type are stored in a binary search tree. The designers of MADAM take the view that for each type of data structure for storing a relation a different set of procedures should be provided to support operations on the relation. Each RSM is designed to perform all database operations on only one type of data structure. A potential difficulty of this approach arises when it becomes necessary to operate on two relations stored in different types of data structure. A canonical form storage structure for relations is therefore defined as the standard two dimensional array representation of a relation. Each RSM is allowed to accept a relation stored in canonical form and produce the output in canonical form. Thus when it becomes necessary to operate on two relations stored in differ- ent storage structures, one of the relations is converted to canonical form by its RSM. The RSM for the other relation then performs the actual operation using the canonical form as one of its inputs. All data elements and domain names of a relation are uniformly referred to by their reference numbers. 20 3.2 PRTV PRTV implements various techniques to increase storage space uti- lization and to reduce data retrieval time. These include the use of a standard format, sorting, and suppression and compres- sion of data. PPTV stores all relations sorted on the values of some column. Sorting makes the execution of operations such as join more ef- ficient, and allows for the suppression of duplicate leading fields. Data suppression is performed as follows i In each record, lead- ing fields are suppressed if they are identical to the corres- ponding fields of the preceding record; and the remaining fields are stored along with an indication of the number of fields act- ually stored. For example, the records shown in Fig. 3.2.1 are suppressed and stored as shown in Fig. 3*2.2. Fig. 3.2.1 Records Before Suppression 12 2 1 2 3 1 1 3 2 1 3 Fig. 3.2.2 Records of Fig. 3.2.1 After Suppression 12 2 1 3 2 13 3 2 13 21 PRTV performs data compression as follows i The first record in each block of records is stored in its entirety; in each of the subsequent records, all fields that have not been suppressed are Exclusive-Ored with the corresponding fields of the first record i only the non-zero bytes that result from this operation are stored, along with a bit map that indicates the positions of these bytes. For example, the records of Figs. 3.2.3a and 3.2.3b are Exclusive-Ored to yield the record shown in Fig. 3.2.3c. Fig. 3.2.3 PRTV Data Compression a) 01011111 01011100 01100101 01100011 b) 01011111 01110101 01100101 01000011 XOR 00000000 00101001 00000000 00100000 c) 01010000 00101001 00100000 bit map 3.3 RM, XRM RM uses one set of pages to store data elements, and another set of pages to store binary relations. A data element is referenced by a numeric identifier consisting of a page address and an off- set within the fixed header area in the page (see Fig. 3«3«1)« The referenced header area position contains the actual byte ad- dress for the data element. A binary relation identifier also consists of a page address and an offset within the page header area. The header area position contains the byte address of the first tuple of the relation. RM maintains a linked list of all tuples of a relation in increasing order of the left column values 22 of the tuples. All tuples with the same left column values, in turn, are represented in a linked list (linked to the common left column value) in increasing order of the right column values (see Fig. 3.3.2). XRM stores all n-ary tuples of a relation in the RM entity space (set of data pages). Each field of an n-ary tuple is represent- ed as a fixed length integer, which is the actual value for an integer column, and which is the numeric identifier for a string of characters stored in another data page for a non-integer col- umn. XRM provides a class relation for all data elements belonging to the same column. This is implemented via a binary relation in RM relation space (set of relation pages). Each tuple of this relation consists of the hashed value of a data element and the numeric identifier for the data element. A class relation is used for quick access from the value of a data element to its numeric identifier. XRM also maintains a binary relation for each n-ary relation so that a particular tuple of the relation may be directly ac- cessed given its primary key value. Each tuple of this binary relation contains the hashed value of the key and the identifi- er of the n-ary tuple. Rapid access to a tuple with a particular value in a particular field is also supported by an inversion relation . There is also a master relation which contains information about all other rel- ations. The relation identifier of a relation is then the iden- tifier of a tuple in the master relation which describes the rel- ation. 23 Fig. 3.3.1 RM Data Page Layout TID1 TID2 header area TID 1 tuple identifier Fig. 3.3.2 RM Relation Page Layout header area RID 1 relation identifier vl< v2 < v3 v5< v6 Fig. 3.^.1 INGRES Page Layout for Keyed Storage Structures next primary page next overflow page next line number tuple 1 tuple 2 TID1 TID2 Mine table TID 1 tuple identifier zk Fig. 3.^.2 Structure of a Keyed INGRES File primary pages a primary page overflow pages next primary page next overflow page next line no Fig. 3.5.1 SYSTEM R Storage Structure overflow tuple TID header area 25 3.4 INGRES INGRES supports five storage structures. Four of these are keyed, that is the storage location of any tuple within a file that contains a relation is determined as a function of the value of the tuple's primary key. These are termed hashed, ISAM, compressed hashed, and compressed ISAM. The non-key structure is termed a heap, and stores tuples of a relation independently of their key values. The non-key structure involves a low overhead, and is intended for relations of small cardinality, or temporary relations cre- ated as intermediate results during query processing. New tuples are simply appended to the end of a file in the order supplied, and their byte offsets within the file become their tuple identifiers. The four keyed storage structures share a common page layout shown in Fig. 3,b,l, A tuple identifier consists of a page ad- dress and a line number in the fixed line table (header area in RM/XRM pages). The referenced entry in the line table con- tains the byte address of the tuple. The "next primary page" pointers chain together all primary pages. A relation is init- ially represented as a linked list of primary pages in a file. The "next overflow page" pointers link together all overflow pages containing tuples that do not fit in the primary pages. The general structure of a keyed file is illustrated in Fig. 3.^.2. Each keyed file, consisting of primary and overflow pages, is dedicated to the storage of a single relation. This decision not to cluster tuples from different relations that 26 may be accessed together was made so that the INGRES access meth- ods and query decomposition procedures (see section 4.2) would be manageable. The decision was also a direct consequence of the fact that INGRES provides no memory management or I/O facili- ties of its own. Logically adjacent pages in a UNIX file are not necessarily physically adjacent, and only a small number of tuples from different relations may be grouped in a page of 512 bytes that UNIX supports. In a hashed file, tuples are distributed randomly throughout the file according to the values of their primary keys. In an ISAM file, tuples of a relation are initially sorted to produce an ordering on a particular key. The heighest key values in all the primary pages comprising the file are recorded in a multilevel directory in several pages following the primary pages within the file itself. This directory is static, that is, it does not change during subsequent insertion, deletion, and update of tuples. Tuples in a primary page and its overflow pages are not maintained in sort order. An INGRES ISAM file is similar to IBM's ISAM. A good discussion on ISAM can be found in Chapter 7-6 of (40). An ISAM file is useful in situations where the key value is spe- cified as falling within a range of values, whereas a hashed file is appropriate when access is conditioned on a specific key value. In both keyed storage structures mentioned above, fixed length tuples are stored. INGRES implements a compressed hashed file 27 and a compressed ISAM file. These schemes use a hashed file and an ISAM file in conjunction with some data compression tech- niques (41). Currently blanks and portions of a tuple which are identical to the preceding tuple are suppressed. The com- pressed files are useful when an increase in storage utiliza- tion outweighs the overhead involved in encoding and decoding the variable length field values of tuples. 3.5 SYSTEM R The SYSTEM R storage structure is illustrated in Fig. 3* 5.1. A segment is an address space in VM/370 virtual memory. Each segment consists of a set of pages. A page is the unit of data transfer between the main memory and secondary storage devices. Each segment may contain many relations! however, each relation is entirely contained within a segment. Unlike in INGRES, each page may contain tuples from different relations, so that extra page accesses may be avoided when tuples from different relations are accessed together? however, each tuple is entirely contained within a page. Some pages are reserved for the storage of tuples of user rel- ations (data pages), while other pages are intended for the storage of index structures and system catalogs. Also several types of segment are supported to satisfy different requirements with appropriate overheads. For example, one type of segment is used for the storage of shared data. Facilities must be pro- vided for concurrent access, transaction backout, and segment 28 recovery for the data of this segment. A different segment type is used for the storage of temporary relations generated as intermediate results of a query processing. The overhead incur- red for the support of a segment for shared data is not necessary for this type of segment. Another type of segment is used for maintaining the access path structures or internal catalogs. Each tuple of a relation is stored as a contiguous sequence of field values along with a prefix. The prefix contains the rela- tion identifier, the number of fields in the tuple, pointer fields that implement the links (see section 4.3), and the num- ber of pointer fields. As in RM/XRM and INGRES, each tuple is associated with a tuple identifier which is a concatenation of a page number and an offset within a header area. Tuples may be moved within a page to allow space compaction, and their new byte addresses entered in their slots in the fixed header area. Thus the tuple identifiers generated before the space compac- tion correctly reference the corresponding tuples after the space compaction. In case a tuple is updated to a longer total value and space is no longer available in the current page, it is tagged with a new tuple identifier in a nearby page (see Fig. 3.5.1). If the new tuple overflows again, the original tuple is tagged with a tuple identifier in a page that does not cause the tuple to overflow. A tuple access via its tuple identifier thus involves at most two page accesses. Each tuple identifier is generated and used by the system (RSS) to refer to tuples from index structures and to maintain pointer chains. This is further discussed in section 4.3. 29 k. QUERY PROCESSING AND OPTIMIZATION A number of interesting works have been done on optimizing the implementation of high level data sublanguages based on the rela- tional calculus or relational algebra. In (4-) Codd presents an algorithm, called "Codd's reduction algorithm," for translating a relational calculus expression into a sequence of relational algebra operations. Palermo made certain improvements in the efficiency of the reduction algorithm and implemented the alge- bra operators in APL (^2). Smith and Chang (12), Hall (26), and Pecherer (4-3) have explored techniques for translating an alge- braic query expression into an equivalent, but more efficient, form. Astrahan and Chamberlin (15) have proposed techniques for decomposing the block-structured SEQUEL statements. Rothnie (10, 11) and Wong and Youssefi (3*0 have implemented iterative decom- position techniques for reducing a calculus-based query involving multiple range variables into a sequence of queries involving only single range variables. Senko, et. al. (44) and Tsichritzis (4-5) have proposed schemes in which a non-procedural expression is compiled into a set of procedures in a hierarchical or network oriented language. In (46) Gotlieb presents and compares a num- ber of algorithms for implementing a join. In (47) Blasgen and Eswaran present an evaluation of several techniques for imple- menting join, projection, and selection operations on the basis of their costs in secondary storage accesses. The optimizing techniques implemented for PRTV, INGRES, and SYS- TEM R are examined in this section. 30 k.l PRTV PRTV generates a query tree for a sequence of relational algebra operations specified by the user. The PRTV optimizer then reorg- anizes the query tree according to the following principles i 1. Selections are moved as far down the tree (toward the leaves) as possible for the earliest possible execution. .?. Projections not requiring the results to be sorted are also moved as far down the tree as possible. 3. Projections requiring sorted results, however, are moved as far up the tree as possible for the latest possible execution. 4. Projections of a projection on the same relation are merged into a single projection. 5. Expressions involving several set operators are rearranged according to such standard rules as commutativity and dis- tributivity . 6. Common subtrees within a query tree may be realized once into an explicit relation so that duplication of the oper- ations may be avoided. Note here that the optimizer esti- mates the cost of creating an explicit relation and the cost of duplicating the operations, and chooses the cheaper alternative. 7. The optimizer also selects among alternative implementations of the relational operators. 31 k.2 INGRES Query processing in INGRES is done in process 3 via the OVQP (one-variable query processor) and DECOMP (decomposition) rou- tines. The OVQP is invoked to completely process a query which involves a single range variable. The DECOMP is called to re- duce a query involving more than one range variable into a se- quence of one-variable queries. The DECOMP makes iterative calls to the OVQP to process the intermediate one-variable que- ries, and merges the results of these calls. The OVQP and DE- COMP are discussed in detail below. OVQP This program consists of a module called STRATEGY which selects an optimal access strategy for retrieving tuples from a single relation, and a module called SCAN which evaluates each retrieved tuple against the predicates of the query and creates the output list of values from the qualifying tuples. DECOMP Three techniques are used to reduce a multi-variable query to a query involving one less range variables! tuple substitution, one variable detachment, and reformatting. 1) tuple substitution One range variable in the query is selected, and the relation over which the variable ranges is scanned one tuple at a time. All references to the selected variable in the query are replaced by appropriate field values from the retrieved tuple. The origi- 32 nal query now has one less variable. Note that the variable which ranges over a relation with the smallest cardinality is selected for substitution. As an example, consider how the following 2-variable query is reduced by tuple substitution. (Find the names of employees who earn more than their managers) RANGE OF E, M 13 EMP RETRIEVE (E.NAME) WHERE E.SAL > M. SAL AND E.MGR = M.NAME The two range variables, E and M, in this QUEL statement both range over the EMP relation. Suppose that E is selected for substitution, and that a tuple (Jones, 2, salesman, 18K, 30, Smith) of EMP is retrieved. Substitution of Jones, 18K, and Smith for E.NAME, E.3AL, and E.MGR, respectively, yields the following one variable query « RANGE OF M IS EMP RETRIEVE (Jones) WHERE 18K> M.SAL AND Smith = M.NAME 2) one variable detachment Recall that the qualification portion of a QUEL statement is converted to an equivalent conjunctive normal form expression (see section 2.6). Let us define a simple clause to be a con- junct of the form (variable. column comparator constant). For example, E.SAL = 16K is a simple clause, but E.MGR = M.NAME is not. One variable detachment is done in two steps as follows t 33 First, for each range variable V in a simple clause C, create a temporary relation T by first performing a selection over R, the relation over which V ranges, with predicate C, and then perform- ing a projection over all columns of R that are involved in the target list and all remaining conjuncts of the qualification. For example, consider the following QUEL statement! (Find the names of employees over 50 years of age who work on the third floor and earn more than their managers) RANGE OF E, M IS EMP RANGE OF D IS DEPT RETRIEVE (E.NAME) WHERE E.SAL > M.SAL AND S.MGR = M.NAME AND E.DNO = D.DNO AND D.LOC = 3 AND E.AGE > 50 Temporary relations Tl and T2 are created for the two simple clauses in the query, D.LOC ■ 3, and E.AGE > 50. Tl is defined over DNO, while T2 is defined over NAME, SAL, MGR, and DNO. RANGE OF D IS DEPT RETRIEVE INTO Tl (D.DNO) WHERE D.LOC = 3 RANGE OF E IS EMP RETRIEVE INTO T2 (E.NAME, E.SAL, E.MGR, E.DNO) WHERE E.AGE > 50 Each temporary relation thus generated is stored in a hashed file (see section 3.*0« The hash key is chosen as follows i Non-simple clauses of the form V.D = VI. Dl in the original que- ry are collected, where V and VI are range variables, D and Dl are column names, and V ranges over temporary relation T. If this collection is empty, T is stored in a heap; otherwise, a D is arbitrarily selected. 34 Consider our example query again. For variable D, which may now range over Tl, our collection contains only one clause, D.DNO = E.DNO. Therefore, Tl is stored hashed on DNO. For variable E, which may now range over T2, our collection consists of two clauses, E.MGR = M.NAME, and E.DN0=D.DN0. T2 may then be hashed on either MGR or DNO. T2 may have to be restructured with a different hash key, if this arbitrary selection of a key turns out to be inappropriate (see reformatting below). Second, all simple clauses are removed from the original query, and all the relations over which the variables involved in the simple clauses are replaced by the newly generated temporary relations. Remember these temporary relations are horizontal and vertical subsets of the original relations. The result of one variable detachment is shown below. RANGE CF D IS Tl RANGE OF E 13 T2 RANGE OF M 13 EMP RETRIEVE (E.NAME) WHERE E.SAL > M.SAL AND E.MGR - M.NAME AND E.DNO = D.DNO 3) reformatting Consider as an example the 3-variable query above. Suppose that Tl is hashed on DNO, T2 is hashed on MGR, and Tl has the smallest cardinality. Then range variable D is selected for tuple substi- tution. When tuple substitution is made, E.DNO = D.DNO becomes E.DNC = constant, and the resulting simple clause can now be re- moved by one variable detachment. Since one variable detachment must be performed for each tuple of Tl over which D ranges, re- 35 formatting T2, over which E ranges, to be hashed on DNO may sig- nificantly improve its processing. Therefore, if T2 was hashed on MGR (see one variable detachment above), it is rehashed on DNC here. Before concluding this subsection, a few reasons for the deferred update mentioned in section 2.6 are explained. The examples used in (31) are very illuminating, and are largely transplanted here. 1 ) secondary index anomaly Suppose each employee earning under 15K is to be given a 10 per- cent raise. Suppose also that a secondary index on SAL is main- tained as shown in Fig. 4.2.1a. Fig. 4.2.1 Secondary Index Anomaly a) SAL TIP b) SAL TIP c) SAL TIP 12K Tl 13K T2 13. 2K Tl 13K T2 13. 2K Tl 14. 3K T2 Suppose update is immediate. The OVQP makes use of this index j first it updates the tuple identified by Tl, and then updates the index entry (12K Tl ) to (13.2K Tl ) to reflect this. After the next qualifying tuple, T2, is processed, the OVQP encounters Tl again. In this way a qualifying tuple may be updated an in- definite number of times! See Figs. 4.2.1b and 4.2.1c. 2) base relation anomaly Suppose each employee earning more than his or her manager is to take a 10 percent pay cut. Consider the database fragment shown in Fig. 4.2.2. 36 Fig. 4.2.2 Base Relation Anomaly NAME SAL MGR Smith 10K Jones Jones 8K Killer Brown 9.5K Smith Suppose update is immediate here. The only employee qualifying for a pay cut is Smith. Smith now earns 9K. Note that Brown now earns more than Smith, and as a consequence of the immediate update of Smith's tuple, Erown has to take a pay cut, tool It has been pointed out that deferred update incurs a heavy over- head, and an optional immediate update (in process 3) is now be- ing considered. 4.3 SYSTEM R A brief discussion on the access paths that SYSTEM R supports is in order before the SYSTEM R optimizer is examined. SYSTEM R supports three access paths for tuples of relations; images, links, and relation scans. An image is conceptually a binary relation consisting of values from one or more fields, called sort fields , of a relation in ascending or descending order, and their corresponding tuple identifiers. At most one clustering image for a relation is supported, in which tuples whose sort values are "close" are stored physically "close" to one another. Access to any tuple of a relation with an image support is accomplished by keying on the sort field value. Access to a sequence of tuples within a range of sort field 37 values can also be done efficiently on a relation with an image support. Each image on a relation resides in one or more pages within the segment in which the relation is stored. The pages in which an image resides form the nodes of a balanced B-tree (for a comprehensive discussion on B-trees, see Chapter k.5 of k8 ). A link, on the other hand, is a doubly linked list that connects tuples in one relation (a unary link), or a tuple of one relation with a sequence of tuples of another relation (a binary link). A unary link may be used to maintain for a rel- ation a tuple ordering that is not ascending or descending on any sort fields. It may also be used to connect all tuples of a relation for fast access without the time penalty of an en- tire relation scan (see below). A binary link may significantly enhance the processing time for a join. For example, if a bin- ary link is maintained between each DEPT tuple and all EMP tuples in our example database based on value matches in DNO columns of both relations, to respond to a query "Find the names and salaries of all employees in department 5." an appropriate DEPT tuple is first found and then its binary link is followed to retrieve all required EMP tuples. Finally, a relation scan is simply a sequential search through all data pages of a seg- ment that contains the relation. Since tuples from different relations may be stored in one page (see section 3»5)i the pre- fix of each tuple retrieved must be examined to determine if the tuple belongs to the required relation. 38 Now we are ready to understand the SYSTEM R optimizer. The objective of this optimizer is to minimize the expected num- ber of page fetches from secondary storage. The optimizer selects an optimal method of executing a query as follows « first, it classifies the query into one of several types according to the presence of some particular language features (e.g., join, group by); second, it examines the system catalogs to determine the set of access paths available for the relations involved; third, it determines the set of "reasonable" methods for executing the query from the set of access paths found; and fourth, it evalu- ates the expected cost formula for each "reasonable" method, and selects the minimum cost method. The parameters of the cost formula include the relation cardinality R, the number of data pages occupied by the relation D, the average number of tuples per data page T (= R/D), the image cardinality I (the number of distinct sort field values in a given image), and the esti- mated CPU cost for tuple comparisons H. Some of the "reasonable" methods for executing a simple query involving a single relation are illustrated below along with their expected cost. To be concrete, the following SEQUEL que- ry for finding the names and salaries of all programmers over 25 years of age is considered. SELECT MAO, SAL FRCP.'! BMP WHERE JOB = programmer AND AGE > 25 1) use of a clustering image that matches a predicate of the form (column = constant). An image "matches" a predicate 39 if the sort field of the image matches the column name of the predicate. If EMP has the clustering image on JOB, this op- tion is applicable. The expected cost to retrieve all result tuples is R/(T x I) page accesses. 2) use of a clustering image that matches a predicate of the form (column comparator constant). If EMP has the clustering image on AGE, this option is applicable. Assuming that half the tuples in the relation satisfy the predicate, the expected cost is R/(2 x T) . 3) use of an image (non-clustering) that matches a predicate of the form (column = constant). Since each tuple requires a page access, the expected cost is R/l. *0 use of an image that matches a predicate of the form (column comparator constant). The expected cost is R/2. 5) use of a clustering image that does not match any predicate. The image is scanned, and each tuple retrieved is evaluated against all predicates. The expected cost is R/T + H x R x N, where N is the number of predicates in the query. 6) use of an image that does not match any predicate. The ex- pected cost is R + H x R x N. For a more complete discussion on the processing of a simple query, see (47) . Now let us examine some of the "reasonable" methods for executing a query involving a join of two relations. A detailed discussion on these methods is given in (47). 40 Again to be concrete, let us consider a SEQUEL query that finds the names and jobs of all employees who work on the third floor and earn over 20X. SELECT NAME, JOB FROM EMP, DEFT WHERE EMP. SAL > 20 K AND EMP.DNO = DEPT.DNC AND DEPT.LCG = 3 1) use images on join fields of both relations i a. Scan along the image on DEPT.DNC to retrieve a DEPT tuple whose LOC value is 3« b. Scan along the image on EMP.DNO to retrieve all EMP tuples whose DNC values match the current DEPT.DNO value and whose SAL values are greater than 20K. c. Repeat steps a and b until the image on DEPT.DNC is com- pletely scanned. 2) sort both relations along the join fields » a. Scan EMP using its clustering image, and create a tempora- ry relation Rl with NAME, JOB, and DNO fields from those tuples of EMP whose SAL values are greater than 20K. b. Scan DEFT using its clustering image, and create a tempora- ry relation R2 with DNC field from those tuples of DEPT whose LOC is J, c. Sort Rl and R2 on DNC fields. d. Apply method 1) above. 3) use multiple passes : a. Scan DEPT using its clustering image, and insert into a main memory data structure W the DNC values of all DEPT 41 tuples whose LOC values are 3. W is a unary relation in our example. If W can hold only N such tuples, the small- est N tuples are placed in W during current pass. The remaining tuples are inserted into W during subsequent passes. b. Once W is constructed, scan EMP using its clustering image. For each EMP tuple T whose SAL value > 20K, T. NAIVE and T.JC3 values are placed in the output list, if T.DNO is found in W. c. If W was not large enough to hold all qualified DEPT tuples, steps a and b are repeated. 4) use the TID algorithm : a. Construct a sorted list Rl of the tuple identifiers of the EMP tuples whose SAL values > 20K using the image on EMP. SAL. b. Construct a sorted list R2 of the TIDs of the DEPT tuples whose LOC values = 3 using the image on DEPT. LOC. c. Perform a simultaneous scan over the DEPT.DNO and EMP.DNC images to find the TID pairs of tuples whose DNO values match. If Tl and T2 of each such TID pair (Tl, T2) are in Rl and R2, respectively, then the tuple referenced by Tl is retrieved and its NAME and JOB field values are placed in the output list. The use of a binary link in processing a join was explained ear- lier in this subsection. Remember that each of these ' options is applicable only if its requisite access paths are available. h2 SYSTEM R currently relies on the users to dynamically create and delete all access paths maintained on their base relations. Tech- niques are being investigated that will shift some of these bur- dens away from the users. 5. SUPPORT CF USER VIEWS An individual user's view of the database may be derived from the underlying relations by permuting their columns, or applying sel- ection, projection, or join operations. A view can be any deriv- able relation as far as supporting retrieval operations is con- cerned. For supporting storage operations (insert, delete, up- date), however, a view must be identical to the underlying base relations, except that individual tuples and non-primary key fields may be omitted in the view. This restriction is essential in propagating the effects of the storage operations down to the underlying base relations. Systems that support user views in- clude I 3/1, PRTV, SEQUEL, ZETA, and SYSTEM R. A good introduc- tion to the notion of view is given in (49) and Chapter 10 of (6). This section examines the view mechanisms implemented for I 3/1, ZETA, and SYSTEM R. 5.1 13/1 In is/l a view definition is translated into an equivalent string of intermediate codes (see section 2.2) and stored away (an impli- cit view). When this view is accessed for the first time, its definition is retrieved and executed to yield the desired rela- tion. The relation thus materialized is then stored as a new base relation (an explicit view), so that subsequent accesses to 20 K An integrity constraint is in force on EMP: RANGE 0? E IS EMP INTEGRITY CONSTRAINT IS E.SAL < 30K Smith is given the authority to update salaries of employees he manages : RANGE OF E IS EMP PROTECT EMP FOR ALL (E.3AL; E.NAME, E.DNO) WHERE E.MGR = Smith ( Smith is authorized only to read the names and department numbers of his employees) Now suppose that Smith wishes to give Jones a 20 percent pay raise by issuing the following QUEL statement: RANGE OF S IS VEMP REPLACE E (SAL = 1.2 * S.SAL) WHERE E.NAME = Jones Since this query is against a view, it is first modified as follows : RANGE OF E IS EMP REPLACE E (SAL = 1.2 * E.SAL) WHERE E.NAME = Jones AND E.SAL > 20K Note that the predicates of the query and the view definition are merged, and also that the range variable E now ranges over the base relation EMP. 50 ^he integrity constraint is next merged into this modified query to yield the following: RANGE CF E IS EMP REPLACE E (SAL = 1.2 * E.3AL) WHERE E.NAME = Jones AND E.3AL > 20 K AND 1.2 * E.3AL -c 30K Finally, the authorization restriction is merged into the query, and the resulting query is ready to be passed down to process 3» RANGE CF E 13 EM? REPLACE E (3AL = 1.2 * E.3AL) WHERE E.NAKE = Jones AND E.3AL "> 20K AND 1.2 * E.SAL < 30K AND E.MGR = Smith 6.3 SYSTEM R SYSTEM R supports extensive data control facilities, including authorization, integrity assertions, transactions, and triggers. 1) authorization SYSTEM R does not provide for a centralized database administra- tor. Each user creates and deletes his own base relations, and defines and destroys views on them. A user also grants and re- vokes selected privileges on his views and base relations to other users. These privileges may be any combinations of the following: read, insert tuples, delete tuples, update selected fields of tuples; specify access paths, triggers, and integrity assertions; and grant or revoke any of these privileges to still other users. A detailed account of the SYSTEM R authorization mechanism is given in (51). 51 The nested granting of privileges and their subsequent revocation complicate the implementation of the authorization mechanism. When a user creates a base relation, he is fully authorized to perform any operation on it, and also to grant any or all privi- leges on it to any other user. The update privilege is held on each column of the base relation. The privileges that any user nolds on a base relation are either grantable or not grantable. The set of privileges that the user may grant to other users is the union of all grantable privileges he holds on the relation. The set of privileges that the user holds on the relation, on the other hand, is the union of all grantable and non-grantable privileges he holds on the relation. A user may revoke only those privileges he previously granted. When a user A revokes a grantable privilege P from another user B, P is withdrawn from user B and all other users to whom user B may have granted P. Note that even when user A revokes P from user B, user B may still retain ? if another user C previously granted P to user B. Consider the following situation, assuming that Smith is the cre- ator of the EMP relation: 1 . Smith grants to Jones the read and insert privileges on EMP with the grant option. 2. Smith grants to Brown the read privilege on EMP with the grant option. 3. Jones grants to Brown the insert privilege on EMP. Erown is then authorized to insert tuples into the EMP relation, make queries against the relation, and grant the read privilege to other users. Note also that the read privilege permits Erown 52 to define views on the EMP relation. Now if Smith revokes all privileges from Brown, Brown loses the read privilege and the privilege to grant the read privilege, but retains the insert privilege that Jones granted to him. If, on the other hand, Smith revokes the grant privilege from Jones, brown loses the insert privilege as a result. When a user defines a view, his privileges on it are restricted to only those privileges that he holds on the underlying views and base relations. If the view involves more than one relation, the user's privileges are defined as the intersection of the pri- vileges held on all underlying relations. The user may grant a orivilege ? on the view if and only if he holds a grantable privi- lege P on all underlying relations. The update privilege is held on a column of the view only if it is held on that column of all underlying relations in which the column appears. 2) integrity assertions SYSTEM R supports integrity assertions that describe either valid states (state assertions) or valid state transitions (transition assertions). For example, the assertion, "No employee should earn less than 10K, " is a description for a valid state; while the assertion, "No employee should be given a pay cut, " constrains a state transition. SYSTEM ?. allows both state and transition assertions to be imposed on individual tuples of a relation or on sets of tuples. For ex- ample, "The average salary of employees of any department must not exceed 15K" is a state assertion applied to a set of tuples, 53 while ,,rv he average salary of employees of any department must not rise by more than 10 percent a year" is a transition asser- tion on a set of tuples. 3Y3TE?" P, furthermore, allows the user to specify a sequence of SEQUEL statements as a transaction , and checks for and enforces integrity constraints at the end of each transaction by default, so that consistency constraints may be applied to a sequence of complex storage operations which may pass through states that temporarily violate these constraints. For example, in transfer- rins money from Smith's savings account to checking account, the consistency constraint that Smith maintain a constant total bal- ance is temporarily violated when the savings account is debited but the checking account is not yet credited. State assertions apply to the final state of the database, while transition asser- tions compare the states of the database before and after the transaction. When integrity assertions are declared to be "immediate," however, they are enforced immediately, even on intermediate stages of a transaction. An immediate assertion on a tuple is enforced when- ever a tuple is updated, while an immediate assertion on a set of tuples is enforced after each SEQUEL update statement is exe- cuted. SYSTEM 3 allows the user to specify save points within a transaction and back up either to the beginning of the transac- tion or to any save point by undoing all changes made to the data- base by the transaction. The user may also establish integrity points (similar to the transaction save points) within a long transaction in order to have the transaction backed out only to 5^ its most recent integrity point. Integrity constraints are en- forced at each integrity point within a transaction. Finally, integrity assertions are dynamically created and deleted by the user. The SYSTEM R approach to integrity control is des- cribed in detail in (52). 3) Triggers A trigger is a sequence of SEQUEL statements that is automatical- ly executed when a pre-specified condition is satisfied. For ex- ample, whenever an EMP tuple is inserted, or deleted, or its DNC field is modified, the NEMP3 field of an appropriate DEFT tuple (EMP.DNO = DEPT.DNO) must be modified to reflect the change in the EM? relation. The following trigger, named EMPIN3ERT, auto- matically increments the NEMPS field of a DEPT tuple when a new EMP tuple is inserted: DEFINE TRIGGER EMPIN3ERT CM INSERT OF EMP t (UPDATE DEPT SET NEMPS = NEMPS + 1 WHERE DNC = NEW EMP.DNO) Triggers EMFDELETE and SMPUPDATE can also be easily defined for execution when an EMP tuple is deleted, and when the DNO field of an EMP tuple is modified, respectively. 55 7. CONCURRENCY CONTROL Systems that support concurrent updates on a shared database in- clude GMI3, INGRES, and SYSTEM R. This section discusses the concurrency control implemented for INGRES and SYSTEM R. 7.1 INGRES The current version of INGRES provides only a crude concurrency control. The unit of data consistency is one QUEL statement. INGRES uses physical locks on columns of relations, and avoids deadlocks by not allowing an interaction to pass down to process 3 until it can lock all required resources. When a lock request is made, the LOCK catalog is first physically locked, and then checked for the presence of the lock. If the lock is not in the catalog, it is placed in the catalog, and the lock request is satisfied. If the lock is already in the catalog (i.e., the lock is held by another concurrent user), the concurrency processor of process 2 waits for a fixed interval, and then tries again. The concurrency processor deletes appropriate locks from the cat- alog only when it receives a termination message from process 3» 7.2 SYSTEM R SYSTEM P. implements locking techniques at the logical level of relations and tuples as well as at the physical level of pages. Locking: at the logical level is implemented to support isolated backout of a transaction and retrieval of "clean" data as well as synchronization of concurrent updates. The isolated backout 56 of a transaction refers to the data consistency constraint that the backout of a transaction's updates must not undo another concurrent transaction's updates. A data item remains "dirty" as lonr as the transaction within which it has been updated may be backed out; it becomes "clean" only when the transaction goes to completion. The classic "lost update" problem is also resolved by the logical level locking. v.'hen the system locks relations, tuples, or index values at the logical level, it must also acquire locks at the physical level of data or index pages to insure accurate results. For example, transaction Tl makes an access to the EMP relation, and transac- tion m 2 to the DEPT relation. Suppose tuples from the two rela- tions are stored in the same page. Tl may obtain the byte address of an SMP tuple from a header area slot; and before Tl retrieves the tuple, T2 may retrieve a DEPT tuple and invoke a data compac- tion routine to reassign tuple addresses in the page. The tuple address that Tl previously retrieved may not point to the correct tuple any longer. The logical locks set on the two relations thus do not guarantee accurate results. Physical locks are held for the duration of a single SEQUEL state- ment execution, while logical locks are held until they are expli- citly released or to the end of a transaction. Locks may either be shared or exclusive. If a transaction is only to read a data item, it sets a shared lock on the data item and allows other con- current transactions to read the same data item. If a transaction holds an exclusive lock on a data item, no other concurrent trans- actions are allowed to read or update the data item. 57 SYSTEM R locks data objects at various granularities so that ap- plications accessing different amounts of data are run efficient- ly. Locks can be set on individual tuples, or an entire relation, or even an entire segment. For a transaction requiring an access to a small amount of data, it would be reasonable to lock individ- ual tuples; while it may be more efficient to lock entire rela- tions, if a transaction accesses a very large amount of data. Unlike INGRES, which avoids deadlocks, SYSTEM R is designed to periodically check for deadlock situations, and recover from deadlocks by backing out one or more transactions on the basis of their relative ages and on the duration of their locks. SYSTEM R implements three levels of consistency in isolating each transaction from the effects of other concurrent transac- tions. At any level of consistency the isolated backout of a transaction is guaranteed. The consistency levels differ in the degree of isolation of each transaction from other concur- rent transactions, and in the processing overhead. A transaction at level 1 consistency may read "dirty" data, and different values for the same data during the same trans- action. Although it offers the least degree of isolation of a transaction from others, this level incurs the least proces- sing overhead, and may be entirely satisfactory when exact results are not required. At this level no logical locks are required for a read operation. At level 2 consistency, a transaction reads only "clean" data? however, it may still read different values for the same data 58 durinr the same transaction, since the data may be updated by another transaction and become "clean" before the first trans- action goes to completion. If a tuple is read, a shared lock must be held on the tuple for the duration of the read opera- tion. At level 3 consistency, the user sees the logical equivalent of a sinrle user system. Each transaction reads only "clean" data, and the same values for the same data during the same transac- tion. Although it guarantees complete isolation of a transac- tion from other concurrent transactions, this level incurs the highest processing overhead. If a tuple is read at this level of consistency, a shared lock must be held on the tuple until the end of the transaction. It is noted that if a tuple is updated, an exclusive lock must be held on the tuple for the duration of the transaction for all three levels of consistency. The SYSTEM P. approach to concurrency control is given in (37 » 38). 8. SYSTEM CHECKPOINT AMD RECOVERY ^ost of the early prototype relational database systems do not provide functions to restore the database to a consistent state in the event of a system crash. In this section the system checkpoint and recovery schemes implemented in INGRES and SYSTEM ?. are presented. 8.1 INGRES 59 Recall that for an update operation all qualifying tuples are isolated and spooled onto a temporary file in process 3 and the actual update of the base relations and the secondary in- dices maintained on them is performed in process b. INGRES ac- complishes recovery from system crashes that leave the disk intact by either destroying the temporary deferred update file if process 4 has not started processing it, or by allowing the deferred update file to be completely processed if process k has already started processing it. To recover from system crashes that destroy the disk files, the INGRES "superuser " logs all database interactions, and checkpoints the database onto tape using the UNIX backup scheme. Recovery is then accomplished by restoring the last checkpoint and running the log of interactions. 8.2 3Y3TEP4 R The SYSTEM R scheme for recovery from system crashes that leave the disk storage intact (i.e., only the contents of main memory are destroyed) and also from those that destroy the disk storage is based on the use of transaction logs and a set of segment re- covery functions. The set of segment recovery functions includes the RSI operators OPEN_SEGMENT, 3AVE__3EGMENT , and RE3TCRE_3EGMENT . To efficiently support these functions, two page maps, called current and backup, are associated with each recoverable segment. When a segment is first opened by the 0PEN_3EGMENT operator, these two page maps contain identical entries. V/hen a page of the segment is access- 60 ed for an update operation for the first time since the segment was opened or was last saved (see below), the page is fetched into a main memory buffer, a new page slot is allocated on the disk, and the appropriate entry in the current page map is mod- ified to reference the new page slot. The page is then updated in its main memory buffer, and written out on the new page slot. The 5AVE__3EGMENT operator causes all pages of a segment that have been updated in their main memory buffers to be written onto their newly allocated page slots. All old page slots for those pages that have been updated since the last segment save point are released, and the backup page map entries are set equal to their corresponding current page map entries. The P.E3TCRS_3SGMENT operator causes the current page map entries to be set equal to their corresponding backup page map entries and all newly allocated page slots to be released, since the backup page map points to a consistent copy of the segment. lecrments dedicated for user relations and segments intended for the logs of transactions are both recovered via these re- covery functions. For recovery from disk- intact system crashes, the system fre- quently checkpoints segments containing user relations. The transaction log segments are checkpointed at the end of each transaction, and contains both the old and new values of all updated data objects. For recovery from non-disk-intact system crashes, the system not only checkpoints data segments and transaction log seg- ments, but also copies the saved pages from disk to tape. 61 REFERENCES 1. Codd, E.F. A Relational Model for Large Shared Data Banks, CACM vol. 13, no. 6, June 1970, pp. 377-387 2. Codd, E.F. A Data Base Sublanguage Founded on the Relational Calculus, Proc. 1971 ACM SIGFIDET Workshop on Data Descrip- tion, Access and Control, San Diego, Calif., Nov. 1971 3. Codd, E.F. Further Normalization of the Data Base Relational Model, Courant Computer Science Symposia, vol. 6, Data Base Systems, Prentice Hall, N.Y., May 1971 4. Codd, E.F. Relational Completeness of Data Base Sublanguages, Courant Computer Science Symposia, vol. 6, Data Base Systems, Prentice Hall, N.Y., May 1971 5. Codd, E.F. Recent Investigations in Relational Data Base Sys- tems, Proc. IFIP Congres 74, Stockholm, Sweden, Aug. 1974 6. Date, C.J. An Introduction to Database Systems, 2nd. edition, Addison-Wesley, 1977 7. Chamberlin, D.D. Relational Data-Base Management Systems, Computing Surveys, vol. 8, no. 1, March 1976 8. Whitney, V.K. A Relational Data Management System, Proc. 1974 ACM SIGMOD Workshop on Data Description, Access and Con- trol, Ann Arbor, Mich., April 1974, pp. 55-66 9. Steuert, J., and J. Goldman. The Relational Data Management System: A Perspective, Proc. 1974 ACM SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Mich., April 1974, PP. 295-320 10. Rothnie, J.B. An Approach to Implementing a Relational Data Base Management System, Proc. 1974 ACM SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Mich., April 1974 11. Pothnie, J. 3. Evaluating Inter-Entry Retrieval Expressions in a Relational Data Base Management System, Proc. AFIPS 1975 National Computer Conference, vol. 44, pp. 417-423 12. Smith, J.M., and P.Y.T. Chang. Optimizing the Performance of a Relational Algebra Database Interface, CACM vol. 18, no. 10, Oct. 1975, PP. 568-579 13. Boyce, R.F., and D.D. Chamberlin. A Structured English Que- ry Language, Proc. 1974 ACM SIGMOD Workshop on Data Descrip- tion, Access and Control, Ann Arbor, Mich., May 1974 62 14. Boyce, R.F., and D.D. Chamberlin. Using a Structured English Query Language as a Data Definition Facility, IBM Research Report: RJ 1318, San Jose, Calif., Dec. 1973 15. Astrahan, M.M., and D.D. Chamberlin. Implementation of a structured English Query Language, CACM vol. 18, no. 10, Oct. 1975, PP. 580-588 16. Zloof, Wi.M. Query By Example, Proc. AFIP3 1975 National Com- puter Conference, vol. 44, May 1975 17. Zloof, M.F. Query By Example: The Invocation and Definition of tables and Forms, Proc. International Conference on Very Large Data Bases, Framingham, Mass., Sept. 1975 18. Reisner, F. , R.F. Boyce, and D.D. Chamberlin. Human Factors Evaluation of Two Data Base Query Languages — SQUARE and SEQUEL, Proc. AFIPS 1975 National Computer Conference, vol. 44, May 197 5 19. Thomas, J.C., and J.D. Gould. A Psychological Study of Query By Examole, Proc. AFIPS 1975 National Computer Conference, vol. 44* May 1975 20. Donovan, J.J., R. Fessel, S.S. Greenberg, and L.M. Gutentag. An Experimental VM/370 Based Information System, Proc. Inter- national Conference on Very Large Data Bases, Framingham, Mass. Sept. 1975, PP. 549-553 21. Codd, E.F. Seven Steps to Rendezvous with the Casual User, Proc. IFI? TC-2 Working Conference on Data Base Management Systems, April 1974, North-Holland 22. Goldstein, R.C., and A.J. Strnad. The MacAIMS Data Manage- ment System, Proc. 1974 ACM SIGMGD Workshop on Data Descrip- tion, Access and Control, Ann Arbor, Mich., May 1974 23. Strnad, A.J. The Relational Approach to the Management of Data Bases, Proc. IFIF Congres 1971, pp. 901-904 24. Not ley, M.S. The Peterlee 13/1 System, IBM (UK) Scientific Centre Report: UKSC - 0018, March 1972, Peterlee, England 25. Todd, 3. J. P. The Peterlee Relational Test Vehicle — A Sys- tem Overview, IBM Systems Journal, vol. 15, no. 4, 1976, pp. 285-307 26. Hall, P.A.V. Optimisation of a Single Relational Expression in a Relational Data Base System, IBM (UK) Scientific Centre Report: UKSC - 0076, June 1975, Feterlee, England 27. Lorie, R.A. XRM -- an Extended (n-ary) Relational Memory, IBM Scientific Center Report: G320-2096, 1974, Cambridge, Mass. 63 28. Astrahan, M.M., and R.A. Lorie. SEQUEL-XRMi A Relational System, Proc. ACM Pacific Regional Conference, April 1975» pp. 3^-38 29. Czarnik, B., 3. Schuster, and D. Tsichritzis. ZETA 1 A Rela- tional Data Base Management System, Proc. ACM Pacific Region- al Conference, April 1975, pp. 21-25 30. Mylopoulos, J., S. Schuster, and D. Tsichritzis. A Multi- Level Relational System, Proc. AFIPS 1975 National Computer Conference, vol. kk t May 1975. PP- 403-408 31. Stonebraker, M.R., E. Wong, and P. Kreps. The Design and Im- plementation of INGRES, ACM Transactions on Database Systems, vol. 1, no. 3, Sept. 1976, pp. 189-222 32. Stonebraker, M.R. Implementation of Integrity Constraints and Views by Query Modification, Proc. ACM 1975 SIGMOD Work- shop on Management of Data, San Jose, Calif., May 1975, pp. 65-78 33. Stonebraker, M.R., and E. Wong. Access Control in a Relation- al Data Ease Management System by Query Modification, Proc. ACM 1974 National Conference, San Diego, Calif., Nov. 1974, vol. 1, pp. 180-186 34. Wong, E. , and K. Youssefi. Decomposition — A Strategy for Query Processing, ACM Transactions on Database Systems, vol. 1, no. 3, Sept. 1976, pp. 223-241 35. Ritchie, D.M., and K. Thompson. The UNIX Time-Sharing System, CACM vol. 17, no. 7, July 1974, pp. 365-375 36. Astrahan, M.M. et. al. System R: Relational Approach to Data- base Management , ACM Transactions on Database Systems, vol. 1, no. 2, June 1976, pp. 97-137 37. Gray, J.N., R.A. Lorie, and G.R. Putzolu. Granularity of Locks in a Shared Data Ease, Proc. International Conference on Very Large Data Bases, Framingham, Mass., Sept. 1975, pp. 428-451 38. Gray, J.N., R.A. Lorie, G.R. Putzolu, and I.L. Traiger. Granularity of Locks and Degrees of Consistency in a Shared Data Base, Proc. IFI? TC-2 Working Conference on Modelling in Data Base Management Systems, Jan. 1976, North-Holland 39- Eswaran, K.P., J.N. Gray, R.A. Lorie, and I.L. Traiger. The Notions of Consistency and Predicate Locks in a Data Base System, CACM vol. 19, no. 11, Nov. 1976, pp. 624-633 40. Tremblay, J. P., and P.G. Sorenson. An Introduction to Data Structures with Applications, McGraw-Hill, 1976 64 41 . Gottlieb, D. et. al. A Classification of Compression Methods and Their Usefulness in a Large Data Processing Center, Proc. 4?I?3 1975 National Computer Conference, vol. 44, pp. 453-458 42. Palermo, P.P. A Data Ease Search Problem, IBM Research Re- port: PJ 1072, July 1972, San Jose, Calif. 43. Pecherer, R.M. Efficient Evaluation of Expression in a Rela- tional Algebra, Proc. ACM Pacific Regional Conference, April 1975, pp. 44-49 44. >enko, M.E., E.B. Altman, M.M. Astrahan, and P.L. Fehder. Data Structures and Accessing in Data-Base Systems, IBM Sys- tems Journal, vol. 12, no. 1, 1973 4 5. Tsichritzis, D. A Network Framework for Relation Implementa- tion, Proc. IFIP TC-2 Working Conference on Data Definition Language, Jan. 1975 46. Gotlieb, L.R. Computing Joins of Relations, Proc. ACM 1975 3IGMCD International Conference on Management of Data, San Jose, Calif., 1975. pp. 55-63 47. Blasgen, M.W., and K.P. Eswaran. On the Evaluation of Que- ries in a Relational Data Base System, IBM Research Report: PJ 1745, April 1976, San Jose, Calif. 48. Wirth, N. Algorithms + Data Structures = Programs, Prentice Hall, En^lewood Cliffs, N.J., 1976 49. Chamberlin, D.D., J.N. Gray, and I.L. Traiger. Views, Author- ization, and Locking in a Relational Data Base System, Proc. AFIPS 1975 National Computer Conference, vol. 44, pp. 425-430 50. Czarnik, E. A Primitive Relational Data Base Management Sys- tem, M.Sc. Thesis, Dept. of Computer Science, U. of Toronto, Oct. 1974 51. Griffiths, P., and B.W. Wade. An Authorization Mechanism for a Relational Data Base System, ACM Transactions on Database Systems, vol. 1, no. 3. Sept. 1976, pp. 242-255 52. Eswaran, K.P., and D.D. Chamberlin. Functional Specification of a Subsystem for Data Base Integrity, Proc. International Conference on Very Large Data Bases, Framingham, Mass., Sept. 1975, PP. 48-68 53. Owens, R.C. Evaluation of Access Authorization Characteris- tics of Derived Data Sets, Proc. 1971 ACM 3IGFIDET Workshop on Data Description, Access and Control, pp. 263-278 54. Graham, R.M. Protection in an Information Processing Utility, CACM vol. 11, no. 5, May 1968, pp. 365-369 IBLIOGRAPHIC DATA HEET 1. Report No. UIUCDCS-R-78-913 Title and Subtitle- Relational Database System Implementations — A Survey Author(s) Won Kim Performing Organization Name and Address University of Illinois at Urbana-Champaign Department of Computer Science Urbana, Illinois 61801 . Sponsoring Organization Name and Address National Science Foundation Washington, D. C. >upp!cmt. nrary Notes 3. Recipient's Accession No. 5. Report Date January, 1978 8. Performing Organization Rept. No - UIUCDCS-R-78-913 10. Project/Task/Work Unit No. 11. Contract/Grant No. US-NSF-MCS73-07980 13. Type of Report & Period Covered Technical Report 14. . Abstract s This paper surveys the overall structure and design of some of the prototype relational database systems frequently referenced in the literature, and then examines in some detail some interesting approaches that have been taken in implementing various requisite features for a comprehensive relational database system. I Ke\ tt'ords and Document Analysis. 17a. Descriptors Authorization control Checkpoint and recovery Concurrency control Data consistency Integrity control }uery optimization Relational database Jser view ■ Identifiers Open-Ended Terms '■ ( OS ATI Fie Id /Group Availability Statement Release Unlimited C ,M NT1S-35 ( 10-70; 19. Security Class (This Report) UNCLASSIFIED 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 66 22. Price USCOMM-DC 40329-P7! MAP 3 1Q7ft VSBB9B8XBS