The person charging this material is re- sponsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. To renew call Telephone Center, 333-8400 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN WoTip L161— O-1096 UIUCDCS-R-81-1052 UILU-ENG 81 1721 A Characterization of Globally Consistent Databases and their Correct Access Paths by Yehoshua Sagiv July 1981 Digitized by the Internet Archive in 2013 http://archive.org/details/characterization1052sagi UIUCDCS-R-81-1052 A Characterization of Globally Consistent Databases and their Correct Access Paths Yehoshua Sagiv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 July 1981 This work was supported in part by NSF grant MCS-80-03308. ABSTRACT We propose the representative instance as a representation of the data stored in a database whose relations are not the projections of a universal instance. We characterize the database schemes for which local consistency implies global consistency. (Local consistency means that each relation satisfies its own functional dependencies, and global consistency means that the representative instance satisfies all the functional dependencies.) We show how to compute efficiently projec- tions of the representative instance provided that local consistency implies global consistency. Throughout this work we assume that a cover of the functional dependencies is embodied in the database scheme in the form of keys. KO^ Introduction The universal instance assumption is essential to many papers in design theory for relational databases. As pointed out in [FMU], two different concepts are included in this assumption. The most basic con- cept is the universal relation scheme assumption (also known as the "uniqueness assumption" [Ber]). It asserts that each attribute has a unique role, that is, for any subset of attributes X, there is (at most) one relationship among the attributes of X. This assumption is made explicitly or implicitly in many papers in design theory for relational databases. In particular, it is made in papers dealing with the axiomatization of dependencies, and synthesis and decomposition of rela- tion schemes. The second and more controversial concept is the universal instance assumption , that is, the assumption that the relations of a database are the projections of a single relation over the set of all the attributes. This assumption is needed in order to define lossless joins [ABU]. There are two versions of this assumption. According to the first, this assumption has to be made only in order to determine whether a join is lossless, i.e., it is only a tool that is used when a database is designed and when queries are evaluated [FMU]. The second version (the pure universal instance assumption) states that the relations of a data- base must always be the projections of a universal instance, and null values have to be used in order to satisfy this requirement [HLY,Ko,Li,Ma,Sc]. - 2 - When a universal instance is assumed, users usually formulate queries having in mind the universal instance rather than the actual relations of the database [KU] . If a given query refers to a set of attributes X, then the first step in evaluating this query is to compute the projection of the universal instance onto X. If the pure universal instance assumption is made, it is sufficient to take any lossless join over a set of attributes that contains X. If the relations of the data- base are not the projections of a universal instance, different lossless joins might give different results. In [FMU] this problem is solved by requiring that the join dependency consisting of all the relation schemes be acyclic. In this paper we propose an alternative solution. We believe that our solution reflects the properties of functional dependencies better than [FMU] . In particular, some of the problems left open in [FMU] are handled in our case without explicitly defining maximal objects [MU] . In this paper we assume a universal relation scheme, but not a pure universal instance. Instead we define the representative instance of a database (in Section 3). The representative instance has been used to determine whether the database satisfies a set of functional dependen- cies [Hol,Val]. We believe that the representative instance correctly describes the information stored in the database even when the relations are not the projections of a universal instance. As a first step toward evaluating queries with respect to the representative instance, we address the following problems. (1) Under what conditions the database is globally consistent if each - 3 - relation is locally consistent? That is, under what conditions the representative instance satisfies all the functional dependencies if each relation satisfies its own functional dependencies? (2) How can we compute efficiently projections of the representative instance? Throughout this paper we assume that a cover of the functional dependencies is embodied in the database scheme in the form of keys (as in [Ber]). In Section 3 we define the uniqueness condition , and in Sec- tion 4 we prove that local consistency implies global consistency if and only if the database scheme satisfies the uniqueness condition. In Sec- tion 5 we show how to compute efficiently projections of te representa- tive instance provided that the database scheme satisfies the uniqueness condition. 2_. Preliminaries 2_._1 Basic Definitions We view a relation (cf. [CI]) as a finite table with columns labeled by attributes , and rows, called tuples , that represent mappings from the attributes to their associated domains . Let |i be a tuple of a relation labeled by a set of attributes R. If A e R, then |i(A) is the value of \x in column A; if S C R, then u[S] denotes the values of \i for the attributes in S . We say that r is a relation over a set of attri- butes R if the columns of r are labeled by the attributes of R. - 4 - We use the letters A,B,C,... to denote attributes, and the letters . ..,X,Y,Z to denote sets of attributes. A string of attributes (e.g., ABCD) denotes the set containing these attributes, and the union of two sets X and Y is written XY. A relational database scheme over a set of attributes U is a set of ordered pairs , . . . , such that R. C U and K. is a set of linn 1 — i (explicit) keys of R. . (If X e K. , then XC R ; and no key of R. prop- erly contains another key of R. .) Each R. is a set of attributes label- ing the columns of a relation, and we use it as the name of the relation (i.e., there are no distinct relations over the same set of attributes). We call each R. a relation scheme . The set of FDs (functional dependen- cies) of R. is 1 F = {X -» A | X e K. and A e R. - X} A database is a set of relations r, , . . . ,r over R. , . . . ,R , respec- In In tively, such that each r. satisfies F . That is, a database is the n "current value" of the database scheme. We assume that F = \} F is a i=l cover of all the FDs imposed on the database by the user. In other words, a cover of the FDs is embodied in the database scheme in the form of (explicit) keys (as in [Ber]). 2_.2_ Relations with Null Values In many cases there is a need to represent partial information in the database. If we have a relation over the attributes Manager and - 5 - Department, and Jones is a manager without a department, then the tuple (Jones, 6) is inserted into this relation. The value 6 is a special value, called a null value , and it denotes unknown information. Suppose that there are two managers without a department, e.g., Jones and Smith. There is no reason to assume that they manage the same (unknown) depart- ment. In order to distinguish the null value in the tuple (Jones, 6) from the null value in the tuple (Smith, 6), we will mark each null value with a unique subscript and store the tuples (Jones, 6,) and (Smith, 6«). Null values with distinguishing subscripts are called marked null [Ko,Ma], and will be used exclusively in this paper.. Two null values are equal only if they have the same subscript. We say that tuples \x. and [i agree on column A, written n (A) = u.(A), if either both u (A) and n 9 (A) are not null and equal or both are null and equal. 2.*2 The Chase Process Suppose that a relation r (with null values) is required to satisfy an FD X -> A. It is not necessarily correct to argue that r violates X -*■ A if it has two tuples that agree on X and disagree on A. Instead, (2) we should use X -*■ A to equate symbols of r in the following way. Suppose that r has tuples u. and n_ that agree on all the columns for X but disagree on the column for A. If u. has 6. in column A and |i_ has 6 . in column A, then we can replace all occurrences of 6 . in r with 6. . If u. has a non-null value c in column A and u_ has a null value 6 . in (2) Occasionally, we refer to null and non-null values as symbols. - 6 - that column, then we can replace all occurrences of 6. with the non-null value c . Suppose that the relation r is required to satisfy a set F of FDs . We can apply the FDs of F to r until no more symbols of r can be equated. The relation obtained in this way is called the chase of r with respect to F, written chase (r), and it satisfies an FD X •> A of F F if and only if there is no pair of tuples that agree on X and disagree on A. We say that the relation r sa tisfies F if and only if chase (r) F satisfies F. If r satisfies F, then chase (r) is unique up to renaming F of null values [MMS]. Example 1 : Suppose that F = {A->-C,A->-D, B+C , CD+B } and let r be the following relation. A B c D 1 1 6 1 6 2 6 3 2 1 6 4 1 6 5 6 6 2 2 1 6 7 1 We apply the FD A ■> C to the first and third tuples to replace 6, with 6 . Then all occurrences of 6. are replaced with 6 7 by applying B ■*■ C to the fourth and first tuples. The result is the following relation. A B c D 1 1 6 7 6 2 6 3 2 1 6 4 1 6 5 6 7 2 2 1 6 7 1 - 7 - By applying A + D to the first and third tuples, 6~ is replaced with 2. Now 6- is replaced with 1 by applying CD ♦ B to the first and third tuples. Since no more applications are possible, chase (r) is F A B C D 1 1 5 7 2 6 3 2 1 6 4 1 1 6 7 2 2 1 6 7 1 Clearly, chase (r) (and hence r) satisfies F. If we add the FD C ■*■ D to F F, then chase (r) is no longer unique, since 6~ can be replaced wi either 1 or 2, and C ■> D is not satisfied by chase (r) (and r) . [] r th If r satisfies F, then r also satisfies additional FDs that can be inferred by Armstrong's axioms [Arm] . The closure of a set of attri- butes X with respect to a set of FDs F, written 3L,, is the set of all attributes A such that X ->■ A can be derived from F by Armstrong's axioms. We can compute X^ in linear time [BB]. If F denotes a cover of all the FDs imposed on the database (i.e., the embodied FDs), then we + + usually write X instead of X^. 2.4 Relational Expressions and Extension Joins In this paper we consider relational expressions over the operators project, (natural) join, and union that are denoted by ti, M, and \>, respectively. The operands are the relation schemes R. , . . . ,R . Let a be a set of relations r, , . . . ,r for the relation schemes R. ,...,R , 1* ' n 1' ' n' - 8 - respectively, such that each r satisfies F . The value of E for a, written v (E), is computed by substituting the relations r, ,...,r for a In the relation schemes R. , . . . ,R , and applying the operators according to the usual definitions. (When the join is applied, two tuples are joined on a given column only if they agree in this column.) Two expressions E 1 and E„ are equivalent , written E. = E„, if for all sets a, v (E n ) = v (E„). The expression E_ is contained in E- , written E_ C E.. a 1 a 2 r 2 1' 2—1 if for all sets a, v (E_) C v (E.). An expression E has a unique value a 2. — a l for the current database and, so, by a slight abuse of notation we denote this value by E (rather than v (E), where a is the set of the current relations) . In particular, if u is a tuple in the value of E for the current database, then we write u. e E. m The expression t><| R. is an extension join [Ho2] of R. if for all j=l X 3 1 Kj,, be a database scheme. ABC M BD is an extension join of ABC. Similarly, ABC M BD M CDE is an extension join of ABC, since ABC contains the key B of BD, and ABCD contains the key CD of CDE. ABC M CDE is not an extension join, since ABC does not contain any key of CDE (i.e., ABC does not contain CD which is the only key of CDE). [] (1) This notion of equivalence is called strong equivalence in [ASU] . - 9 - _3 . The Representative Instance The ultimate goal of designing a database scheme has always been a collection of relation schemes R. , . . . ,R that are independent, that is, relation schemes that allow the user to update each relation in the database without having to change the contents of the other relations. Of course, there might be some semantically meaningful constraints (e.g., as in [C2]) that do not allow every possible update. But these constraints should be as limited as possible. We feel that enforcing the universal instance assumption is too restrictive. Clearly, this assumption can always be enforced by using marked nulls [KU,Ma]. How- ever, this can be done only at the expense of applying the chase process to the universal instance whenever updates are performed on the data- base. Furthermore, there is a need to store many null values that do not provide any information. These null values are needed only to satisfy the universal instance assumption. Efficiency is not the only issue. Most relational database systems are not designed to use the chase process. Therefore, there is a need to develop a theory for determining correct access paths when the universal instance assumption is not satisfied. The simplified univer- sal instance assumption of [FMU] is one possible solution. In [FMU] a certain condition is imposed on the database scheme, and it is claimed that under that condition a minimal (in the number of relation schemes) lossless join is a correct access path (even if the relations of the database are not the projections of a universal instance). We will present a different set of conditions and show that queries should be - 10 - evaluated by performing the union of several lossless joins rather than just one lossless join. In this section we will define the representative instance [Hol,Va2] of a database r , . . . ,r . The representative instance is defined for every r , . . . ,r (even if the r 's are not the projections of a single relation). If r. , . . . ,r are the projections of a universal instance r and the database scheme has the lossless join property, then r is also the representative instance. In [Hol,Val] the representative instance is used to determine whether the database satisfies a set of FDs . We believe that the representative instance is also a correct representation of all the information stored in the database, and queries posed about the contents of the database should be answered with respect to the representative instance. Let U be the set of all the attributes. A relation r can be viewed as a relation over U by adding columns for the attributes in U - R. that contain distinct null values. Formally, the augmentation of a relation r over R to a relation over U, written a TJ (r ), is {u u agrees with some tuple of r on R , and has distinct null values (that do not appear in any other tuple) for the attributes of U - R.} Example 3 : Let r be the relation - 11 - a..„ n (r) is the relation ABCD 1 2 c 3 itloi C 4 A B I c 1 D C l s ij c >! 6 2 C 3 6 3 1 *l 6 4 where 6 I {c. ,c ? ,c_,c, } for every i. [] Consider a database r. , . . . ,r over relation schemes R.,...,R , and In In n let r" ™ y a (r ) . If there are no dependencies, then r' is the i=l representative instance of the database r. , . . . ,r . When dependencies are present, the chase process should be applied to r'. Thus, if the only dependencies are those in the set of FDs F, then the representative instance is chase^(r'). The database r. ,...,r satisfies the set of FDs F if the representative instance satisfies F [Hol,Val]. Example 4 ; Consider the database scheme ,,. Note that this data- base scheme is in Boyce-Codd normal form. Suppose that the relation for ABCD is {1112}, the relation for CGDEF is {11111}, the relation for DEFB is {1111}, and the relation for BCF is empty. To obtain the representa- tive instance we have to compute the chase of the following relation. _A_ 1 1 B C D E F G 1 1 1 2 6 i 6 2 6 3 i h 1 1 1 1 1 I 1 S 7 1 1 1 6 8 - 12 - The null value 6,. can be replaced with 1 by applying the FD DEF •> B to the second and third tuples, and then 6„ is replaced with 1 by applying BC ■*■ F to the first and second tuples. No FD can be applied after that, and so the representative instance is 1 B C D E F 1 1 2 h 1 | | 1 1 1 1 1 | 1 1 6 7 1 1 1 | 8 This representative instance (and hence also the database) satisfies the FDs that are embodied in the relation schemes. However, if the relation for BCF were {112} instead of , then the representative instance would violate the FD BC ■* F . [] In the above example, the projection of the representative instance onto BCF contains the tuple 111. It may be argued that the tuple 111 over the attributes BCF does not represent a correct information, since the relation for the relation scheme BCF does not contain this tuple. However, if this argument is accepted, then it follows that two distinct relationships between the attributes B, C, and F are stored in the data- base. One relationship is stored in the relation for the relation scheme BCF, and the other relationship is obtained by the extension join CGDEF M DEFB. But this is contrary to the universal relation scheme assumption that is essential to many works in design theory for rela- tional databases. Without this assumption the axioms for functional and multivalued dependencies, and the various synthesis and decomposition algorithms cannot be used. Therefore, we believe that the representa- tive instance correctly represents the information stored in the - 13 database. 3_. 1_ Computing Total Projections of the Representative Instance In the remainder of this paper we consider a database scheme , . . . , and a corresponding database r.,..,r with a linn In representative instance r. For simplicity's sake, we assume that the relations r.,..,r do not contain null values. (Thus, null values exist 1' n only in the representative instance.) We can always decompose a relation scheme into smaller relation schemes, each having the keys of the origi- nal relation scheme and some of its attributes. Therefore, the only actual restriction implied by our assumption is that null values cannot be stored for the attributes of a key. The user formulates queries having in mind the representative instance r rather then the individual relations r, , . . . ,r . Suppose that 1 n the user is formulating a query that refers to a set of attributes X. The first step in evaluating this query is to compute the projection of r onto X. We assume that if the user is referring to the attributes in X, then he/she is interested only in tuples of r that have non-null values for all the attributes in X. Therefore, the problem addressed in this paper is how to compute the X -total projection of r, i.e., {\i I u is a tuple in it (r) without any null value} A For example, given the database of Example 4, the ACF-total projection of the representative instance is {111}. - 14 - Example 4 illustrates two surprising facts. First, the representa- tive instance does not necessarily satisfies the functional dependencies even if each relation satisfies the FDs imposed by its keys (this fact was originally pointed out in [Hoi]). Second, an X-total projection of the representative instance cannot always be computed by joining several relations of the database and then projecting onto X. In that example, the ABCDF-total projection of the representative instance is {11121}. m However, no expression of the form tl-CM R. ) has as its value the rela- tion {11121} over ABCDF. (Here, R ,...,R. are some of the relation 1 X m schemes.) In the following sections we will characterize those cases in which the problems indicated by Example 4 do not occur. 3^. 2_ The Uniqueness Condition n Consider a tuple u of ]} a. (r ) that has originated from the rela- j=l J tion r . That is, \i is non-null in the columns of R and has distinct nulls in all the other columns. Therefore, a null value in column A e U - R. of u can be replaced during the chase process (either with another null or with a non-null) only if A e R. [BDB]. The problems illustrated in Example 4 are caused by the existence of two different ways that could potentially replace a specific null value. In order to avoid such troublesome situations, we require that for each relation scheme R , there is a unique way to derive R . We will show that the representative instance of the database r , ...,r satisfies F (i.e., the set of all the embodied FDs) if and only if each R is uniquely derived. - 15 - We will define "uniqueness" after the following informal example. Example 5 : Consider the algorithm of Fig. 1. This algorithm com- putes X assuming that a cover of F is embodied in the relation schemes. Suppose the database scheme is ,,. The closure of ABC can be computed (i.e., derived) in two different ways. Either we use BD and its key B to obtain ABCD from ABC in line 3 of Fig. 1, or we use CD and its key C to obtain ABCD from ABC. If the database scheme is ,,, then there is only one way to compute (ABC) . BD is used to add D, and CE is used to add E. [] We say that R satisfies the uniqueness condition if there is no R. (j * i) such that for some set of attributes X and an attribute A (1) XA C (R ± )p_ F , and j (2) X e K and A e R - X. Example 6 : For the first database scheme of Example 5, we have F. = {A-HB,A»C}, F - {B-»D}, and F - {C-*D}. ABC does not satisfy the uniqueness condition, since CD C (ABC) where C is a key of CD and D F-F 3 is another attribute of CD. In the second database scheme of Example 5, F = {A->>B,A>C}, F = {B+D}, and F = {C+E}. The relation scheme ABC satisfies the uniqueness condition. In proof, (ABC) = ABCD and only 3 C appears also in CE and it is a key of CE; (ABC) = ABCE and only B appears also in BD and it is a key of BD. [] - 16 - begin (1) Y := X; (2) while there is a key V e K. such that V C Y and R J Y do J — J — — (3) Y := YR. end Figure 1 We now show that the uniqueness condition implies that R. is uniquely derived. Suppose that R. satisfies the uniqueness condition, and let R. C R. (j * i) . Let X = (R.)t, ^ uR.. We claim that X is a l—i l F-F . j J key of R.. In proof, if X does not contain a key of R , then it is J J impossible to have R C^ R. ; and if it properly contains a key, then the uniqueness condition is violated. If B e X, we say that R. uses B in R (sometimes we say that R. uses X in R. ). If B e R - X, we say that R adds B to R . . We also say that R. adds all its attributes to R. and l l i uses none of them. The following propositions show that if R. satisfies the uniqueness condition, then R is uniquely derived in the sense that each A e R is added by a unique R . Proposition 1 : If R satisfies the uniqueness condition and R (: R , then every A e R . is either used by R in R or added by R, to R (but not both) . Proof: Immediate from the definitions. [] - 17 - Proposition 2 : If R satisfies the uniqueness condition, then each A e R. is added by a unique R. C_ R . . Proof : Clearly, each A e R. is added by at least one R.. Suppose that for some A e R. there are distinct R and R such that both R and i q P q R add A to R. . Since R adds A, there is a key X e K such that p i q q X C (R. ) T , ^ and A e R - X. Similarly, there is a key Y e K such that — i F-F q p + ** + Y C_ (R . ) and A e R - Y. Consider a computation of R by the algo- P P rithm of Fig. 1. Case 1_. q # i and R is never used in line 3 of the algorithm. Thus, XA C_ R C_ (R . ) and, so, R. does not satisfy the uniqueness q " q condition (a contradiction). Case 2. p * i and R is never used in line 3. This is similar to P Case 1 . Case 3. R is used in line 3 after R (this includes the case q P where p = i, i.e., R is not used at all in line 3). Let Y" be the value of Y just before R is used in line 3. Since Y' C (R_, )„ _ and R J q — i F-F p q + has already been used (i.e., R C Y'), it follows that R C (R.,)^ „ . p — p — i F _ F + + ^ But A e R and, so, A e (R. )_, „ . Since X C (R. )_, _, , X e K and p i F — r — 1 r ~r Q q q A e R - X, the uniqueness condition is violated by R (a contradic- tion) . Case 4. R p is use d in i ine 3 after R This is similar to Case 3. Since at least one of q and p is different from i, no other case is possible. [] - 18 - 4. Main Theorem Let , . . . , be a database scheme, and suppose that each relation r. satisfies F and does not have any null values. In this section we will prove that the representative instance satisfies F if and only if each R. satisfies the uniqueness condition. First, we will prove two lemmas that are needed for the "if" part (which is more diffi- cult to prove). So, we assume that each R. satisfies the uniqueness n condition. Let r = ]} ql (r ) and consider the computation of chase (r). i-1 U i F Suppose that u is a tuple of r that has originated from r . We say that R. adds A to ii if R. adds A to R. . Similarly, R. uses A in u if R. uses A in R. . By the uniqueness condition and Propositions 1 and 2, for each A e R. , there are unique R. and key X e K, such that R. adds A to u i J J J using X (clealry, A e R. -X). Lemma 1 : Suppose that all the R.'s satisfy the uniqueness condi- tion. Let R C R^ , Y e K , and A e R - Y. If R either does not add A P - i P P P to R. or does not use Y in R. , then there is B e Y such that R adds B i i P _+ to R. . l Proof : Suppose that R does not add A to R . By Proposition 1, there is a key V e K used by R in R J such that A e V. If Y C V, then P P i ~ YA C V. Since A i Y, it follows that the key V properly contains the key Y of R . But this is impossible and, so, Y - V * and every B e Y - V is added by R to R+ P i Now suppose that R does not use Y. If R does not use any key P P (i.e., p=I), then R adds every B e Y. If R uses another key V e K , x ' r ' p p p - 19 - then Y (2 V (otherwise, one key properly contains another key). There- fore, there is B e Y - V that is added by R . [] We assume that chase (r) is computed by repeatedly applying the F following rule to tuples of r. FD-rule . If tuples u 1 and |i_ agree on some key X e K but disagree on some B e R - X, then for all A e R - X: (1) If |i. (A) * ^u(A) and \i. (A) is null, then replace all occurrences of u-, (A) in r with n_(A). (2) If |j. (A) * H ? (A) an d H« (A) is null, then replace all occurrences of u„(A) in r with n_(A). The above FD-rule is sufficient for computing chase (r), since a cover F of F is embodied in the database scheme. Lemma 2 ; Suppose that all the R 's satisfy the uniqueness condi- tion. Let r" be an intermediate relation in the computation of chase (r), and let \x^ t \i~ e f . (A) If u, (A) = |a«(A) = 6 for some column A, then (Al) the same R. adds A to both u- and u_, and the same key X e K is used by R. in both u. and \i~, and (A2) ^[Rj] = ^[Rj]. (B) If \i. (A) is non-null, then \i [R ] e r , where R is the unique rela- tion scheme that adds A to \i . Proof ; Induction on the number of applications of the FD-rule that are used to obtain r" from r. - 20 - Basis . Zero applications. That is, r" = r and, so, all null values in r' are distinct. Thus, part A is vacuously true. As for part B, let u. be a tuple of r' that has originated from r . If u (A) is non-null, then A e R. and R. adds A to li, . But u. [R . 1 e r. and, so, 1 i H. ^1 l i J i part B is true for r" . Induction . Suppose that r" is obtained from r by n-1 > applica- tions, and r" is obtained from r" by a single application. In particu- lar, suppose that r" Is obtained from r" by equating tuples v. and v~ using key Y of R (i.e., the FD-rule is applied to v. and v that agree p 1 z on Y in r"). By the inductive hypothesis, the lemma is true for r", and we have to show that it is true also for r' . Part A. Let u. and \x~ be tuples of r" such that u 1 (A) = u«(A) = 6. for some column A. If |a, (A) = u 2 (A) also in r", then by the inductive hypothesis, conditions (A1)-(A2) of the lemma are satisfied in r" and, hence, also in r" (if for some column B, we have u, (B) = u 2 (B) in r", then |i, (B) = |i ? (B) also in r'). Thus, we have to consider only |i., \i~ and an A such that (1) Ul (A) = v^A) in r", (2) u 2 (A) = v 2 (A) in r", and (3) v.(A) * v-(A) in r" and A e R - Y. (Otherwise, either \xAA) - ^(A) in r" or [^(A) * u 2 (A) in r' . ) Case 1. R adds A using Y to both v, and v . Since v. (A) ■ a, (A) — p 1 z. 11 in r" (and u, (A) is null in r", since it is null in r' ) , the inductive hypothesis implies that R adds A using Y also to u- , and - 21 - (i) ^[Rp] = V;L [R p ] in r". Similarly, R adds A using Y also to \i~ and (ii) n 2 [R p ] = v 2 [R p ] in r". It remains to be shown that u. [R ] ■ u^[R ] in r'. Suppose that v.. (B) is non-null in r" for some B e R - Y. Since R adds B to v,, part B of P P 1 the inductive hypothesis implies that v 1 [R ] e r in r". By (i), u [R ] e r in r" and, so, u 1 (A) is non-null in r* , contrary to assump- tion. Therefore, v, (B) is null in r" for all B e R - Y, and similarly for v~. Since v.. and v„ are equated using key Y of R , it follows that V 1 [ V = V 2 [ V ln **" By (1) and (li)> ^1 [ V = ^2 [R p ] in r '* Case 2. R does not add A to v, . We will show that this case can- p 1 not actually occur. Let r, be the relation from which v. has ori- ginated. By a lemma of [BDB], R C_ R, (since v. is equated with v~ using R ). Lemma 1 implies that there is B e Y that is added by R to v., because A is not added by R and A e R - Y (by (3)). Since v, and 1 y p p N,/x '' 1 v„ are equated using key Y of R , it follows that v 1 (B) = v«(B) in r". If v. (B) is null in R", then by part A of the inductive hypothesis, v. (A) = v~(A) in r", contrary to (3). If v. (B) is non-null in r", then by part B of the inductive hypothesis, v. [R ] e r in r" and, by (1), (j.. (A) is non-null in r' , contrary to assumption. Thus, v.. (B) is neither null nor non-null in r" and, consequently, this case cannot occur. Case 3. R does not add A to v~. This is similar to Case 2. p 2 Part B. We have to show that if \i~ (A) is non-null in r* , then u.[R.] e r., where R adds A to \L. . If u. (A) is also non-null in r", - 22 - then by the inductive hypothesis, u. [R ] e r in r" and, so, u.. [R ] is the same in r" and f (since all the columns of u 1 [R ] in r" are non- null and cannot be changed). Thus, we have to consider only the follow- ing case. (a) v. (A) is non-null in r", (b) v-(A) is null in r", (c) u, (A) = v„(A) in r", and (d) v. and v are equated using a key Y of R and A e R - Y. 12 P P Claim 1 : v, [R ] e r in r" (and, hence, also in r'). Proof : Case 1_. R adds A to v, . Since v. (A) is non-null in r" (by (a)), the inductive hypothesis implies that v. [R ] e r in r". There- fore, v. [R ] is non-null in r" and, hence, it is the same in r" and r" . Case 2_. R does not add A to v. . By (d) and Lemma 1, there is B e Y that is added by R to v.. Suppose that v. (B) is null. Since v. (B) = v„(B) in r" (by (d)), part A of the inductive hypothesis implies that v 1 (A) ■ v„(A) in r". This is contrary to (a) and (b) and, there- fore, v 1 (B) is non-null in r". By part B of the inductive hypothesis, v [R ] e r in r". [] 1 p p Claim 2: R uses Y in v.. p 2 Proof: Suppose that R does not use Y in v„. By Lemma 1, there is P ^ B e Y such that R adds B to v^. By (d), v 2 (B) = v ] (B) in r" and, by Claim 1, v»(B) is non-null in r". By the inductive hypothesis, v„ [R e r in r". But this is impossible, since v (A) is null in 2 p p 2 - 23 - [] By Claim 2 and (d), R adds A to v« using Y. Therefore, part A of the inductive hypothesis, (b) and (c) imply that \x. [R ] ■ v~[R ] in r" (and R adds A to la, using Y). If v_(C) is non-null in r" for some pi 2 C e R - Y, then by the inductive hypothesis, v„(A) is non-null, con- trary to (b) . Hence, v (C) is null in r" for all C e R - Y and, by 1 P (d), v [R ] = v 9 [R ] in r'. Thus, \x [R ] = v ? [R ] = v. [R ] e r in 1 p 2 p lpZplpp r'. [] Corollary 1 : Suppose that during the computation of chase (r), the r null in column A of \i. is replaced with a non-null as a result of equat- ing v. and v« using key Y of R . Then R adds A to \i. using Y. Proof ; Let v., v„, and |i 1 be the same as in part B of the proof of Lemma 3. We have shown that R adds A to \i. using Y. [] Theorem 1: Let , . . . , be a database scheme. Suppose linn that each relation r satisfies F and does not have any nulls. Then the representative Instance satisfies F if and only if each R satisfies the uniqueness condition. Proof : If . Suppose that some FD X -»• A e F is violated in the representative instance. That is, there are tuples u. and n_ in the representative instance such that (1) n x [X] = u 2 [X], and (2) n 1 (A) and n«(A) are distinct non-nulls. - 24 Claim 1 ; n,[R.] e r. and n 2 [R.] e r . Proof : We will prove that u,[R.] e r (proving that ^[R.] e r is similar). Case 1. R. adds A to \x. . By Lemma 2, \i [R ] e r . Case 2_. R. does not add A to \i. . By Lemma 1, there is B e X such that B is added by R. to u. . Suppose that u. (B) is null. By Lemma 2, |i, (A) = |i_(A), since u- (B) = |a~(B). This contradiction implies that \x. (B) is non-null and, by Lemma 2, u, [R. ] e r . [] By Claim 1 and (1), \i. (A) = |i_(A), since r satisfies F . But this is contrary to (2) and, so, the "if" part is proved. Only If . Suppose that some R. does not satisfy the uniqueness con- dition. That is, there is an R. (j * i) , a key X e K , and an attribute A e R. - X such that XA C (R J )„ „ . We have to show that there are — i F-F . J relations r.,...,r (without nulls) that satisfy F.,...,F , respec- tively, such that the representative instance does not satisfy F. We choose r. , . . . ,r as follows. For k t j, each r, has exactly one tuple that maps all the attributes of R, to 1. The relation r has exactly one tuple that maps A to 2 and each attribute in R - A to 1. Let n J r = y a (r ), and let u, be the tuple of r that has originated from r . ..UK.cC k. k=l We apply the chase process to r using only the FDs of F - F. and tuples u, such that k * j. Let r' be the resulting relation. By a Lemma of [BDB], ^.(B) ■ 1 for all B e (R ) in r'. Therefore, tuples |i and i i F- Fj i \i of r' violate X -*■ A. Clearly, if we have a violation of X ■> A in r', - 25 - then this violation exists also in the representative instance and, so, the proof is complete. [] 5_. Computing Total Projections of the Representative Instance Suppose that , . . . , is a database scheme such that linn each R. satisfies the uniqueness condition, and let X be a set of attri- butes. In this section we will show how to construct in polynomial time an expression E whose value is the X-total projection of the representa- tive instance. The expression E is of the form ]}n (E ), where each E. j X j j is an extension join. Among all the expressions of this form whose value is the X-total projection of the representative instance, the m expression E is minimal in both the number of extension joins and the number of join operators. Let r = y. a (r ), and u e chase (r) be a tuple that has originated i=1 U I F from r . We define P = {A I u(A) is non-null} Consider the set S = {R. I there is an A e P such that R. adds A to u} By Lemma 2, P = R ...R , where S = {R, ,...,R. }. Let R. and R be J l J m J l J m J p J q distinct members of S. By Corollary 1 and Lemma 2, the columns of u for the attributes added by R become non-null as a result of a single J P application of R. , and similarly for R. . Therefore, the columns of J p J q both R and R. cannot become non-null simultaneously (because R. and J P J q P R. add distinct attributes). Consequently, let R. ,R. ,...,R. be an J q J l h J m - 26 - ordering of the elements of S such that if p < q, then u[R ] becomes P non-null before u[R. 1. Note that i, = i. v l m m Lemma 3 ; b<| R, , then by a lemma of [BDB], k=l J k k=l J k there is a tuple v in the representative instance such that v[P] ■ |i. Therefore, we have the following corollary. Corollary 2 : If all the R 's satisfy the uniqueness condition, then the X-total projection of the representative instance is given by the expression yn Y (E.), where each E is an extension join over a set of j X j j attributes containing X. The expression E of Corollary 2 might have redundant subexpres- sions. We now show how to minimize it. Clearly, there is an extension - 27 - join of R over a set of attributes containing X only if XCR. . J l p J l Furthermore, if M R. (p < m) is also an extension join of R. over a k=l J k m p J l set of attributes containing X, then it v (P<\ R such that k=l J k (1) XCR . ..R , and J l J m (2) no proper subsequence of R. , ...,R. is an extension join of R. J l J m X l that satisfies (1). Lemma 4 ; If XCR,, and R. , . . . ,R satisfy the uniqueness condition, then the minimal extension join of R with respect to X is unique and can be found in linear time. Proof : The algorithm of Fig. 2 computes a set S of relation schemes that must be Included in any extension join of R. over a set of attri- butes containing X. The idea is that S must contain any R that adds an attribute A e X to R and, recursively, if R, e S, then S includes also every R. that adds an attribute of R, . Now consider an ordering R. ,...,R. of the elements of S such that if k < m, then R, is used j l J P + before R in line 3 of Fig. 1 during the computation of R . . (Note that m i 3 X = iO P Claim 1 : t>; (2) Y := X; (3) while there is an R. such that R. t S and R. adds some A e Y to R. do J J J i — begin (4) S := S \} {R }; (5) Y := YR. J end end Figure 2 P Therefore, if B e Y, then B e V and, so, (>, . . . , be a database scheme such that 11' n n each R satisfies the uniqueness condition. Suppose that each relation r satisfies F and does not have any nulls. An expression whose value is the X-total projection of the representative instance can be found in 2 0(n ) time, where n is the space needed to write down the database scheme. Proof : The algorithm for constructing the expression is given in Fig. 3. For a given value of T, define E(T) = {E, | E, is the minimal extension join of R, with respect to X, and R, e T} Let F(T) be the union of all extension joins in E(T) projected onto X. - 30 - begin (1) W := 4>; (2) Let T = {R ± | X C R*}; (3) for every R. e T do begin (4) let E be the minimal extension join of R with respect to X; (5) if there is no R. e T (j * i) that appears in E. then (6) add E ± to W (7) else remove R. from T i end; (8) let W = {E.,...,E }; l m m (9) return the expression \} il.(E.) end Figure 3 Using Lemma 5, we can easily prove by induction that for all values of T obtained in line 3, F(T) = F(T Q ), where T is the initial value of T. By Corollary 2 and Lemma 4, the value of F(T n ) is equal to the X-total projection of the representative instance. The expression returned by the algorithm of Fig. 3 is F(T ) , where T is the final value of T. Thus, the algorithm returns an expression whose value is the X-total projection of the representative instance. By Lemma 4 and [BB], the algorithm of Fig. 9 can be implemented to 2 run In 0(n ) time. - 31 - Corollary 3 : Among all the expressions of the form tyit (E ), whose J X j value Is the X-total projection of the representative instance, the expression returned by the algorithm of Fig. 3 is minimal in both the number of subexpressions of the form it„(E ) and the total number of join operators. Proof : The expression returned in line 9 is nonredundant in the sense that if any subexpression of the form it (E.) is removed from the union, then we can find relations r.,...,r for which the value of the In resulting expression is not the X-total projection of the representative instance. In proof, for each R that appears in E , the relation r has exactly one tuple with 1 in every column; all the other relations are empty. Clearly, the value of E. is a relation with exactly one tuple that has 1 in every column. All the other extension joins have at least one empty relation and, so, their value is the empty relation. Thus, removing "^(E ) changes the value of the expression from a relation with one tuple to the empty relation. By a theorem of [SY], it follows that the expression returned in line 9 is minimal in the number of subexpres- sions of the form it^(E ). Since each E has no equivalent expression with fewer join operators (by Lemma 4), the expression returned in line 9 is also minimal in the total number of join operators (by a result of [SY]). [] Example 7 : Suppose that the attributes are P(pro ject) , D( department), M(manager), L(location), and A(assistant), and the data- base scheme is ,. - 32 - Intuitively, the database scheme describes an application in which each project belongs to several departments and is carried out in several locations, but a department can have only one project in each location. In each department participating in a project, there is a manager responsible for that project. Each manager has an assistant in each location. These relation schemes satisfy the uniqueness condition. Suppose we want to compute the total projection of the representa- tive instance onto LM. After Step (1) of the algorithm, S = {LDP,LMA}. The minimal extension join of LDP with respect to LM is LDP M PDM, and the minimal extension join of LMA with respect to LM is LMA. In Step (2) of the algorithm S is not changed, and so an expression for the LM- total projection of the representative instance is Tt LM (LDP M PDM) \> ti lm (LMA). The result of the above expression is all tuples (l,m) such that either manager m has an assistant in location 1 or manager m manages some project in location 1. Considering the fact that there might be partial information (e.g., a manager with a project in a location where he does not have an assistant, or a manager with an assistant in a loca- tion where he does not have a project), the correct answer is indeed given by the above expression. [] - 33 - 6. Conclusions We have proposed the representative Instance as a measure of the data stored In the database, and we have characterized, in terms of the uniqueness condition, the database schemes for which the representative instance always satisfies the functional dependencies. The uniqueness condition can be viewed as an extension of Boyce-Codd normal form, since it removes inter-relation anomalies in the sense that each relation can be updated independently of the contents of the other relations without violating global consistency. (It is an extension of Boyce-Codd normal form, since a relation scheme that satisfies the uniqueness condition is also in Boyce-Codd normal form.) This condition is much less restrictive than [BG] , and we believe that many practical applications satisfy it. We have also shown how to compute efficiently total projections of the representative instance if the uniqueness condition is satisfied. In [Sa] we have dealt with database schemes that do not satisfy the uniqueness condition. We have shown that the representative instance can be guaranteed to satisfy the functional dependencies if the modified foreign-key constraint is imposed on the database. The modified foreign-key constraint is less restrictive and semantically more mean- ingful than imposing the existence of a universal instance. Under this constraint, total projections of the representative instance can be com- puted by performing the union of several extension joins. - 34 - References [ABU] Aho, A. V. , C. Beeri, and J. D. Ullman, "The Theory of Joins in Relational Databases," ACM Trans , on Database Systems , Vol. 4, No. 3 (Sept. 1979), pp. 297-314. [ASU] Aho, A. V. , Y. Sagiv, and J. D. Ullman, "Equivalences Among Rela- tional Expressions," SIAM J. Computing , Vol. 8, No. 2 (May 1979), pp. 218-246. [Arm] Armstrong, W. W., "Dependency Structures of Database Relation- ships," Proc . IFIP 74 , North Holland, 1974, pp. 580-583. [BB] Beeri, C. and P. A. Bernstein, "Computational Problems Related to the Design of Normal Form Relational Schemas," ACM Trans , on Database Systems , Vol. 4, No. 1 (March 1979), pp. 30-59. [Ber] Bernstein, P. A., "Synthesizing Third Normal Form Relations from Functional Dependencies," ACM Trans , on Database Systems , Vol. 1, No. 4 (Dec. 1976), pp. 277-298. [BG] Bernstein, P. A., and N. Goodman, "What Does Boyce-Codd Normal Form Do?," Proc . Int . Conf. on Very Large Data Bases , 1980, pp. 245-259. [BDB] Biskup, J. , U. Dayal, and P. A. Bernstein, "Synthesizing Indepen- dent Database Schemas," Proc . ACM-SIGMOD Int . Conf . on Management of Data , 1979, pp. 143-151. [CI] Codd, E. F., "A Relational Model for Large Shared Data Banks," Comm . ACM , Vol. 13, No. 6 (June 1970), pp. 377-387. [C2] Codd, E. F., "Extending the Database Relational Model to Capture More Meaning," ACM Trans , on Database Systems , Vol. 4, No. 4 (Dec. 1979), pp. 397-434. [FMU] Fagin, R., A. 0. Mendelzon, and J. D. Ullman, "A Simplified Universal Relation Assumption and Its Properties," IBM research report, RJ2900, Nov., 1980. [Hoi] Honeyman, P., "Testing Satisfaction of Functional Dependencies," Proc . XP1 Conf. , Stonybrook, N. Y., June, 1980. [Ho2] Honeyman, P. , "Extension Joins," Proc . Int . Conf . on Very Large Data Bases , 1980, pp. 239-244. [HLY] Honeyman, P., R. E. Ladner, and M. Yannakakis, "Testing the Universal Instance Assumption," Information Processing Letters , Vol. 10, No. 1 (Feb. 1980), pp. 14-19. [Ko] Korth, H. F., "A Proposal for the SYSTEM/U Query Language," unpub- lished memorandum, Stanford University, Stanford, CA, 1980. [KU] Korth, H. F. and J. D. Ullman, "System/U: a Database System Based on the Universal Relation Assumption," Proc . XP1 Conf . , Sto- nybrook, N. Y., June, 1980. [Li] Lien, Y. E., "Multivalued Dependencies With Null Values in Rela- tional Databases," Proc . Int . Conf . on Very Large Data Bases , 1979. [Ma] Maier, D. , "Discarding the Universal Instance Assumption: Prelim- inary Results," Proc . XP1 Conf ., Stonybrook, N. Y., June, 1980. [MMS] Maier D. , A. 0. Mendelzon, and Y. Sagiv, "Testing Implications of Data Dependencies," ACM Trans , on Database Systems , Vol. 4, No. 4 (Dec. 1979), pp. 455-469. [MU] Maier, D. , and J. D. Ullman, "Maximal Objects and the Semantics of - 35 - Universal Relation Databases," Tech. Rept. 80-016, Dept. of Com- puter Science, State University of New York at Stony Brook, Stony Brook, NY, Nov. 1980. [Sa] Sagiv, Y. , "Can We Use the Universal Instance Assumption Without Using Nulls?" Proc . ACM-SIGMOD Int . Conf . on Management of Data , Ann Arbor, April 1981, pp. 108-120. [SY] Sagiv, Y., and M. Yannakakis, "Equivalences Among Relational Expressions with the Union and Difference Operators," »J. ACM , Vol. 27, No. 4 (Oct. 1980), pp. 633-655. [Sc] Sciore, E., "The Universal Instance and Database Design," TR 271, Dept. of Elec Eng. and Comp. Sci., Princeton University, Prince- ton, N. J. , June, 1980. [Val] Vassiliou, Y. , "Functional Dependencies and Incomplete Informa- tion," Proc . Int . Conf . on Very Large Data Bases , 1980, pp. 260-269. [Va2] Vassiliou, Y. , "A Formal Treatment of Imperfect Information in Database Management," Technical Report CSRG-123, University of Toronto, Nov. , 1980. BIBLIOGRAPHIC DATA SHEET 1. Report No. ijtitcdcs-r-si-ios? 3. Recipient's Accession No. 4. Title and Subtitle A Characterization of Globally Consistent Databases and their Correct Access Paths 5. Report Date July 1981 7. Author(s) Yehoshua Sagiv 8. Performing Organization Rept. No. Performing Organization Name and Address Department of Computer Science University of Illinois 1304 W. Springfield Urbana, IL 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. MCS- 80-03308 12. Sponsoring Organization Name and Address National Science Foundation Washington, D.C. 13. Type of Report & Period Covered 14. 15. Supplementary Notes 16. Abstracts \j e propose the representative instance as a representation of the data stored in a database whose relations are not the projections of a universal instance. We characterize the database schemes for which local consistency implies global consis- tency. (Local consistency means that each relation satisfies its own functional depen- dencies, and global consistency means that the representative instance satisfies all the functional dependencies.) We show how to compute efficiently projections of the representative instance provided that local consistency implies global consistency. Throughout this work we assume that a cover of the functional dependencies is embodied in the database scheme in the form of keys. 17. Key Words and Document Analysis. 17a. Descriptors 17b. Identifiers/Open-Ended Terms 17c. COSAT1 Field/Group 18. Availability Statement 19. Security Class (This Report) UNCLASSIFIED 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages -22. 22. Price FORM NTIS-35 (10-70) USCOMM-OC 40329-P7I