wijsen-2005-1-ACMTODS.pdf

download wijsen-2005-1-ACMTODS.pdf

of 47

Transcript of wijsen-2005-1-ACMTODS.pdf

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    1/47

    Database Repairing Using Updates

    JEF WIJSEN

    Universite de Mons-Hainaut

    Repairing a database means bringing the database in accordance with a given set of integrityconstraints by applying some minimal change. If a database can be repaired in more than one way,then the consistent answer to a query is defined as the intersection of the query answers on allrepaired versions of the database.

    Earlier approaches have confined the repair work to deletions and insertions of entire tuples.We propose a theoretical framework that also covers updates as a repair primitive. Update-basedrepairing is interesting in that it allows rectifying an error within a tuple without deleting thetuple, thereby preserving consistent values in the tuple. Another novel idea is the construct ofnucleus: a single database that yields consistent answers to a class of queries, without the need forquery rewriting. We show the construction of nuclei for full dependencies and conjunctive queries.

    Consistent query answering and constructing nuclei is generally intractable under update-basedrepairing. Nevertheless, we also show some tractable cases of practical interest.

    Categories and Subject Descriptors: H.2.4 [Database Management]: SystemsRelationaldatabases;Query processing; H.2.3 [Database Management]: LanguagesQuery languages

    General Terms: Theory

    Additional Key Words and Phrases: Consistent query answering, database repairing

    1. GENERAL PROBLEM DESCRIPTION

    Database textbooks generally explain that integrity constraints are used forcapturing the set of all legal databases and hence should be satisfied at alltimes. Nevertheless, many real-life databases contain data that is known or

    suspected to be inconsistent. Inconsistency may be caused, among other rea-sons, by data integration and underspecified constraints. For example, the ruleNo employee has more than one contact address, gives rise to an error iftwo databases to be integrated store different addresses for the same em-ployee. A FIRSTNAME CHAR(20) declaration in SQL does not protect us frominputting illegal first names like Louis14. When later on we specify thatfirst names cannot contain numbers, the database may already turn out to beinconsistent.

    Authors address: Universite de Mons-Hainaut, Avenue du Champs de Mars 6, B-7000 Mons,Belgium; email: [email protected] to make digital or hard copies of part or all of this work for personal or classroom use isgranted without fee provided that copies are not made or distributed for profit or direct commercial

    advantage and that copies show this notice on the first page or initial screen of a display alongwith the full citation. Copyrights for components of this work owned by others than ACM must behonored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,to redistribute to lists, or to use any component of this work in other works requires prior specificpermission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected] ACM 0362-5915/05/0900-0722 $5.00

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005, Pages 722768.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    2/47

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    3/47

    724 Jef Wijsen

    Fig. 2. F1 is a fix; the corresponding uprepairs are obtained from T1 by substituting for x anyvalue distinct from sedan. Likewise forF2F5.

    This is an example of a contradiction-generating dependency (cgd) and canbe stated as: xy z(R(x, luxury,z ) R(x, y , bottom)).

    Note that because of 1, we can substitute luxury for the variable y in 1without changing the meaning of the constraints.

    The relationCARSfalsifies1, because it does not contain the tuple sedan,

    standard, medium. Adding that tuple results in a violation of constraint 2. Indeletion/insertion-based repairing, we can delete the first tuple ofCARS; alter-natively, we can delete the second tuple and insert sedan, standard, medium.In update-based repairing, erroneous values are replaced by variables in so-called fixes. Five fixes F1F5 ofCARS are shown in Figure 2. The fix F1, forexample, assumes that the value sedan in the first tuple of CARS is mis-taken. Each fix is homomorphic to the original relation CARS: one can findsubstitutions for the variables x , y ,z that map each fix into CARS. Moreover,each fix is homomorphic to a consistent relation. In fact, F1 is homomorphictoT1, and to every relation obtained from T1 by substituting a constant for x ;if we choose a constant distinct from sedan, then the relation obtained willbe consistent and called an uprepair. The prefix up distinguishes uprepairsfrom other repair notions in the literature; it may be read as update or it

    can refer to the fact that passing from fixes to uprepairs corresponds to mov-ing up in a homomorphism lattice (see later). Significantly, uprepairs, unlikefixes, do not contain variables. To make F5 consistent, we first add the tuplesedan, standard, medium because of1, and then we identify z and mediumbecause of 2. In this way, we obtain the uprepair T5. Intuitively, the repairwork that modifiesCARSintoT5updates lower into medium.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    4/47

    Database Repairing Using Updates 725

    In general, the number of uprepairs can be large or even infinite. The con-sistent answer to a query is defined as the intersection of the query answerson all uprepairs. In the running example, T

    1gives rise to as many uprepairs as

    there are possible Model-values distinct from sedan. The question Give allversions yields luxury and standard. The question Give all lower pricedversions yields the empty answer, because the uprepair T5 contains no lowerpriced cars.

    The definition of consistent answer is conceptually clean but, since the num-ber of uprepairs can be infinite, it is not clear whether consistent answers canbe effectively computed and if so, whether they can be computed in tractabletime. In this article, all complexity results refer to data complexity: complexityis in terms of the cardinality of the input relation.

    A well-studied strategy for computing consistent query answers, known asquery rewriting,consists in pushing integrity constraints into queries so as toobtain new queries that are guaranteed to return the consistent answer on any,

    possibly inconsistent, database. Since query rewriting is independent of theunderlying database, it gives us tractability provided that the modified queriesexecute in polynomial time.

    In this article, we explore a different route. We show that for full dependen-cies and conjunctive queries, all uprepairs can be summarized into a singletableau G such that the consistent answer to any conjunctive query can beobtained by executing the query on G. The tableau G, called nucleus, will behomomorphic to all uprepairs and will be maximal in the sense that any othertableau that is homomorphic to all uprepairs, is also homomorphic to G. Anucleus for the running example is the following tableauG :

    G Model Version PriceRangex luxury ux standard v

    sedan standard w

    The queries Give all versions,and Give all lower priced versions on G, indeedyield the same consistent answers as before. Note incidentally the differencewith deletion/insertion-based repairing, where the repair obtained by deletingthe first tuple ofCARScontains no luxury cars.

    Rewriting conjunctive queries takes time, and more significantly, usuallyresults in nonconjunctive queries that are more time-consuming. The use ofnuclei eliminates the need for query rewriting. A nucleus could be computedonce and then used to compute consistent answers to any conjunctive query.The drawback is that the nucleus has to be recomputed when the underlyingdatabase is modified. Of course, for classes of queries and constraints for whichconsistent query answering is tractable, the use of nuclei is only meaningful

    if they can be computed in polynomial time. In this article, we identify caseswhere the nucleus construction needs no more time than the time needed toanswer a single query obtained from rewriting.

    The rest of this article is organized as follows. Section 2 recalls some databaseconstructs and introduces the notion of uprepair. In general, the set of upre-pairs can be infinite. Section 3 shows how to effectively query infinitely many

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    5/47

    726 Jef Wijsen

    uprepairs in the case of full dependencies and conjunctive queries. Section 4shows the existence of nuclei and provides an effective way to compute them.The results in Section 24 are general in the sense that they do not depend ona particular way of fixing, but simply assume that some set of fixes is given.Section 5 settles the construct of fix and then explores an upper bound for thetime complexity of consistent query answering. Section 6 provides results on thenucleus size and the time complexity for computing it. Two classes of dependen-cies are investigated in detail: key dependencies (Section 7) and contradiction-generating dependencies (Section 8). For both classes, we clarify the frontierbetween tractability and intractability. Section 9 shows the problems arisingwhen we add inequalities or disjunctions. Finally, Section 10 contains a com-parison with existing work.

    2. FROM SUBCONSISTENT TABLEAUX TO CONSISTENT RELATIONS

    2.1 PreliminariesWe recall the definition of tableau. To simplify the notation, we will assume aunirelational database containing a single relation of arityn.

    Definition 2.1. We assume two disjoint, infinite setsdom and var ofcon-stantsand variables. Asymbolis either a constant or a variable.

    We assume an arity n. Atupleis a sequence p1, . . . , pn of symbols. Theithcoordinate of tuplet is denotedt (i) (i {1,2, . . . , n}).

    A tableau is a finite set of tuples. A tuple or tableau without variables iscalledground. A ground tableau is also called arelation. IfFis a tableau, thengrd(F) := {t F | t is ground}. The set of all tableaux (of fixed arity n) isdenoted T.

    We will use the lettersI, J,Kto denote relations;F,G , H,Tdenote tableaux;I denotes a set of relations; F, G denote sets of tableaux. We now recall thedefinition of substitution and its extension to tableaux.

    Definition 2.2. Asubstitution is a mapping from variables to symbols,extended to be the identity on constants. Substitutions naturally extend totuples and tableaux: first, (p1, . . . , pn) = (p1), . . . , (pn), and second, if Fis a tableau, then (F) = {(t) | t F}.

    We writeid for the identity function on symbols, and write idp = q , where pandq are not two distinct constants, for a substitution that identifies pand qand that is the identity otherwise. That is, if pis a variable and q a constant,then idp = q = {p/q}; similarly, mutatis mutandis, if p is a constant and q avariable. If pandq are variables, thenidp = q can be either {p/q} or {q/p}; thechoice will be immaterial in the technical treatment.

    The following definition introduces an order on T based on tableauhomomorphism.

    Definition 2.3. Let F, G be two tableaux (i.e. F,G T). Ahomomorphismfrom F to G is a substitution for the variables in F such that (F) G. Ifsuch a homomorphism from FtoG exists, thenFis said to behomomorphicto

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    6/47

    Database Repairing Using Updates 727

    G, denoted G F. Two tableaux are said to beequivalent, denoted F G, ifand only ifF G and G F. We writeG Fif and only ifG FandG F.The relation is an equivalence relation (Lemma 2.4); we write [F]

    for the

    equivalence class of that contains F. A tableau is reducedif it is equivalentto no tableau of smaller cardinality.

    The relation on tableaux naturally extends to sets of tableaux: two sets Fand G of tableaux are said to be equivalent, denoted F G, if and only if forevery tableau in either set, there exists an equivalent tableau in the other set.

    It is well-known [Abiteboul et al. 1995], and significant for the results to follow,that if a tableau Fis not reduced, then there exists a substitution such that(F) F. Also, homomorphisms are known to be intimately related to logicalimplication. Each tableau G can be associated to a first-order logic sentence,denotedwff(G), defined aswff(G) := (

    tG R (t)). Then,G Fif and only if

    wff(G) logically implieswff(F). Consequently,G Fif and only ifwff(G) andwff(F) are logically equivalent. The following properties are straightforward:

    LEMMA 2.4. Let F,G T.

    (1) If F G, then F G .

    (2) If F G, thengrd(F) grd(G).

    (3) T, is a quasi-order.1

    (4) is an equivalence relation onT.

    PROOF. Easy.

    2.2 Uprepairs

    Consistency of relations is defined relative to a set of integrity constraints. Atableau will be calledsubconsistent if it is homomorphic to a consistent relation.

    Our approach to repairing an inconsistent relationIconsists in finding subcon-sistent fixes that are one-one homomorphic toIand homomorphic to consistentuprepairs. The actual relationship between Iand its fixes will be specified inSection 5. Up to that point, we will assume that a set F of subconsistent fixesis given and focus on the problem of querying the uprepairs generated by F.

    Definition 2.5. We assume arityn. Aconstraint is a closed first-order for-mulausing a single predicate symbol of arityn. IfJis a relation, then J |= denotes that Jsatisfies(according to standard first-order logic semantics).

    Let be a set of constraints. We write J |= if and only if J |= for each . The relationJis called consistent (with respect to)ifandonlyifJ |= ;otherwiseJis inconsistent.Theset is calledsatisfiable if it allows a consistentrelation. The setis called coherent if it allows a nonempty consistent relation.Two sets of constraints are equivalent if any relation that is consistent withrespect to either set, is also consistent with respect to the other set.

    A tableau F issubconsistent(with respect to ), denotedF , if and onlyifFis homomorphic to a consistent relation: J Ffor some consistent relationJ.

    1A quasi-order satisfies reflexivity and transitivity.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    7/47

    728 Jef Wijsen

    LetFbe a subconsistent tableau. An uprepairofFand is a minimal (withrespect to ) consistent relation J satisfying J F. In general, there willbe more than one uprepair of F and ; the set of all uprepairs of F and isdenotedF. The latter operator naturally extends to a setFof subconsistenttableaux:

    F :=

    FF

    F .

    If J F, then we also say that J isgeneratedby F and .

    Note that F need not be finite. For example, for the subconsistent tableau F1of Figure 2, F1

    {1,1,2,3} contains every relation obtained from T1 by replacingx by a constant distinct from sedan.

    Equivalent sets of subconsistent tableaux generate the same uprepairs.

    LEMMA 2.6. Let be a set of constraints. Let F, G be sets of subconsistenttableaux such thatF G. Then,F = G.

    PROOF. Assume J F. We can assume the existence of F F such thatJ Fand for each relation J, J J F implies J |= . Since F G,we can assume the existence of G G such that F G, hence F G andG F. By transitivity, J G . Assume a relation J such that J J G . Bytransitivity, J J F, hence J |= . It is correct to conclude J G. ItfollowsF G. By symmetry,G F.

    3. QUERYING THE UPREPAIRS

    In this section, we study an initial problem underlying consistent query answer-

    ing: given a finite set F of subconsistent tableaux and a query , compute theintersection of the query answers on each J F, i.e. compute

    JF(J).

    3.1 Tableau Queries and Full Dependencies

    We use the tableau formalism to express conjunctive queries [Abiteboul et al.1995]. The answer to a tableau query on a set of relations is defined as theintersection of the query answers on each relation of the set.

    Definition 3.1. Atableau queryis a pair (B,h) where B is a tableau (calledbody) andh is a tuple (called summaryor head) such that every variable in halso occurs in B; B and h need not have the same arity. The class of tableauqueries is denotedCQ (acronym for Conjunctive Queries).

    Let = (B,h) be a tableau query, and Fa tableau of the same arity as B.

    The answer to on input F, denoted (F), is the tableau defined as (F) :={(h) | is a substitution and(B) F}. Obviously, if F is ground, then so is(F).

    A tableau query = (B,h) is called Boolean ifhcontains no variables; wewill often choose h = 1, where 1 dom. Intuitively, the answer {1}can bethought of astrue, and the answer {} asfalse.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    8/47

    Database Repairing Using Updates 729

    Fig. 3.

    For a set I of relations and a tableau query, we define:

    (I) :=II

    (I) .

    Certain results in this article are limited to full dependencies, a restricted butinteresting class of database constraints [Abiteboul et al. 1995]. Unlike theseminal paper by Beeri and Vardi [1984], our full dependencies need not betyped and can contain constants. All constraints of our running example (seeFigure 1) are full dependencies.

    Satisfaction of full dependencies by relations is borrowed from first-orderlogic semantics. We also define what it means for a tableau to satisfy a setof full dependencies. We are not aware of existing work introducing the sameconcept.

    Definition 3.2. Afull dependency is either a full tuple-generating depen-dency(ftgd) or afull equality-generating dependency(fegd).

    An ftgd takes the form of a tableau query (B,h) where B and h have thesame arity. The ftgd = (B,h) is satisfied by a tableau F, denoted F |= , ifand only ifF (F) F.

    An fegd is of the form (B, p= q ) where Bis a tableau and p, q are symbolssuch that every variable in {p, q} also occurs in B. The fegd = (B, p = q) is

    satisfied by a tableauF, denotedF |= , if and only if for every substitution , if(B) F, then(p), (q) are not two distinct constants and F id(p) = (q)(F).

    Consistency (see Definition 2.5) carries over from relations to tableaux: atableauFis calledconsistentwith respect to a set of full dependencies if andonly ifF |= .

    IfP is a tableau, a tableau query, an ftgd, or an fegd, then the active symboldomainofP , denotedasd(P), is the set of symbols occurring in P .

    Satisfaction of full dependencies by tableaux needs special attention, as illus-trated by the tableaux F and G, and the dependencies 4 and 2 in Figure 3.Since F idx = y (F) F 2(F), Fsatisfies both 4 and 2. Note that inverifying whetherFsatisfies4, one must not consider Fas a relation by inter-pretingx and y as distinct constants. Since G idx = y (G) andG G 2(G),G satisfies neither4nor 2.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    9/47

    730 Jef Wijsen

    Clearly, if a relation I is consistent with respect to a set of fegds, then sois every subset of I. This is no longer true for tableaux, since G F, F |=

    4, but G |=

    4. Although this behavior may be somewhat counterintuitive

    at first sight, we can give two arguments in favor of it. First, our definition ofsatisfaction ensures that equivalenttableaux satisfy the same full dependencies(see Theorem 3.7), which is obviously a nice property. Note that F {z,z }:the tuple x, y is redundant in F. In first-order logic terms, xy z(R(x, y) R(z,z)) is logically equivalent to z(R(z,z )). Second, this behavior is coherentwith querying. Consider the Boolean tableau query:

    1 2v v

    1.

    Intuitively, since 4expresses xy (R(x, y ) x = y ), every nonempty tableauHsatisfying4necessarily contains a tuple tsuch that t(1) = t(2), hence(H) =

    {1}. Since G is nonempty but (G) = {}, we conclude that G must falsify 4.On the other hand, (F) = {1}. Note incidentally that no tableau query candifferentiate betweenFand {z,z }: foreverytableau query, (F) ({z,z }).

    We now prove three lemmas that are needed in the main theorem to follow,which expresses that equivalent tableaux satisfy the same full dependencies.The operator denotes function composition.

    LEMMA 3.3. Let F be a tableau, anda substitution for the variables in F .Let = (B,h)be a tableau query. Then ((F)) ((F)).

    PROOF. Let t ((F)). We can assume s (F) such that (s) = t. Thereexists a substitutionsuch that (B) Fand (h) = s. Hence, (B) (F).Hence, (h) = t ((F)).

    LEMMA 3.4. Let F , G be tableaux and a tableau query. If F G, then(F) (G)and F (F) G (G).

    PROOF. AssumeF G . We can assume a substitution such that (G) F.Itis easyto see ((G)) (F). By Lemma 3.3, it follows ((G)) (F). Hence,(F) (G). From (G) F and ((G)) (F), it follows (G (G)) F (F). Hence, F (F) G (G).

    COROLLARY 3.5. Let F, G be tableaux anda tableau query. If F G, then(F) (G)and F (F) G (G).

    LEMMA 3.6. Let F , G be tableaux, and p, q symbols. Let be a substi-tution such that (p), (q) are not two distinct constants. If (G) F, thenid(p) = (q)(F) idp = q (G).

    PROOF. Since (p) and (q) are not two distinct constants, p and q arenot two distinct constants either. From the premise (G) F, it followsid(p) = (q)((G)) id(p) = (q)(F). It is easy to verify that id(p) = (q) =id(p) = (q) idp = q . Hence,id(p) = (q) (idp = q (G)) id(p) = (q)(F). Hence,id(p) = (q)(F) idp = q (G).

    THEOREM 3.7. Equivalent tableaux satisfy the same full dependencies.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    10/47

    Database Repairing Using Updates 731

    PROOF. Let F, G be two tableaux such that F G: F G and G F.AssumeF |= for some ftgd :F F(F). By Lemma 3.4,F(F) G (G).By transitivity, G G (G). It follows G |= .

    Assume F |= for some fegd =(B, p = q). Since F G , we can assume asubstitutionsuch that (G) F. Let be any substitution with (B) G.Hence, (B) F. Since F |= , (p) and (q) are not two distinctconstants, andF id(p) = (q)(F). Clearly,(p)and(q) are not two distinctconstants either. By Lemma 3.6, id(p) = (q)(F) id(p) = (q)(G). SinceG F, it follows G id(p) = (q)(G) by transitivity. Since id(p) = (q)(G) G isobvious, it followsG |= .

    3.2 The Chase

    The chase was originally introduced as a tool for deciding logical implication;here we use it for repairing databases. We generalize some results of Beeriand Vardi [1984] to tableaux that can contain constants and need not be typed,

    replacing equality of tableaux by equivalence. A consequence is that a chasecan terminate unsuccessfully when it tries to identify two distinct constants.New results include decidability of subconsistency (Theorem 3.14), which willbe needed for determining fixes.

    Definition 3.8. We introduce an artificial top element, denoted , to thequasi-order T, . Let F = andG be tableaux, and a set of full dependen-cies. We write F G ifG can be obtained from Fby a single application ofone of the followingchase rules:

    (1) If = (B,h) is an ftgd of , then F F (F).

    (2) Let (B, p = q) be an fegd of, and a substitution such that (B) F.If(p) and (q) are two distinct constants, then F ; otherwise, F id(p) = (q)(F).

    A chase of F by is a maximal (with respect to length) sequence F =F0,F1, . . . ,Fn of tableaux such that for every i {1, . . . , n}, Fi1 Fi andFi1 = Fi.

    Requiring that chases be maximal tacitly assumes that chases are finite,which is expressed by Lemma 3.9. This lemma and some of the lemmas tofollow imitate some known results (see for example Lemmas 8.4.3 and 8.4.4 in

    Abiteboul et al. [1995]). Since our semantics are slightly different (distinct vari-ables should not be treated as distinguished constants in general), the resultsare stated explicitly.

    LEMMA 3.9. Let F = be a tableau and a set of full dependencies.

    (1) If G is a tableau in a chase of F by, then G F .(2) Each chase of F by is finite.

    (3) If G = is the last element of a chase of F by , then G |= .

    (4) If G = is the last element of a chase of F by , and is a substitutionmapping distinct variables to distinct constants not occurring elsewhere,

    then(G) |= .

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    11/47

    732 Jef Wijsen

    PROOF. Easy.

    The following two lemmas are needed for the main theorems to follow, express-

    ing that the chase is Church-Rosser up to and that subconsistency is decidableforfull dependencies. Lemma 3.10 implies that all tableaux in a successful chaseofF by are homomorphic to each consistent tableau HsatisfyingH F.

    LEMMA 3.10. Let F , H be tableaux, both distinct from , and a set of fulldependencies. Let G = be a tableau in a chase of F by. If H F and H |= ,

    then HG .

    PROOF. Assume H F and H |= . Let F = F0,F1, . . . ,Fn = G = be the initial sequence of a chase of F by . We prove by induction that foreachk {0, . . . , n}, H Fk . Base: H F = F0 is given. Step: The inductionhypothesis is H Fk1. The tableau Fk can be obtained by an ftgd or an fegd:

    (1) Fk is obtained from Fk1by an application of =(B,h) of ; that is, Fk =

    Fk1 (Fk1). Since H Fk1, H (H) Fk1 (Fk1) by Lemma 3.4.Since H |= , H H (H). It follows H Fk .

    (2) Fk is obtained from Fk1 by an application of = (B, p = q) of. AsG = , Fk = . We can assume a substitution such that (B) Fk1,(p) and (q) are not distinct constants, and Fk = id(p) = (q)(Fk1). SinceH Fk1, we can assume a substitution such that (Fk1) H. It follows((B)) H, and, since H |= , (p) and (q) are not two distinctconstants and H id(p) = (q)(H). By Lemma 3.6, id(p) = (q)(H) id(p) = (q)(Fk1). By transitivity, H Fk .

    It is correct to conclude HG .

    LEMMA 3.11. Let F , H be tableaux, both distinct from , and a set of full

    dependencies. If

    is the last element of a chase of F by and H F, thenH |= .

    PROOF. Let G be the last but one element in the chase ofF by that endswith : G . Assume H F. If H G , then H |= by Lemma 3.10. Nextassume H G, hence we can assume a substitution such that (G) H.Since G , contains an fegd = (B, p = q) such that there exists asubstitution satisfying(B) G and (p) = (q) are distinct constants.Then (B) H, but (p) = (q) are distinct constants. We concludeH |= , hence H|= .

    THEOREM 3.12. Let F = be a tableau and a set of full dependencies. Iftwo chases of F by end with G1and G2 respectively, then G1 G 2.

    PROOF

    . This is the Church-Rosser property. Assume G2 =

    and G1 =

    .SinceG 2 F andG 2 |= by Lemma 3.9, G 2 G 1 by Lemma 3.10. Similarly,G1 G 2. It followsG 1 G 2.

    Next assume G2 = . Then G1 = , or else G1 F and G1 |= byLemma 3.9, but G 1 |= by Lemma 3.11, a contradiction.

    Theorem 3.12 motivates the following definition.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    12/47

    Database Repairing Using Updates 733

    Definition 3.13. Let F = be a tableau and a set of full dependencies.We write chase(F, ) for the equivalence class [G] if the last element of achase ofFby is G . The singleton []

    is also written .

    Given a set of full dependencies, it is decidable whether a given tableau F issubconsistent.

    THEOREM 3.14. Let be a set of full dependencies and F = a tableau.Then, F is subconsistent (i.e. F ) if and only ifchase(F, ) = .

    PROOF. The only-if part is an immediate corollary of Lemma 3.11. The ifpartfollows from Lemma 3.9.

    Since the empty tableau is homomorphic to any relation, is satisfiable ifand only if the empty tableau is subconsistent. Likewise, since the tableau{x1,x2, . . . ,xn}, wherex1,x2, . . . ,xn are distinct constants, is homomorphic toeach nonempty relation, is coherent ( allows a nonempty relation) if and

    only if{x1,x2, . . . ,xn} is subconsistent. So we obtain the following result:COROLLARY 3.15. Assume arity n and let x1,x2, . . . ,xn be distinct constants.

    Let be a set of full dependencies. Then,

    (1) is satisfiable if and only ifchase({}, ) = ;

    (2) is coherent if and only ifchase({x1,x2, . . . ,xn}, ) = .

    Finally, the chase preserves the order on tableaux.

    THEOREM 3.16. Let F , G be tableaux, both distinct from , and a set offull dependencies. If F G, F chase(F, ), and G chase(G, ), thenF G .

    PROOF. AssumeF G . The proof is trivial ifchase(F, ) = . Next assume

    that chase(F, ) is distinct from

    . Assume without loss of generality thatF = is the last element in a chase of F by . By Lemma 3.9, F |= andF F, hence F G . Thenchase(G, )= , or else F |= by Lemma 3.11,a contradiction. Let G = be the last element in a chase of G by . ByLemma 3.10, F G .

    COROLLARY 3.17. Let F , G be tableaux, both distinct from , and a set offull dependencies. If F G, thenchase(F, ) = chase(G, ).

    3.3 Replacing Uprepairs by Chase Results

    After our digression on the chase, we return to the problem raised at the be-ginning of Section 3: given a satisfiable set of full dependencies, a query ,and a finite set F of subconsistent tableaux, compute (F). Recall that(F)denotes the intersection of the answers to on all (possibly infinitely many)

    uprepairs generated by F and . In the results to follow, F can be any set ofsubconsistent tableaux. Later on in Section 5, Fwill be restricted to be a set offixes. Theorem 3.22 implies that in order to compute(F), it suffices to querythe tableaux obtained by chasing each F Fby .

    For example, let F = {F1, . . . ,F5}, the set of tableaux shown in Figure 2, andlet = {1, 1, 2, 3}, the full dependencies in Figure 1. A chase ofFi by ends

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    13/47

    734 Jef Wijsen

    withTi (1 i 5). Then, Theorem 3.22 will tell us that for any tableau query,(F) is equal togrd((T1)) grd((T5)).

    The following shorthand notation will be used for convenience.

    Definition 3.18. LetGbe a tableau and F a set of tableaux. We write F

    G

    if and only ifF G for each F F.

    LEMMA 3.19. Let be a satisfiable set of full dependencies and F a subcon-

    sistent tableau. For each tableau H, F

    H if and only ifchase(F, )

    H.

    PROOF. Note that since F , chase(F, ) = by Theorem 3.14. If part.Let G = be the last element in a chase of F by . Assume G H andJ F. It suffices to show J H. Since J F and J |= , J G by

    Lemma 3.10. By transitivity, J H. Only-if part. Assume chase(F, )

    H.

    Let G be the last element in a chase of F by . From chase(F, )

    H, it

    follows G H. Let Vbe the set of variables occurring in G. Let : V

    inj

    dom be a substitution that maps distinct variables to distinct constants notin asd(G) asd(H). By Lemma 3.9, (G) |= . Assume (G) H. Then, wecan assume a substitution such that (H) (G), hence 1 (H) G . ItfollowsG H, a contradiction. We conclude by contradiction(G) H. From(G) G and G F(Lemma 3.9), (G) Fby transitivity. We can assume aminimal (with respect to ) relation Jsatisfying(G) J F and J |= .

    Then J F and J H. Hence, F

    H.

    Since all tableaux in chase(F, ) are equivalent, in order to verify chase(F,

    )

    Hin the preceding lemma (and the following theorem), it suffices to chaseFby and check whether His homomorphic to the final tableau in the chase.

    THEOREM 3.20. Let be a satisfiable set of full dependencies andF a finite

    set of subconsistent tableaux. For each tableau H, F

    H if and only if for

    each F F,chase(F, )

    H.

    PROOF. This follows immediately from Lemma 3.19.

    LEMMA 3.21. Let be a tableau query. Let be a satisfiable set of fulldependencies. Let F be a subconsistent tableau. For each G chase(F, ),(F) = grd((G)).

    PROOF. Let = (B,h). Since F ,chase(F, ) = by Theorem 3.14. LetG chase(F, ).

    To prove the inclusion grd((G)) (F), assume t grd((G)). Let J F. Since J F and J |= , J G by Lemma 3.10. By Lemma 3.4, (J)

    (G). Since tis a ground tuple in (G), it follows t (J). SinceJ is an arbitraryelement ofF, it followst (F).

    To prove the inclusion (F) grd((G)), let t (F). Let be a sub-stitution for the symbols in h such that (h) = t, extended to be the identity

    otherwise. Obviously, t (F) impliesF

    (B).By Lemma 3.19, G (B).Since(h) is ground, (h) = t grd((G)).

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    14/47

    Database Repairing Using Updates 735

    THEOREM 3.22. Let be a tableau query. Let be a satisfiable set of fulldependencies and F a finite set of subconsistent tableaux. Let G be a set of

    tableaux such that G FF

    chase(F, ). Then, (F) =GG

    grd((G)).

    PROOF. This follows immediately from Lemma 3.21.

    In practice, the set G in the preceding theorem will be computed by collectingthe terminal tableaux that result from chasing each tableau ofF by .

    4. CONSTRUCTING THE NUCLEUS

    Given a satisfiable set of full dependencies, a tableau query , and a setFof subconsistent tableaux, Theorem 3.22 gives us an effective algorithm tocompute(F): chase each tableau ofF by , query the last tableau of eachchase by, and return the ground tuples common to all query answers. We nowshow thatF has a nucleus: that there exists asingletableauG such that for

    everytableau query,(F) = grd((G)).

    4.1 Nucleus Definition

    Anucleus ofaset I of relations is defined relative to a class of queries; intuitively,it is a single tableau that can replace I for the purpose of query answering.

    Definition 4.1. Let I be a (possibly infinite) set of relations. Let Q be asubclass ofCQ (Q CQ). The tableauFiscalleda Q-nucleus ofI ifandonlyifforevery query Q, grd((F)) = (I). The termCQ-nucleus is often abbreviatedasnucleusCQcan be omitted.

    In the preceding definition, the function grd() serves to eliminate from (F)tuples that contain variables. The following theorem implies that a nucleusmust be unique up to .

    THEOREM 4.2. Let I be a set of relations. If G 1 and G2 are CQ-nuclei of I,then G1 G 2.

    PROOF. Let G1and G2be CQ-nuclei ofI. Consider the Boolean tableau query = (G1, 1). Since (G1) = {1}, we have (I) = {1}, hence (G2) = {1}.Hence, there exists a substitutionthat maps the body ofinto G2: (G1) G 2.It followsG 2 G 1. By the same reasoning, G 1 G 2. Hence,G 1 G 2.

    4.2 Infimum

    We define the greatest lower bound (or infimum) of a set of tableaux.

    Definition 4.3. LetF be a finite set of tableaux. An infimum(with respectto ) ofF is a tableauG satisfying:

    (1) F G , and

    (2) for every tableauG ,F

    G impliesG G .

    It is easyto see that all infimums of a set F of tableaux must be mutually equiv-alent. We assume there is an arbitrary selection rule that picks a representativeof this -equivalence class and denote it inf F.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    15/47

    736 Jef Wijsen

    There is an effective procedure to construct an infimum of any finite set oftableaux; the construction is borrowed from infimums of sets of clauses [Plotkin1969]. Practically, to construct an infimum of{F

    1,F

    2}, every tuple ofF

    1is com-

    bined columnwise with every tuple of F2. The combination of two symbols pandq, denoted(p, q), is a variable unless pandq are the same constant. Thecombination of two times the same constant (say a) yields that same constant:(a, a) = a.

    Definition 4.4. An inf-mapping is a partial one-one function : (dom

    var) (dom var) inj dom var such that (p, q) = p if p and q are the

    same constant, and(p, q) is a new distinct variable otherwise.Let F, G be tableaux and an inf-mapping defined over asd(F) asd(G).

    The inf-mappingnaturally extends to pairs of tuples and pairs of tableaux asfollows: firstly, ift Fandt G , then

    (t, t ) := (t(1), t (1)), . . . , (t(n), t (n));

    secondly,(F,G) := {(t, t) | t F, t G }.

    For example, for the following tableaux F1and F2,

    F1 1 2 3a b z

    b c z

    F2 1 2 3a y y

    y c y,

    we obtain:

    (F1,F2) 1 2 3(a, a) (b, y) (z, y )(b, a) (c, y ) (z, y )(a, y ) (b, c) (z, y )(b, y) (c, c) (z, y )

    1 2 3a u v

    w1 w2 v

    w3 w4 v

    u c v

    G 1 2 3a u v

    u c v

    It is easy to verify thatG is homomorphic to bothF1and F2. It takes more effortto show that every tableau homomorphic to bothF1andF2is also homomorphictoG . We show that the above construction is correct in general.

    THEOREM 4.5. Let F1, F2 be tableaux. If is an inf-mapping defined overasd(F1) asd(F2), then(F1,F2)is an infimum of{F1,F2}.

    PROOF. Let G = (F1,F2).Let and be substitutions for the symbols in therange of such that for all symbols (p, q) asd(F1) asd(F2),((p, q)) = pand ((p, q)) = q. The mappings and are the identity on constants because(p, q) = a dom implies p = q = a; they are well-defined because(p1, q1) =(p2, q2) implies p1 = p2 and q1 = q2. Clearly, (G) = F1 and (G) = F2. It

    followsF1 G and F2 G , hence {F1,F2}

    G .

    Assume a tableauHhomomorphic to bothF1andF2, {F1, F2}

    H; it sufficesto showG H. We can assume substitutionsand such that(H) F1and(H) F2. Let be a total function with domain asd(H) defined by (p) =((p), (p)). Note that is the identity on constants. Let t H. It can beeasily verified that (t) = ((t), (t)). From (t) F1 and (t) F2, it follows(t) G . Sincet is an arbitrary tuple ofH,(H) G . Hence, G H.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    16/47

    Database Repairing Using Updates 737

    The following theorem expresses that the infimum of a set of tableaux preservesconsistency.

    THEOREM 4.6. Let F1, F2be tableaux and a set of full dependencies. Let Gbe an infimum of {F1,F2}. If F1 |= and F2 |= , then G |= .

    PROOF. Assume F1 |= and F2 |= . Let =(B,h) be an ftgd in . SinceF1 |= , F1 F1 (F1). From F1 G, it follows F1 (F1) G (G) by

    Lemma 3.4. Hence, F1 G (G). Likewise, F2 G (G). Hence, {F1,F2}

    G (G). It follows G G (G), henceG |= .Next, let = (B, p = q) be an fegd in . Let be an inf-mapping defined over

    asd(F1) asd(F2). By Theorem 4.5, G := (F1,F2) is an infimum of{F1,F2}.

    Let and be substitutions for the symbols in the range of such that for allsymbols (m, o) asd(F1) asd(F2),((m, o)) = mand ((m, o)) = o. Clearly,(G) = F1 and (G

    ) = F2. Let be any substitution with (B) G. It fol-

    lows (B) F1 and (B) F2. To ease the notation, define p := (p)

    and q

    := (q). Since F1 |= , it follows (p

    ), (q

    ) are not two distinct con-stants; likewise, (p), (q) are not two distinct constants. Since and arethe identity on constants, p, q cannot be two distinct constants either. LetF1 := id(p) = (q)(F1) and F

    2 := id(p) = (q)(F2). Since F1 |= and F2 |= ,

    F1 F1and F2 F

    2. Let H := (F

    1,F

    2). Let be a substitution for the vari-

    ables in the range of defined by ((x, y )) = (id(p) = (q)(x), id(p) = (q)(y)).Since (p) = (q) is now easily verifiable, it follows = idp = q . We show(idp = q (G

    )) H. To this extent, let r idp = q (G). Hence, there exist

    t F1 and s F2 such that r = idp = q ((t,s)). Since t F1 and s F2,(id(p) = (q)(t), id(p) = (q)(s)) H. From (id(p) = (q)(t), id(p) = (q )(s)) =((t,s)) = idp = q ((t,s)) = (r), it follows (r) H. Since r is an ar-bitrary tuple of idp = q (G), (idp = q (G )) H, hence H idp = q (G). ByTheorem 4.5, H is an infimum of{F1,F

    2}. Since F1 F

    1 and F2 F

    2, H is

    an infimum of{F1,F2}. It follows G

    H idp = q (G

    ), hence G

    |= . SinceG G ,G |= by Theorem 3.7. It is correct to concludeG |= .

    COROLLARY 4.7. Let a set of full dependencies. Every infimum of a finiteset of consistent tableaux is itself consistent.

    4.3 Commutative Diagram

    The construction of nuclei relies on Theorem 4.8, which states that the followingcommutative diagram is correct:

    THEOREM 4.8. Let F1,F2 be tableaux and =(B,h)a tableau query. Let Gbe an infimum of{F1,F2}, and H an infimum of{(F1), (F2)}. Then(G) H.

    PROOF. Obviously, asd((F1)) asd(F1) asd() and asd((F2)) asd(F2) asd(). Let be an inf-mapping defined over (asd(F1) asd())

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    17/47

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    18/47

    Database Repairing Using Updates 739

    5. FIXES AND CONSISTENT QUERY ANSWERING

    In the preceding lemmas and theorems, we start from any finite set F of subcon-

    sistent tableaux. We will now apply these results in situations where F resultsfrom fixing a relation I. We also provide upper bounds for the complexity offix checking and consistent query answering.

    5.1 Fixes

    Intuitively, a fix of Iis a subconsistent tableau obtained from I by replacingerroneous values by distinct variables. The formalization relies on homomor-phisms that do not identify distinct tuples.

    Definition 5.1. Let F,G be tableaux. Aone-one homomorphismfrom F toGis a substitution for the variables in Fsuch that(F) G and identifiesno two distinct tuples of Ffor all t , t F, t = t implies(t) = (t). If sucha one-one homomorphism from F to G exists, then F is said to be one-one

    homomorphicto G , denotedG F. We write F G if and only if F G andG F. We writeG Fif and only ifG F andG F.

    We illustrate the difference between and by our running example (seeFigures 1 and 2). The tableau F1 F2 is subconsistent and homomorphic toCARS: CARS F1 F2 (use the substitution {x/sedan,z/upper}). However,CARS F1 F2 because every homomorphism from F1 F2 to CARS mustidentify the first and the third tuple ofF1 F2. We require fixes to be one-onehomomorphic to the relation that has to be repaired, in order to avoid doublerepairing of the same tuple.

    F1 F2 Model Version PriceRangex luxury upper

    sedan standard lower

    sedan luxury z

    The following properties are straightforward.

    LEMMA 5.2. Let F,G T.

    (1) If F G, then F G. If F G, then F G .

    (2) T, is a quasi-order.

    (3) is an equivalence relation onT.

    PROOF. Easy.

    Unlike earlier work [Wijsen 2002], we disallow multiple occurrences of thesame variable in a fix, in order to exclude some unnatural uprepairs illustratedby Figure 4. The relation EMP stores employee salaries, and the fegd 5 ex-

    presses that no employee has more than one salary. Clearly, EMP |= 5. We haveEMP F and Fsubconsistent. Note incidentally that F is not homomorphicto the tableau obtained from F by substituting 195 for the second occurrenceof y . The relation Jis the uprepair generated by F and 5. This uprepair iscounterintuitive, however, as there seems to be no reason to assume that Anssalary is mistaken inEMP. Intuitively, the fact that Ed and An earn the same

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    19/47

    740 Jef Wijsen

    Fig. 4.

    salary in EMP is just a coincident not to be encoded in fixes, as is done in Fthrough the double occurrence of y .

    Definition 5.3. A tuple or tableau without multiple occurrences of the samevariable is calledlinear.

    Let be a satisfiable set of constraints and Ia relation. Afix ofI and isa linear tableau Fsatisfying:

    (1) I Fand Fis subconsistent;

    (2) Maximality: for every linear tableau G, if I G F, then G is not

    subconsistent.

    LetF be the set of fixes ofIand . We define:

    I := F,

    the set of uprepairs generated by the fixes ofIand. IfJ I, then we alsosay that Jis an uprepair generated by I and . If is a tableau query, then(I) is called theconsistent query answerto on input I(relative to ).

    Given a linear tableau F, it is generally sufficient to know the entries of Fup to a renaming of variables; if this is the case, then every variable can bereplaced by a placeholder without loss of information. So each occurrence of in a tableau stands for a distinct variable.

    Intuitively, the use ofin I

    indicates that in repairing a relation I, wefirst go down the homomorphism lattice to find fixes, then go up to find upre-pairs. Recall that (I) is the intersection of the answers to on each uprepairgenerated by I and . If is a satisfiable set of constraints, then the emptytableau is subconsistent, so the set of fixes is nonempty. Since the cardinality ofeach fix ofI is bounded by |I|, there can be at most finitely many nonequivalentfixes. This does not mean, however, that determining fixes is feasible in general.Obviously, the singleton tableau {x1,x2, . . . ,xn} is subconsistent (with respectto ) if and only if has a nonempty model ( is coherent). Since the latterproperty is generally undecidable, there is no algorithm to determine whether agiven tableauFis subconsistent. When we limit ourselves to full dependencies,we obtain the following result.

    COROLLARY

    5.4. For every relation I and satisfiable set

    of full dependen-

    cies, there exists a computable CQ-nucleus of I.

    PROOF. Let F be the set of all fixes of I and , hence I = F. Sincethe number of nonequivalent fixes is finite, we can compute a finite set F suchthatF F. By Lemma 2.6, F =F. Since I =F andF is finite, thedesired result follows from Theorem 4.10.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    20/47

    Database Repairing Using Updates 741

    The introductory example of Section 1 can illustrate the nucleus computation.The tableaux F1F5are five fixes ofCARSand = {1, 1, 2, 3}, and all otherfixes are equivalent to one of F

    1F

    5. The chases of these fixes result in T

    1T

    5.

    The infimumG of{T1, . . . ,T5}, which is shown in Section 1, is a CQ-nucleus ofI. Note incidentally that this nucleus is not linear.

    5.2 Fix Checking

    Let Ldenote the set of linear tableaux, and R the set of relations (all of fixedarityn). Let be a satisfiable set of constraints. Fix checking is the complexityof the set:

    FC() := {(I,F) R L | Fis a fix ofIand}.

    We will show that fix checking is in P for full dependencies. We need the con-struct of transformation, which allows modifying tableaux by specifying thenew symbol to be put at each entry. Transformations differ from substitutions

    in that two occurrences of the same symbol in a tableau need not be transformedinto the same symbol. Also, constants may be transformed into variables.

    Definition 5.5. Let F be a tableau of arity n. Each element of F {1,2, . . . , n} is called an entry of F (or F-entry). A transformation of F is atotal function

    : F {1, 2, . . . , n} dom var

    A transformation ofFis applied to Fin a natural way as follows. For eacht F:

    (t) := (t, 1), (t, 2), . . . , (t, n).

    Finally,

    (F) := {(t) | t F}.A special transformation of a relationIresults from listing the entries ofIthathave to be replaced by new distinct variables.

    Definition 5.6. LetIbe a relation ofarity n. Every setEofI-entries inducesa transformation of I, denotedE : I {1, 2, . . . , n} dom var, defined asfollows: for each t I, E (t, i) = t(i) if (t, i) E; otherwise E (t, i) is a newdistinct variable not occurring elsewhere.

    For example, let I = {s, t}be as shown next. Let E = {(s, 2 ) , (t, 1 ) , (t, 2)}. SinceE (I) is obviously a linear tableau, variables can be shown as by our nota-tional convention.

    I 1 2 3

    a b c (s)b c d (t)

    E (I) 1 2 3

    a c d

    Clearly, for every relation I and set E of I-entries, we have {}(I) = I, I E (I), and |I| = |E (I)|. The following lemma states that all fixes of a relationIand a coherent set of constraints (has a nonempty model) can be obtainedby replacing certain occurrences of constants in Iby distinct variables.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    21/47

    742 Jef Wijsen

    LEMMA 5.7. Assume arity n. Let be a coherent set of constraints. Let I bea relation and F a linear tableau.

    (1) If F is a fix of I and , then |F| = |I|.(2) For every fix F of I and , there exists a set E of I -entries such that F

    E (I).

    PROOF. For (1), let Fbe a fix ofIand . Clearly, |F| |I|, or else I F, acontradiction. Next assume |F|< |I|. Since is coherent,{} is not a fix, hence|F| 1. Let t = x1,x2 . . . ,xn where x1, . . . ,xn are new distinct variables notoccurring elsewhere. From F F {t}and Fsubconsistent, it follows F {t}subconsistent. Since I F {t} Fis obvious, Fis not a fix, a contradiction.We conclude by contradiction that |F| = |I|.

    For (2), let Fbe a fix ofIand . Since I F and|F| = |I|, we can assumea substitution such that (F) = I. Let E be the set of I-entries such that Econtains ((t), i) ift F andt (i) var. Obviously,E (I) F.

    LEMMA 5.8. Let I be a relation and F a linear tableau such that|I| = |F|.It can be decided in polynomial time whether I F .

    PROOF. Define S := {(s, t) I F | {s} {t}}. Clearly, I F if and onlyif there exists T S such that |T| = |I| and no two distinct pairs ofTagreeon the first or the second coordinate. The latter problem is a two-dimensionalmatching problem, known to be in P.

    THEOREM 5.9. For a satisfiable set of full dependencies, FC()is inP.

    PROOF. Let Ibe a relation and Fa linear tableau. If is not coherent, then(I,F) FC() if and only ifF = {}. If is coherent, then (I,F) FC() if andonly if the following three conditions are simultaneously satisfied:

    (1) |I| = |F| and I F. This can be tested in polynomial time by Lemma 5.8.(2) F is subconsistent. By Theorem 3.14, this can be tested by the chase, which

    is known to have polynomial time data complexity [Beeri and Vardi 1984].

    (3) Ifx is a variable occurring in the ith column of F, anda a constant occur-ring in the i th column of I, then either I idx = a (F) or idx = a (F) is notsubconsistent. For arityn, there are at mostn|I|2 possible substitutions ofthe formidx = a .

    We show that the condition (3) is sufficient to guarantee maximality (with re-spect to ) ofF. Assume that Fsatisfies conditions (1) and (2), but Fis not afix. Then, there is a linear tableau G such that I G F and G is subcon-sistent. We can assume a substitution such that (F) = G and |(F)| = |F|.Since G is linear, cannot map distinct variables of F to the same variable

    in G. So can only rename variables or replace variables by constants. SinceG F, for some variable x in F, for some constant a, (x) = a. Obviously,I G idx = a (F) F, and since G is subconsistent,idx = a (F) is subconsis-tent. So we showed that ifFsatisfies conditions (1) and (2), but Fis not a fix,then for some variablex in F, for some constant a,I idx = a (F) and idx = a (F)subconsistent.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    22/47

    Database Repairing Using Updates 743

    5.3 Consistent Query Answering

    The problem of consistent query answering is defined relative to a fixed set

    of constraints and a fixed tableau query =(B,h), where B and all tuples inhave the same fixed arity n. The problem is to decide membership of the set(recall that R is the set of all relations):

    CQA(, ) := {I R | (I) = {} }.

    That is, on input of a relation Iof fixed arity, we have to decide whether theintersection of the answers to on all uprepairs is empty. A difference withthe problem called consistent query answers in Chomicki and Marcinkowski[2005a], is that our definition is not restricted to Boolean queries. We showthat consistent query answering is in NP for full dependencies and tableauqueries.

    THEOREM 5.10. For a satisfiable set of full dependencies and a Boolean

    tableau query,CQA(, )is inNP.PROOF. Let Ibe a relation of arityn. If is not coherent, then I = {{}},

    so it suffices to verify whether ({}) = {}. Next assume coherent. Since isBoolean, it follows from Lemma 2.6 and Theorem 3.22 that (I) = {}if andonly if for some fix FofIand , we have (G) = {} whereG is the last tableauof a chase ofFby . From Lemma 5.7, it follows that every fix to consider canbe encoded as a set E ofI-entries, whose size is bounded by n|I|.

    A nondeterministic polynomial time algorithm can guess some set E ofI-entries, check in polynomial time whether E (I) is a fix (Theorem 5.9), andif so, compute the last tableau (call it G) of a chase ofE (I) by , and verifywhether(G) = {}.

    COROLLARY 5.11. For a satisfiable set of full dependencies and a tableau

    query,CQA(, )is inNP.

    PROOF. AssumeIa relation of arity n.Let lbethe length of.Let = (B,h).Assume (I) = {}; we can assume a ground tuple t such that t (I).Clearly, all constants that occur in t but not inhmust come from Ior . LetVbe the set of variables that occur in h,and C the set of constants that occur inIor. Then, for some substitution :V C, extended to be the identity otherwise,(h) = t. Let = ((B), (h)), a Boolean tableau query. Clearly, t (I

    )ifandonly ift (I

    ). There are at most (n|I| + l )|V| possibilities for , a numberthat is polynomial in |I|. Consequently, to decide whether I CQA(, ), itsuffices to decide I CQA(, ) for polynomially many Boolean queries .The desired result then follows from Theorem 5.10.

    6. NUCLEUS FOR UPDATE-BASED REPAIRINGLet be a satisfiable set of full dependencies. On input of a relation I, we caneffectively compute a CQ-nucleus of I (Corollary 5.4). This nucleus allowsus to compute consistent answers to any tableau query without query rewrit-ing. However, the nucleus may not be practical because its construction takesexponential time, or even worse, its size is exponential. We show next that the

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    23/47

    744 Jef Wijsen

    Fig. 5. Graphical representation of Pk,n, P k,k , and O m,n k > 1, m 1.

    worst scenario cannot be ruled out. Clearly, if the complexity ofCQA(, ) isNP-complete for some tableau query , then unless P = NP, the constructionof aCQ-nucleus takes exponential time. In Sections 6.2 and 6.3, we introducerestricted query classes that allow nuclei of polynomial size, and we providesufficient conditions under which these nuclei can be computed in polynomialtime. These conditions will be used for identifying practical cases in Sections 7

    and 8.

    6.1 Nucleus of Exponential Size

    We show that for full dependencies and conjunctive queries, the size of nucleimay not be polynomially bounded (Theorem 6.6).

    Definition 6.1. Forn k > 1, define Pk,n as the following relation:

    Pk,n := {b, b, 0,k 1} {b, b, i + 1, i | 0 i n 2}.

    Form 1, define O m as the following tableau:

    Om := {b, b,x0,xm1} {b, b,xi+1,xi | 0 i m 2}.

    Since the constructs are technical, we give a graphical interpretation. Ignoring

    the first and the second column, which are filled by the constant b, the thirdand the fourth column can be represented as a graph. In particular, a tupleb, b, , encodes an edge from node to node . The graph representationsofPk,n, Pk,k and O m are shown in Figure 5.

    Obviously, ifm = k, then O m is homomorphic to Pk,n: simply use the substi-tutiondefined as(xj ) = j , j {0, 1, . . . , m 1}. The following lemma states

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    24/47

    Database Repairing Using Updates 745

    Fig. 6. Graphical representation ofidxk = x l (Om), illustrating Lemma 6.3.

    that a necessary and sufficient condition for Om to be homomorphic to Pk,n isthatm is a multiple ofk .

    LEMMA 6.2. Let n k >1 and m 1. Then Pk,n O m if and only if m is amultiple of k.

    PROOF. It can be easily verified thatO m = {b, b,x(j +1) mod m ,xj modm | j N0}, where N0 = {0, 1, 2, . . . }. If part. Assume m is a multiple of k. Let bea substitution such that for each j {0, 1, . . . , m 1}, (xj ) = j mod k . Lett = b, b,x(i+1) mod m ,xi modm be an arbitrary tuple ofO

    m (i N0). Then

    (t) = b, b, ((i + 1) mod m)) mod k , (i modm) mod k .

    Sincem is a multiple ofk ,

    (t) = b, b, (i + 1) mod k , i modk Pk,n.

    It is correct to conclude(Om) Pk,n, hence Pk,n O m.Only if part. Assume Pk,n Om. Let be a substitution for x0,x1, . . . ,xm1such that (Om) Pk,n. Assume for some j {0,1, . . . , m 1}, (xj ) =l k .Then necessarily(x(j +1) mod m ) = l + 1,(x(j +2) mod m ) = l + 2,. . . Eventually,(xp modm ) = n 1 for some p N0. Let t = b, b,x(p+1) mod m ,xp modm Om. Since (xp modm ) =n 1 does not occur in the rightmost column of P

    k,n,(t) Pk,n, a contradiction. We conclude by contradiction that for each j {0,1, . . . , m 1}, (xj ) < k. That is, (Om) Pk,k. Then is essentially a

    homomorphism from a cycle of length m to a cycle of length k. By the mainresult in Hell and Zhu [1995], it follows that m is a multiple ofk .

    The following lemma states that identifying two distinct variables in Om

    creates a cycle of smaller length. The graphical representation is shown inFigure 6.

    LEMMA 6.3. Letm> l >k 0. There exists an integer p suchthatm > p 1and idxk = xl (O

    m) Op.

    PROOF. Let p = l k . Clearly, m > p 1. Let be a substitution forx0,x1, . . . ,xp1 such that (xj ) = xj +k (j {0,1, . . . , p 1}). Then idxk = xl (Op) idxk = xl (O

    m), henceidxk = xl (Om) O p.

    In broad outline, the remainder goes as follows. We choose a databaseDn of sizen > 1 and a set of full dependencies such that, first, each uprepair contains

    Pk,n for some k with n k > 1 (Lemma 6.5, first item), and second, for eachk, there is an uprepair that contains exactly Pk,n (Lemma 6.5, second item).We know that a nucleus of (Dn) exists (Corollary 5.4), and that this nucleusis unique up to (Theorem 4.2). We then use Lemma 6.2 to show that somenucleus contains the tableau Om withm a multiple of each positive integer ksatisfyingn k, and we rely on Lemma 6.3 to show that this nucleus is reduced.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    25/47

    746 Jef Wijsen

    The desired result then follows from the fact that the least common multiple ofthe firstn positive integers is greater than 2n ifn 7.

    Definition 6.4. For eachn > 1, define Dn as follows:

    Dn 1 2 3 4a 1 n n 1 n 2a 2 n n 2 n 3

    ...a 2 2 1a 1 1 0

    The second column contains negative values only; the third and thefourth columns contain nonnegative values only. Consider the following fulldependencies:

    6 1 2 3 4x y u1 v1x z u2 v2

    y =z

    3 1 2 3 4a u x v

    b b 0 x

    4 1 2 3 4u v x y

    b b x y

    LEMMA 6.5. Let = {6, 3, 4} and n> 1.

    (1) If F is a fix of D n and , and F chase(F, ), then for some k withn k >1, F Pk,n.

    (2) For every k with n k > 1, there exists a fix F of Dn and such that foreach F chase(F, ), Pk,n = {t F | t(1) = t(2) = b}.

    PROOF. It is easy to see that each fix is of the form F shown below, forsome permutationi1, i2, . . . , in1of1,2, . . . , n 1, and for somel such that

    1 l n 1. A chase ofF by ends with F.

    F 1 2 3 4 in1 in1 in1 1

    ... il +2 il +2 il +2 1 il +1 il +1 il +1 1a il il 1

    ...

    a i3 i3 1a i2 i2 1a i1 i1 i1 1

    F 1 2 3 4 in1 in1 in1 1

    ... il +1 il +1 il +1 1a i1 il il 1

    ...a i1 i1 i1 1b b n 1 n 2

    ...

    b b 2 1b b 1 0b b 0 il

    ...b b 0 i2b b 0 i1

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    26/47

    Database Repairing Using Updates 747

    The first part of the proof follows becausel 1. The second part is obtained bychoosingi1 = k 1 andl = 1.

    THEOREM 6.6. Let = {6, 3, 4} and n 7. Every CQ-nucleus of(Dn)

    contains at least2n tuples.

    PROOF. Let G be a CQ-nucleus of (Dn). By Theorems 4.10 and 4.2, Gis-equivalent to an infimum of the set of terminal tableaux that result fromchasing each fix ofDn andby .Let m = lcm(2, 3, . . . , n), where lcm() denotesthe least common multiple.

    We first show G Om. Let Fbe a fix of Dn and . Let F chase(F, ).By Lemma 6.5 (first item), F Pk,n for some k with n k > 1. Since m isa multiple of k, Pk,n Om by Lemma 6.2. Hence, F Om. Since F is anarbitrary fix, it follows G Om.

    Since G Om, we can assume the existence of a substitution such that(Om) G. We show that is injective. Assume identifies two variables of

    Om, say xk and xl with m > l > k 0. Since = idxk = xl is obvious,(idxk = xl (O

    m)) G , soG idxk = x l (Om). By Lemma 6.3, we can assume the

    existence of some pwithm > p 1 such that G Op. Letk {2,3, . . . , n}. ByLemma 6.5 (second item), we can assume the existence of a fix Fsuch that achase of F by results in a tableau F with{t F | t(1) = t (2) = b} = Pk,n.SinceF G and G Op, it followsF Op. Hence, there exists a substitutionsuch that(Op) F. Moreover, sincet (1)= t (2)= b for every tuplet O p,it follows (Op) Pk,n, hence Pk,n Op. By Lemma 6.2, pis a multiple ofk .Sincekis an arbitrary element of{2, 3, . . . , n},p isa multipleof 2, 3, . . . , n. Sincep< m,m = lcm(2, 3, . . . , n), a contradiction. We conclude by contradiction thatis injective, hence m = |Om| = |(Om)| |G|. Since it is known [Nair 1982a;Nair 1982b] that lcm(1, 2, 3, . . . , n) 2n ifn 7, |G| 2n. This concludes theproof.

    6.2 Linear Tableau Queries

    We now turn to query classes that allow nuclei of polynomial size, and we givesufficient conditions under which these nuclei can be computed in polynomialtime. The query classes are obtained fromCQ by limiting the number of occur-rences of quantified variables.

    Definition 6.7. A tableau query = (B,h) is said to belinearif every vari-able that occurs more than once inBalso occurs in h. The class of linear tableauqueries is denotedlinCQ.

    Note that (B,h) linear does not imply B linear. For example, the following

    tableau query asking for models that exist in both upper and bottom pricedversions, is linear:

    Model Version PriceRangex y upperx z bottomx

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    27/47

    748 Jef Wijsen

    We now show that for any relation Iand satisfiable set of full dependen-cies,I has alinCQ-nucleus the size of which is polynomially bounded in |I|.The proof relies on linearizations of tableaux, which are transformations thatcopy constants through and replace each variable occurrence by a new distinctvariable.

    Definition 6.8. LetFbe a tableau of arity n. Alinearization ofFis a trans-formation lin ofFsuch that lin(t, j ) = t (j ) ift (j ) dom; otherwise lin(t, j )is a new distinct variable not occurring elsewhere (t F, j {1, 2, . . . , n}).

    Obviously, iflin is a linearization of F, then lin(F) is a linear tableau. Forexample,

    F 1 2a x

    b x

    x x

    lin(F) 1 2a y

    b z

    v w

    lin(F) 1 2a

    b

    The linearization lin used is defined by lin(a,x, 1) = a, lin(a,x, 2) = y ,lin(b,x, 1) = b, lin(b,x , 2) = z, lin(x,x, 1) = v, and lin(x,x, 2) = w.Since lin(F) is linear, variables can be replaced by the placeholder withoutloss of information.

    Note that since no variable occurs more than once in a linear tableau G , anequivalent reduced tableau can be computed in quadratic time. The computa-tion starts fromG and repeatedly removes some tuplet if there is a tuplesleftsuch that {s} {t}. In the above example, lin(F) can be equivalently reducedby deleting its last tuple.

    We show that the ground tuples in the answer to a linear tableau query arenot affected by linearization.

    LEMMA 6.9. Let F be a tableau and lin a linearization of F . For every lineartableau query ,grd((F)) = grd((lin(F))).

    PROOF. Let = (B,h) be a linear tableau query. It suffices to showgrd((F)) (lin(F)) andgrd((lin(F))) (F).

    SinceF lin(F) is obvious, (F) (lin(F)) by Lemma 3.4. By Lemma 2.4,grd((lin(F))) (F).

    For the inclusion grd((F)) (lin(F)), assumet grd((F)). Hence, thereexists a substitutionsuch that (B) F and(h) = t. Let be a substitutionfor the variables ofBsuch that for every tuple s B, (s)= lin((s));(s(i))=lin((s), i) for eachi {1, 2, . . . , n}.

    To prove that is well-defined, we have two show two things: (i) that is theidentity on constants, and (ii) that is never required to map two occurrencesof the same variable to distinct symbols. Proving (i) is straightforward; for (ii),

    let t1, t2 be (not necessarily distinct) tuples of B and let i, j {1, . . . , n} suchthat t1(i) = t2(j ) var, and either i = j or t1 = t2. Assume t1(i) = t2(j ) = xwithout loss of generality. That is, the same variable x occurs more than oncein B. The definition of imposes (x) = lin((t1), i) and (x) =

    lin((t2), j ),so we must show lin((t1), i) = lin((t2), j ). Since is linear, x occurs in h.Then(x) must be a constant, or else(h) would not be ground, a contradiction.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    28/47

    Database Repairing Using Updates 749

    Since (x) is a constant and the linearization lin copies constants through,lin((t1), i) = lin((t2), j ) = (x). It is correct to conclude that is well-defined.

    We show that (h) = (h): (h(i)) = (h(i)) for eachi {1, . . . , n}. Two casescan occur:

    h(i) is a constant. Since and are the identity on constants, (h(i)) =(h(i)) =h(i).

    h(i) is variable. Since h(i) must occur in B, we can assume s B and j {1, . . . , n} such that s(j ) = h(i). Since (h) is ground, (h(i)) = (s(j )) is aconstant. Since the linearization lin copies constants through, we obtain(h(i)) = (s(j )) = lin((s), j ) = (s(j )) = (h(i)).

    It is correct to conclude(h) = (h) = t.From (B) = lin((B)) and(B) F, it follows(B) lin(F), hence (h) =

    t (lin(F)). Sincet is an arbitrary tuple ofgrd((F)), grd((F)) (lin(F)).This concludes the proof.

    THEOREM 6.10. Let be a satisfiable set of full dependencies. For every re-lation I, there exists a linCQ-nucleus of I that is linear and whose size is

    polynomially bounded in |I|.

    PROOF. By Corollary 5.4, we can assume the existence of aCQ-nucleusG ofI,(I) = grd((G)) for each tableau query. Letlin be a linearization ofG. For any linear tableau query, grd((G))= grd((lin(G))) by Lemma 6.9,hence grd((lin(G))) = (I). It follows thatlin(G) i s a linCQ-nucleus ofI.

    Let Hbe a reduced tableau equivalent to lin(G). By Corollary 3.5, H is alinCQ-nucleus ofI. Every symbol inH is either a constant in asd(I)asd()or a variable, which may be denoted . Let l be the length of. Let n be thearity of I. The number of constants appearing in I and is at most n|I| + l .

    Hence, Hcan contain no more than (n|I| + l + 1)n

    tuples, or else Hwould notbe reduced. This concludes the proof.

    The following corollary is now straightforward.

    COROLLARY 6.11. Leta satisfiable set of full dependencies. If the complexityof the set

    {(I, t) | I R, t is a linear tuple and I

    {t}}

    is inP, then computing alinCQ-nucleus of I and is in polynomial time in |I|.

    6.3 Quantifier-free Tableau Queries

    Quantifier-free tableau queries further restrict linear tableau queries. The fol-

    lowing results correspond to those obtained for linear tableau queries.

    Definition 6.12. A tableau query = (B,h) is said to be quantifier-free ifevery variable that occurs in B also occurs in h. The class of quantifier-freetableau queries is denotedqfCQ.

    Obviously,qfCQ linCQ CQ.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    29/47

    750 Jef Wijsen

    LEMMA 6.13. For every tableau F , for every quantifier-free tableau query,grd((F)) = (grd(F)).

    PROOF. Let =(B,h) be a quantifier-free tableau query. It suffices to showgrd((F)) (grd(F)) and(grd(F)) grd((F)).

    Since F grd(F) is obvious, (F) (grd(F)) by Lemma 3.4. Since(grd(F)) is ground, (grd(F)) grd((F)).

    For the inclusiongrd((F)) (grd(F)), assume an arbitrary ground tuplet grd((F)). We can assume a substitutionsuch that(B) Fand (h) = t.If(B) is not ground, then since (B,h) is quantifier-free, (h) is not ground, acontradiction. We conclude by contradiction that (B) is ground, hence (B) grd(F). If followst (grd(F)).

    THEOREM 6.14. Let be a satisfiable set of full dependencies. For every re-lation I, there exists a qfCQ-nucleus of I that is ground and whose size is

    polynomially bounded in |I|.

    PROOF. Similar to the proof of Theorem 6.10.

    COROLLARY 6.15. Leta satisfiable set of full dependencies. If the complexityof the set

    {(I, t) | I R, t is a ground tuple and I

    {t}}

    is inP, then computing aqfCQ-nucleus of I and is in polynomial time in |I|.

    7. KEY DEPENDENCIES

    In this section and the following one, we show practical cases where the nu-cleus construction (and hence consistent query answering) is tractable. We alsodetermine frontiers of tractability. We first look at relations Iwith a key con-

    straint. For linear tableau queries, the nucleus construction takes O(m logm)time, where m = |I|. Consistent query answering becomes intractable if weomit the linearity restriction.

    The dependency 2 of Figure 1 encodes the key dependency key(Model,Version). It is well-known how to express a key dependency as a set of fegds;the following definition is included for completeness.

    Definition 7.1. Assume arity n. LetK {1, 2, . . . , n}. Letx1, y1,x2, y2, . . . ,xn, yn be distinct variables. Let s = x1, . . . ,xn. Let t be a tuple such that foreveryi {1, 2, . . . , n}, ifi K, then t (i) = xi, else t (i) = yi. We write key(K)for the set of fegds containing ({s, t},xj = yj ) for every j K (1 j n).

    key(K) is called a key dependency, and every i K is called a key attribute.Curly braces will be omitted: we writekey(i1, . . . , ik ) forkey({i1, . . . , ik }).

    Figure 7 illustrates the construction of a linCQ-nucleusG ofIkey(1,2): for everygroupJ Iof tuples that agree on all key attributes,G contains a single tuplet such that t(i) = c ifs(i) = c for each s J; otherwise t(i) is a new distinctvariable not occurring elsewhere (1 i n, c dom). It is clear that for anyrelation I, this construction takes only O(m logm) time, where m = |I|, thetime to sort Iby key values. We show that this construction is correct.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    30/47

    Database Repairing Using Updates 751

    Fig. 7. Gis alinCQ-nucleus ofIkey(1,2) .

    LEMMA 7.2. Assume arity n and let K {1,2, . . . , n}. Let t be a linear tuple.

    Deciding whether Ikey(K)

    {t} is in polynomial time in the cardinality of the

    input relation I.

    PROOF. Let P be the partition of Idefined by the equivalence relation defined as:s s if and only if for each i K, s(i) = s(i) (s,s I). For eachpartition classP P, for each i {1,2, . . . , n}, defineP i := {s(i) | s P }, the setof constants occurring in the ith column ofP . For every tuple r P 1 P 2 P n, letr : P {1,2, . . . , n} dom var denote the transformation such that

    for each s P , for each i {1, 2, . . . , n}, r (s, i) = s(i) ifs(i) = r(i); otherwiser (s, i) is a new distinct variable not occurring elsewhere. It is easy to see thatr (P ) is a fix of P andkey(K), and that a chase ofr (P) bykey(K) ends with{r}, which is ground and hence an uprepair. Hence, for everyr P 1 P n,{r} is an uprepair generated by P and key(K). It can be easily verified that fixescontaining variables in some of the key columns result in uprepairs properlycontaining {r} for some r P 1 P n. Then, P key(K) {t} if and only iffor every i {1, 2, . . . , n}, if t(i) is a constant, then for all s P , s(i) = t(i).Since tuples belonging to different partition classes ofPdisagree on some keyattribute, Ikey(K) {t} if and only if for some P P, P key(K) {t}. It is easyto see that the time complexity for deciding Ikey(K)

    {t} is O(m logm) wherem = |I|.

    THEOREM 7.3. Assume arity n and let K {1, 2, . . . , n}. Constructing alinCQ-nucleus of Ikey(K) is in polynomial time in the cardinality of the inputrelation I.

    PROOF. From Lemma 7.2 and Corollary 6.11.

    This result does not extend to not-necessarily-linear tableau queries, since con-sistent query answering is NP-complete if we omit the linearity requirement.

    THEOREM 7.4. For the Boolean tableau query

    1 2 3x y a

    z y b

    1

    ,

    CQA(key(1), )is NP-complete.

    PROOF. Membership ofNP follows from Theorem 5.10. The proof, inspiredby Theorem 3.3 in Chomicki and Marcinkowski [2005a], is a reduction fromMONOTONE 3SAT. Let =

    mi=1 Ci where each Ci is either a disjunction of

    three negated Boolean variables or a disjunction of three nonnegated Boolean

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    31/47

    752 Jef Wijsen

    variables. Construct a relation Isuch that for each i {1, . . . , n}, for each pthat occurs in Ci, if Ci contains p, then I contains i, p, b, else I containsi, p, a, wherea, b dom. That is, C

    i = p q r gives rise to the tuples in I

    shown below. ForCi = p q r, simply replacea by b. It can be verifiedthat all fixes of I and key(1) must contain F1, F2, or F3 shown below, up to apermutation of p, q, r ; the chase results in G1, G2, and G3, respectively. Noteincidentally that the shown parts ofF1, F2, and F3are not comparable by .

    I 1 2 3...

    i p a

    i q a

    i r a...

    F1 1 2 3...

    i p a

    i a

    i a...

    F2 1 2 3...

    i p a

    i a

    r a...

    F3 1 2 3...

    i p a

    q a

    r a...

    G1 1 2 3...

    i p a...

    G2 1 2 3...

    i p a

    r a...

    G3 = F3

    Let B and hdenote the body and the head of, respectively. We show that has a satisfying truth assignment if and only if(Ikey(1)) = {}. For the if part,assume (Ikey(1)) = {}. Since is Boolean, by Lemma 2.6 and Theorem 3.22,for some fix F of I and , G B whereG is the final tableau in a chase of Fby key(1). For every i {1, 2, . . . , n}, G contains exactly one tuple of the formi, , , where is a Boolean variable and {a, b}.Let be a truth assignmentsuch that(p) = true ifi, p, a G , and(p) = false ifi, p, b G . The samevariable cannot be assigned both true en false, or elseG B, a contradiction.It can be easily verified that is a satisfying truth assignment of. Theonly-if

    partis now straightforward.

    Obviously, when tableau queries can appear disguised as ftgds, the restrictionto quantifier-free tableau queries does not lower the complexity of consistentquery answering.

    COROLLARY 7.5. For the ftgd1 and the BooleanqfCQquery2,

    1 1 2 3x y a

    z y b

    1 1 1

    2 1 2 31 1 11

    ,

    CQA(key(1) {1}, 2)is NP-complete.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    32/47

    Database Repairing Using Updates 753

    8. CONTRADICTION-GENERATING DEPENDENCIES

    We study update-based repairing when all constraints are fegds of the form

    (B, 0 = 1). The nucleus construction (and hence consistent query answering)is tractable for quantifier-free tableau queries, but consistent query answeringbecomes intractable for linear tableau queries.

    8.1 Comprehensive Example

    Definition 8.1. A contradiction-generating dependency (cgd) is an fegd ofthe form (B, 0 = 1), where 0 and 1 are distinct constants.

    An example is the cgd 3 of our running example, which expresses that carmodels available in luxury versions do not exist in bottom priced variants.

    A fortiori, luxury models are never bottom priced (7). Each cgd correspondslogically to a Horn clause without a positive literal: the cgd 3expresses xy z(R(x, luxury,z) R(x, y , bottom)).

    3 Model Version PriceRangex luxury zx y bottom

    0 = 1

    7 Model Version PriceRangex luxury bottom

    0 = 1

    Since3 logically implies 7,{3, 7}and {3}are equivalent sets of constraints.An entry in (the body of) a cgd isirrelevantif it contains a variable that occursonly once in the cgd; otherwise it is relevant. In 3, the irrelevant entries arethe entries containing y andz ; in 7, the entry containingx is irrelevant.

    We illustrate how the following relation CARS= {s, t}can be made subcon-sistent relative to {3, 7}. Obviously, every tableau that is subconsistent withrespect to to a set of cgds is also consistent.

    CARS Model Version PriceRange4 4 luxury medium (s)4 4 luxury bottom (t)

    The substitution 1 = {x/4 4, y/luxury,z/medium} is a one-one homomor-phism from the body of 3 to CARS, hence CARS is inconsistent. In order togain consistency, we need to put a variable at any entry ofCARSthat is relatedvia1 to a relevant entry of3: at the first coordinate ofs or t (both related tox), the second coordinate ofs, or the third coordinate oft . This will be encodedas the fixset D1 = {(s, 1 ) , (t, 1 ) , (s, 2 ) , (t, 3)} ofCARS-entries. Note incidentallythat the third coordinate ofs (medium) is related via 1 to an irrelevant entryof3 (the entry containingz ) and that substituting a variable for medium isuseless for gaining consistency. Likewise, the substitution 2 = {x/4 4} is aone-one homomorphism from the body of 7 to CARS, resulting in the fixset

    D2 = {(t, 2 ) , (t, 3)}. Note here that the entry containingx is not relevant in 7.For the current set of cgds, there are no further one-one homomorphisms.

    Next, if E is a hitting set of {D1,D2} (E intersects both D1 and D2),then CARS is rendered consistent by putting variables at all entries of E.There are four minimal (with respect to ) hitting sets: E1 = {(s, 1 ) , (t, 2)},E2 = {(s, 2 ) , (t, 2)}, E3 = {(t, 1 ) , (t, 2)}, E4 = {(t, 3)}. For example, we show the

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    33/47

    754 Jef Wijsen

    consistent tableauE1 (CARS) obtained fromCARSby putting variables at theentries ofE1(see Definition 5.6):

    E1 (CARS) Model Version PriceRange luxury medium

    4 4 bottom

    8.2 -closed Sets of Contradiction-Generating Dependencies

    The starting point of the introductory example is a set = {3, 7} of cgds suchthat for any tableauF, ifF is inconsistent, then the body of some cgd is one-onehomomorphic to F. The limitation to homomorphisms that are one-one is nota limitation (Theorem 8.3) and eases the technical treatment to follow.

    Definition 8.2. A set of cgds is -closedif and only if for every tableauF, ifF B for some (B, 0 = 1) in , then for some (B, 0 = 1) in , F B.

    To see that {3} is not -closed, it suffices to note that the body of3is not one-one homomorphic to the (inconsistent) body of7. On the other hand, {3, 7} is-closed.

    Theorem 8.3 shows that every set of cgds is equivalent to a -closed setof cgds. Moreover, the proof provides an effective way to construct a -closedequivalent set of cgds. It is not difficult to develop more efficient constructionsusing most general unifiers [Abiteboul et al. 1995].

    THEOREM 8.3. Every set of cgds is logically equivalent to a -closed set ofcgds.

    PROOF. Assumearity n. For every cgd = (B, 0 = 1)of, constructas theset of cgds containing (B, 0 =1) for every tableau B satisfying (i) asd(B) asd(B), (ii) |B| |B|, and (iii) (B) = B for some substitution . Clearly,

    is finite because of (i) and (ii). Note incidentally . Let

    =

    . It iseasy to verify that and are equivalent.To show that is-closed, assume a tableau Fsuch that F B for some

    cgd (B, 0 = 1) in . The construction ensures the existence of a cgd (B, 0 = 1)in such that B B. By transitivity, F B. We can assume a substitutionfor the variables in Bsuch that (B) F. LetC be the set of constants thatoccur in(B) but not in B. For eachc C, let xcdenote a new distinct variablenot occurring elsewhere. Let be the substitution satisfying for each variablex occurring in B, if(x) = cfor somec C, then(x) =xc, else (x) = (x).

    Let G = (B). Obviously, (B) G. We have |G| |B|, G contains noconstants not in B, andG contains not more variables than B. Then, our con-struction ensures that G Hfor some cgd (H, 0 = 1) in. From(B) F, itfollowsF (B). By transitivity, F G . SinceG H, F H. This concludes

    the proof.

    8.3 Hitting Set

    Hitting sets are well-known from complexity theory. Significantly, when werefer to minimal hitting sets in the technical treatment to follow, minimalityrefers to set inclusion, not cardinality.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    34/47

    Database Repairing Using Updates 755

    Definition 8.4. LetUbe a finite set and S 2U. A set E U is a hittingsetofSif for every S S, E S = {}. A hitting set E ofSis said to beminimal

    if no proper subset ofE is also a hitting set ofS.LEMMA 8.5. Let U be a finite set. Let S 2U. Let E be a minimal (with

    respect to ) hitting set of S. For every b E, there exists S S such that

    E S = {b}.

    PROOF. Assumeb E. SinceE is a hitting set ofS, for every S S, E S ={}. Assume that for every S S, E S = {b}. Then, E \ {b} is a hitting set ofS,henceE is not minimal, a contradiction. We conclude by contradiction that forsomeS S, E S = {b}.

    We provide a necessary and sufficient condition for the existence of a minimalhitting set that contains a given elementb.

    THEOREM 8.6. Let U be a finite set. LetS 2U and b U. Then, S has a

    minimal (with respect to ) hitting set containing b if and only if there exists aset Sb Ssuch that b Sband for every S S, if b S, then S Sb.

    PROOF. If part.Assume Sb S such thatb Sband for every S S,b Simplies S Sb, i.e. S \ Sb = {}. Let E be a set such that b E and for everyS S such that b S, E contains an element ofS \ Sb. Clearly, E is a hittingset ofS, andE \ {b} is not a hitting set ofS. It follows that E contains a minimal(with respect to ) hitting set E ofS such that E containsb.

    Only-if part. Assume E is a minimal (with respect to ) hitting set of Scontainingb. By Lemma 8.5, we can assume the existence of an element Sb Ssuch that Sb E = {b}. Assume S S such thatb S . Since E is a hitting setofS, we can assume c E such that c S. Assume c Sb. From c S andb S, it follows c = b. Then Sb E contains two distinct elements b and c , acontradiction. We conclude by contradiction c S

    b. Sincec S , S S

    b.

    Consequently, it can be decided in polynomial time in |S| whether some minimalhitting set ofS contains a given element b.

    8.4 Fixes and Hitting Sets

    An entry in a tableau Fis not relevant if it contains a variable not occurringelsewhere. Relevant entries are used for determining fixsets, as was shown inthe introductory example.

    Definition 8.7. Let Fbe a tableau of arity n. An entry (t, i) ofFis said toberelevantin F ift (i) dom or if for some entry (s, j ) in F, (t, i) =(s, j ) andt(i) = s(j ). We writerelev(F) for the set of relevant entries ofF.

    Let Ibe a relation and a set of cgds. If for some (B, 0 = 1) of, the

    substitutionis a one-one homomorphism from B to I, then the set {((t), i) |(t, i) relev(B)}of I-entries is called a fixsetof I and . We write fixsets(I, )for the set of all fixsets ofIand .

    The set fixsets(I, ) can be constructed in polynomial time in |I|: every cgd(B, 0 = 1) results in no more than |I|m fixsets, where m = |B|. We show thathitting sets correspond to consistent tableaux.

    ACM Transactions on Database Systems, Vol. 30, No. 3, September 2005.

  • 8/10/2019 wijsen-2005-1-ACMTODS.pdf

    35/47

    756 Jef Wijsen

    LEMMA 8.8. Assume arity n. Let I be a relation and a -closed set of cgds.Let E be a set of I -entries (i.e. E I {1, 2, . . . , n}). Then, E is a hitting set offixsets(I, )if and only ifE (I)is consistent.

    PROOF. If part.AssumeE is not a hitting set offixsets(I, ). We can assumea cgd (B, 0 = 1) inand a substitutionsuch that(B) I,identifies no twodistinct tuples of B, and E contains no element of the fixset {((t), i) | (t, i) relev(B)}. Let be the substitution such that for each symbol pthat occurs inB(let (t, i) be a B-entry such thatt (i) = p),(p) = E ((t), i).

    We show that is the identity on constants: ift (i) is a constant (say a) forsome entry (t, i) in B, then, since (t, i) relev(B), ((t), i) E, hence (a) =E ((t), i) = a. To show that is well-defined on variables, we prove that if

    s(j ) = t(j ) is the same variable (say x) for distinct B-entries (s, j ) and (t, i),then E ((t), i) = E ((s), j ). Since (t, i) relev(B) and (s, j ) relev(B), itfollows ((t), i) E and ((s), j ) E. Then, E ((t), i) = (t(i)) = (x) andE ((s), j ) = (s(j )) = (x), hence E ((t), i) = E ((s), j ).

    We have (B) = E ((B)) E (I), hence E (I) is inconsistent.Only-if part. Assume E (I) inconsistent. Since is -closed, there exists

    a cgd (B, 0 = 1) in and a substitution such that (B) E (I) and identifies no two distinct tuples of B. Let be the substitution such that foreveryt E (I), E ((t)) = t. Let = . Then, (B) I and identifiesno two distinct tuples of B. Let K = {((s), j ) | (s, j ) relev(B)}. Clearly,K fixsets(I, ). To show that E is not a hitting set offixsets(I, ), it suffices toshow thatE contains no element ofK. Assume, on the contrary, thatE contains((s), j ) K. Two cases can occur:

    Cases(j ) is a constant. FromE ((s), j ) = E (((s)), j ) = (