CONSTRAINTS AND PREDICATES: A BRIEF TUTORIAL PART 2
by C. J. Date

 

 

 

This is the second part of a three-part tutorial article on the subject of integrity constraints. In Part 1, after a number of preliminaries, I discussed a set of six examples, giving in each case a natural-language statement and a formal expression of the constraint in question.  Now I want to move on to introduce some simple but fundamental concepts and terminology. 

 

Let's return to the first example from Part I (Every supplier status value is in the range 1 to 100 inclusive).  Here again is the precise formulation: 

 

FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR

 (IF {S# s#,SNAME sn,STATUS st,CITY sc}*S

  THEN st*1 AND st*100)

 

As I pointed out in Part I, this constraint involves just a single relvar; in fact, it involves just one variable of any kind--namely, the suppliers relvar S.  Let me immediately qualify that remark!  In fact, of course, the constraint does also involve certain so-called "bound" variables: namely, s#, sn, st, and sc.  However, bound variables aren't variables in the usual programming language sense--instead, they're variables in the sense of logic. You can think of a bound variable as a kind of "dummy" variable, in a sense.  For consider: 

 

·   If we replace all appearances within the formal expression shown above of, e.g., the symbol st by any other symbol, say the symbol xyz, the constraint remains logically unchanged.

·   By contrast, the same is certainly not true if we replace all appearances of, e.g., the symbol S--which denotes a "nondummy" variable, of course--by, say, the symbol SP. 

 

From this point forward, I'll take the term 'variable' to mean a variable in the usual programming language sense specifically, not a variable in the sense of logic (barring explicit statements to the contrary, of course). 

 

To say it again, then, the formal expression of the constraint involves a variable, S. Thus, although that expression is indeed truth-valued, we can't say what its value is--i.e., we can't say what truth-value it yields--until we substitute a value for that variable.  (Indeed, different substitutions will yield different truth-values, in general.)  In other words, the expression is a predicate.  What's a predicate?  In general, a predicate is a truth-valued function (i.e., a function that, when it's invoked, returns a truth value); it has a set of parameters (as do all functions), and an argument value has to be supplied for each such parameter when the function is invoked ("when the predicate is instantiated," as the logicians say).  In the case at hand, then, when we want to instantiate the predicate--which is to say, when we want to check the constraint--we supply as argument (the sole argument, as it happens) the relation that's the current value of relvar S, and the expression can then be evaluated. 

 

Now, when we do instantiate that predicate, and in effect replace the sole parameter by some argument, we wind up with a truth-valued expression that involves no variables at all, only values.  And, of course, analogous remarks apply to constraints involving two, three, four, ..., or any number of relvars; in all cases, when we need to evaluate the expression (i.e., when we need to check the constraint), we replace each parameter by the current value of the applicable relvar, and what we wind up with is an expression that involves no variables at all. 

 

In logic, a truth-valued expression that involves no variables at all is said to be a proposition. A proposition thus evaluates to either true or false, unequivocally (it can be thought of as a degenerate predicate--i.e., a predicate that involves zero variables).  Here are a few simple examples: 

 

·   The sun is a star.

·   The moon is a star.

·   The sun is further away from us than the moon.

·   George W. Bush won the US presidential election in the year 2000. 

 

I'll leave it to you to decide which of these propositions evaluate to true and which to false.  Do please note, however, that not all propositions are ones that evaluate to true (to think otherwise is a mistake all too easily made). 

 

Terminology:  Propositions are also said to be closed expressions, while expressions that involve at least one variable are said to be open expressions. In general, therefore, propositions are closed expressions and predicates are open ones (except for the degenerate case in which the predicate in question is in fact a proposition). 

 

Anyway, it should be clear from all of the above that, logically speaking, a constraint is a predicate--but when the constraint is checked, argument values are substituted for the parameters, thereby reducing the predicate to a proposition, and that proposition is then required to evaluate to true

 

Of course, any given relvar will be subject to many constraints, in general.  Let R be a relvar.  Then the relvar predicate for R is the 'logical AND' or conjunction of all of the constraints that apply to (i.e., mention) that relvar R.  Please don't get confused here!--each individual constraint is a predicate in its own right, as we already know, but "the" relvar predicate is the conjunction of all of those individual predicates.  For example, if we assume for simplicity that the six constraints discussed in Part I of this article are the only constraints that apply to the suppliers and parts database (apart from a priori ones), then the relvar predicate for suppliers is the conjunction of Numbers 1, 2, 4, 5, and 6, and the relvar predicate for shipments is the conjunction of Numbers 5 and 6.  Note that these two relvar predicates "overlap" each other, in a sense, inasmuch as they have certain constituent constraints in common. 

 

Note: I might possibly be accused of moving the goal posts here, slightly. In my book AN INTRODUCTION TO DATABASE SYSTEMS, I defined the relvar predicate for relvar R to be the conjunction of all relvar constraints that applied to R (where, as explained in Part I, the term 'relvar constraint' means a constraint that refers to just one relvar).  In THE THIRD MANIFESTO, by contrast, Hugh Darwen and I refined that definition; to be specific, we now define the relvar predicate for relvar R to be the conjunction of all constraints, not just all relvar constraints, that apply to R.  Apologies to anyone who might find this shift in terminology confusing. 

 

Now let R be a relvar, and let RP be the relvar predicate for R.  Clearly, then, R must never be allowed to have a value that, when it's substituted for R in RP (and when any other necessary argument-for-parameter substitutions have also been made in RP), causes RP to evaluate to false.  In fact, I can now introduce "The Golden Rule" (or the first version of that rule, at any rate): 

 

No update operation must ever assign to any relvar a value that causes its relvar predicate to evaluate to false. 

 

Now let D be a database, and let D contain relvars R1, R2, ...Rn (only).  Let the relvar predicates for those relvars be RP1, RP2, ..., RPn, respectively.  Then the database predicate for D--DP, say--is the conjunction of all of those relvar predicates: 

 

DP  *  RP1 AND RP2  AND ... AND RPn

 

As an aside, let me remind you that (as we've seen) two distinct relvar predicates RPi and RPj might have certain constituent constraints in common.  It follows that the very same constraint might appear many times over in the database predicate DP.  From a logical point of view, of course, there's no harm in this state of affairs, because if c is a constraint, then c AND c is logically equivalent to just c--but I naturally hope the system would be smart enough in such a situation to evaluate c once and not twice. 

 

Here then is the extended (more general, and final) version of the Golden Rule: 

 

No update operation must ever assign to any database a value that causes its database predicate to evaluate to false. 

 

Of course, a database predicate will evaluate to false if and only if at least one of its constituent relvar predicates does so too.  Likewise, a relvar predicate will evaluate to false if and only if at least one of its constituent constraints does so too. 

 

I'll have more to say about The Golden Rule in a few moments.  First, however, I want to consider the question of what everything I've said so far implies for the practical issue of actually doing constraint checking.  Once again, let's return to our very first example (Every supplier status value is in the range 1 to 100 inclusive): 

 

FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR

 (IF {S# s#,SNAME sn,STATUS st,CITY sc}*S

  THEN st*1 AND st*100)

 

As I pointed out in Part I of this article, this constraint is effectively saying that IF a certain tuple appears in relvar S, THEN that tuple has to satisfy a certain condition ("status in the range 1 to 100," in the case at hand).  So, apparently, if we try to insert a new supplier tuple with status (say) 200, the sequence of events has to be: 

 

1.      Insert the new tuple.

2.      Check the constraint.

3.      Undo the update (because the check fails). 

 

But this is absurd!  Clearly, we would like to catch the error before the INSERT is done in the first place.  So what the implementation clearly has to do is use the formal expression of the constraint to infer the appropriate check(s) to be performed on tuples that are presented for insertion.  I don't want to get into details of what's involved in that inference process here.  However, I think you can at least see that if the database predicate includes a constraint of the form

 

IF {S# s#,SNAME sn,STATUS st,CITY sc}*S

THEN ...

 

-- i.e., if the antecedent (the piece between the IF and the THEN) in some implication within the overall predicate is of the form "some tuple appears in S"--then the consequent (the piece after the THEN) in that implication is essentially a constraint on tuples that are presented for insertion into relvar S. 

 

I remark too that if the database is designed in accordance with the Principle of Orthogonal Design--see the book mentioned a couple of times already, AN INTRODUCTION TO DATABASE SYSTEMS--then any tuple presented for insertion will be a plausible INSERT candidate for at most one relvar in the database (implying that the tuple will have to be checked against the relvar predicate for at most one relvar in that database).  But now I'm beginning to stray into an area that deserves a sizable discussion of its own, and I think I'd better leave it to a subsequent article. 

 

Back to theGolden Rule.Now, I didn't point out the fact explicitly before, but you might have realized for yourself that the rule as stated implies that all checking is immediate.  Why?  Because it talks in terms of update operations--in other words, INSERT, DELETE, and UPDATE operations, loosely speaking--and not in terms of transactions (see below).  In effect, therefore, theGolden Rulerequires integrity constraints to be satisfied at statement boundaries, and there's no notion of "deferred" or COMMIT-time integrity checking at all. 

 

In order to explain the foregoing properly, I first need to say a little more about transactions.  Of course, transactions are another big topic in their own right, even if we limit our attention (as I want to do here) to their integrity aspects specifically, so what follows is only the briefest of brief sketches.  I plan to elaborate on the points involved in another subsequent article. 

 

First of all, then, I do assume you're familiar with the transaction concept: in particular, with the so-called "ACID properties" of transactions. ACID is an acronym for atomicity - consistency - isolation - durability.  Atomicity means transactions are "all or nothing."  Consistency means they transform a consistent state of the database into another consistent state, without necessarily preserving consistency at all intermediate points.  Isolation means that any given transaction's updates are concealed from all other transactions, until the given transaction commits.  Durability means that once a transaction commits, its updates survive in the database, even if there's a subsequent system crash.  The standard reference on transactions is the highly recommended book TRANSACTION PROCESSING: CONCEPTS AND TECHNIQUES, by Jim Gray and Andreas Reuter (Morgan Kaufmann, 1993). 

 

Back to the Golden Ruleonce again.  As you'll surely realize, the idea that all checking must be immediate, not deferred, is a very unorthodox one.  After all, the SQL standard certainly includes support for deferred checking, and I believe there's at least one commercial product--namely, IBM's SQL/DS--that implements such a capability (in fact, I believe SQL/DS actually allows checking to be deferred past COMMIT time, which isn't allowed in the standard).  Moreover, one of the arguments in favor of the transaction concept has always been that transactions are supposed to be (among other things) a unit of integrity:  As I said a few moments ago, they're supposed to "transform a consistent state of the database into another consistent state, without necessarily preserving consistency at all intermediate points."  So do I contradict myself?  Very well, I contradict myself.  To be specific, I no longer believe in transactions as a "unit of integrity"; instead, I now think statements have to be that unit. 

 

So why have I changed my mind?  I have at least three reasons, but easily the biggest is as follows.  As I've written elsewhere, a database can be regarded as a collection of propositions, assumed by convention to be ones that evaluate to true.  And if that collection is ever allowed to include any inconsistencies,  then all bets are off!  You can never trust the answers you get from an inconsistent database; in fact, you can get absolutely any answer whatsoever from such a database (indeed, it's easy to prove that you can get absolutely any answer you like--or don't like--from such a database).  While proper concurrency control, or in other words the isolation or "I" property of transactions, can mean that no more than one transaction ever sees any particular inconsistency, the fact remains nonetheless that that particular transaction does see the inconsistency and can thus produce wrong answers.  (Indeed, it's precisely because inconsistencies can't be tolerated, not even if they're visible to no more than one transaction at a time, that we need the constraints to be enforced in the first place--and that's why integrity is such a fundamental feature of a database system.) 

 

Second, I don't really agree with the argument that any given inconsistency can be seen by just one transaction, anyway.  For consider: 

 

·   Obviously, transactions must be allowed to communicate with the outside world (where by "the outside world" I mean the world outside the database). 

·   It follows that it's only because of certain protocols--certain un-enforced (and in fact unenforceable) protocols, I might add--that transactions are isolated from one another.  If transaction T1 obtains some incorrect information from the database and writes it to a file F, and transaction T2 then reads that same information from file F, then T1 and T2 aren't really isolated from each other.  Even if they run at totally different (i.e., non-overlapping) times! 

·   In the same kind of way, a transaction can obviously provide incorrect information to the end user. 

 

In other words, I don't really believe in the "I" property of transactions. 

 

Third, semantic optimization requires the database to be consistent at all times, not just at transaction boundaries.  (I don't really want to get into details of semantic optimization here; suffice it to say that it's a technique for using integrity constraints to simplify queries--in order to improve performance, of course.  Clearly, if the constraints aren't satisfied, then the simplifications will be invalid, and the query answers will be incorrect, in general.) 

 

But--I hear some readers objecting--surely some integrity checks have to be deferred, don't they?  As a trivial example, consider the constraint "Supplier S1 and part P1 are in the same city."  If supplier S1 moves, say from London to Paris, then part P1 must move from London to Paris as well.  The conventional solution to this problem is to wrap the two updates up into a single transaction, like this: 

 

BEGIN TRANSACTION ;

 UPDATE s WHERE s# = s# ('S1') city := 'Paris' ;

 UPDATE p WHERE P# = P# ('P1') city := 'Paris' ;

COMMIT ;

 

In this conventional solution, the constraint is defined to be DEFERRED, and the checking is done at COMMIT--and the database is inconsistent between the two UPDATE operations.  Note in particular that if the same transaction were to ask the question "Are supplier S1 and part P1 in the same city?" between the two UPDATEs, it would get the answer no

 

By contrast, THE THIRD MANIFESTO solves this kind of problem by introducing a multiple assignment operator, which lets us carry out several assignments as a single operation, without any integrity checking being done until all of the assignments in question have been executed.  (INSERT, DELETE, and UPDATE are of course just shorthand for certain assignment operations, as I pointed out in an earlier article in this occasional series, viz., There's Only One Relational Model!  For example: 

 

UPDATE s WHERE s# = s# ('S1') city := 'Paris' ,

UPDATE P WHERE P# = P# ('P1') city := 'Paris' ;

 

Note the comma separator, which means the two UPDATEs are both part of the same statement.  (Of course, I don't mean to suggest that the solution to the problem under consideration can be obtained by a tiny change in syntax!  Rather, the point is that we do need to be able to perform the two UPDATEs as part of the same statement--in parallel, in fact--and so we need some syntax to specify the necessary "bundling."  So we use a comma for that purpose, and no integrity checking is done "until we get to the semicolon."  Note in particular that there's now no way for the transaction to see an inconsistent state of the database between the two UPDATEs, because the notion of "between the two UPDATEs" is one that has no meaning in this context. 

 

Given the availability of such an operator, there's now no need at all for deferred checking in the traditional sense (i.e., checking that's deferred to end-of-transaction).  As already indicated, however, this is a topic that deserves an article of its own, and I hope to write that article soon. 

 

(To be continued in Part 3

 

 

Posted 05/10/02