CONSTRAINTS AND PREDICATES: A BRIEF TUTORIAL PART 3
by C. J. Date

 

 

 

In Parts 1 and 2 of this article, I showed among other things that each relvar has a relvar predicate and the overall database has a database predicate.  Now, I hope it's obvious that the predicates in question are all ones that are "understood by the DBMS":  They're stated formally (they're part of the database definition, in fact), and of course they're enforced by the DBMS, too.  For such reasons, it's convenient to refer to the predicates in question, on occasion, as internal predicates specifically--principally because relvars and databases also have external predicates, which I want to discuss next. 

 

The first and most significant point is that, while internal predicates are a formal construct, external predicates are an informal construct merely.  Internal predicates are (loosely) what the data means to the DBMS; external predicates, by contrast, are what the data means to the user.  Of course, users have to understand the internal predicates as well as the external ones (why?), but, to repeat, the DBMS has to understand--indeed, can only understand--the internal ones.  In fact, we might say, loosely, that a given internal predicate is the DBMS's approximation to the corresponding external predicate (see later). 

 

Now let's concentrate on relvars specifically (I'll come back and deal with the overall database later). As just suggested, the external predicate for a given relvar is basically what that relvar means to the user.  In the case of the suppliers relvar S, for example, the external predicate might look something like this: 

 

The supplier with the specified supplier number (S#) has the specified name (SNAME) and the specified status (STATUS) and is located in the specified city (CITY).  Moreover, the status value is in the range 1 to 100 inclusive, and must be 20 if the city is London.  Also, no two distinct suppliers have the same supplier number.

 

For the sake of the discussion that follows, however, I'm going to replace this predicate by the following simpler one: 

 

Supplier S# is named SNAME, has status STATUS, and is located in CITY. 

 

After all, the external predicate is (as already stated) only informal, so we're at liberty to make it as simple or as complex as we like. 

 

Now, note that the foregoing statement is indeed a predicate (i.e., it's a truth-valued function):  It has four parameters, viz., S#, SNAME, STATUS, and CITY, corresponding to the four attributes of the relvar, and when arguments of the appropriate types are substituted for those parameters, we obtain a proposition, or something that's categorically either true or false

 

Note: You might notice that I'm using the term 'parameter' now in a sense slightly different from--though not violently different from--the sense in which I was using it in Part II of this article.  In Part II, parameters (and corresponding arguments) denoted whole relations; now, they denote individual attribute values.  In fact, the parameters I'm talking about now correspond very closely to those bound variables that appeared in the formal constraints I showed in Part I of this article. 

 

Each tuple appearing in relvar S at any given time can now be regarded as denoting a certain proposition, obtained by instantiating the foregoing external predicate.  As already noted, the instantiation involves substituting arguments of the appropriate type for the parameters of the external predicate.  For example, if we substitute the S# value S1, the NAME value Smith, the INTEGER value 20, and the CHAR value London for the relevant parameters, we obtain the proposition

 

Supplier S1 is named Smith, has status 20, and is located in London. 

 

Likewise, if we substitute the S# value S2, the NAME value Jones, the INTEGER value 10, and the CHAR value Paris, we obtain

 

Supplier S2 is named Jones, has status 10, and is located in Paris. 

 

And so on.     

 

To repeat, tuples of the relvar correspond to instantiations of the external predicate.  And--very important!--those particular instantiations (i.e., those particular propositions) are ones that are understood by convention to be ones that evaluate to true.  For example, if the tuple

 

{S# S#('S1'),SNAME NAME('Smith'),STATUS 20,CITY 'London'}

 

indeed does appear in relvar S at some time, then we're to understand that it's a "true fact" that there does exist at that time a supplier with supplier number S1, named Smith, with status 20, and located in London (Yes, I know that facts are true by definition and "true fact" is a pleonasm, but--like Edward Abbey--I like the emphasis). More generally: 

 

IF (s * S) = true THEN XPS (s) = true

 

Here:

 

·   s is a tuple of the form { S# s#, SNAME sn, STATUS st, CITY sc }, where s# is a value of type S#, sn is a value of type NAME, st is a value of type INTEGER, and sc is a value of type CITY); 

·   XPS is the external predicate for suppliers;

·   XPS(s) is the proposition obtained by instantiating XPS with the argument values S# = s#, SNAME = sn, STATUS = st, and CITY = sc

 

However, we go a step further with external predicates than we do with internal ones.  To be specific, we adopt the Closed World Assumption, which says that if an otherwise valid tuple does not appear in the relvar at some time, then the corresponding proposition is understood by convention to be one that evaluates to false at that time.  For example, if the tuple

 

{S# S#('S6'),SNAME NAME('Lopez'),STATUS 30,CITY 'Madrid'}

 

does not appear in relvar S at some particular time, then we're to understand that it's not the case that there exists at that time a supplier with supplier number S6, named Lopez, with status 30, and located in Madrid.  More generally: 

 

IF NOT ((s * S) = true) THEN NOT (XPS (s) = true)

 

Or more succinctly: 

 

IF NOT (s * S) THEN NOT XPS (s)

 

Putting the foregoing together, we have simply: 

 

s * S * XPS (s)

 

In words:  A given tuple appears in a given relvar at a given time if and only if that tuple causes that relvar's external predicate to evaluate to true at that time (the symbol "*" can be read as "if and only if" or "is equivalent to").  It follows that a given relvar contains all and only the tuples that correspond to true instantiations of that relvar's external predicate (at the particular time in question). 

 

I've said that external predicates, and the propositions obtained by instantiating such predicates, aren't--and in fact can't be--"understood by the DBMS."  There's no way, for example, that the DBMS can know what it means for a "supplier" to "be located" somewhere, or what it means for a "supplier" to "have a status" (etc.).  All such issues are matters of interpretation; they're matters that make sense to the user but not to the system.  For example, if the supplier number S1 and the city name London happen to appear together in the same tuple, then the user can interpret that fact to mean that supplier S1 is located in London, but (to repeat) there's no way the system can do anything analogous. 

 

What's more, even if the DBMS could know what it means for a supplier to be located somewhere, it still couldn't know a priori whether what the user tells it is true!  If the user tries to assert to the system (e.g., by executing an INSERT operation) that supplier S1 is located in London, there's no way for the DBMS to know whether that assertion is true.  All the DBMS can do is make sure that the user's assertion doesn't lead to any constraint violations (i.e., it doesn't cause any internal predicate to evaluate to false).  Assuming it doesn't, then the DBMS must accept the user's assertion and treat it as true from that point forward (or, rather, until the user tells the DBMS--typically via a corresponding DELETE operation--that it isn't true any more, of course). 

 

By the way, the foregoing paragraph shows clearly why the Closed World Assumption doesn't apply to internal predicates.  To be specific, a tuple might satisfy the internal predicate for a given relvar and yet validly not appear in that relvar, because it doesn't correspond to a true proposition in the real world. 

 

We can summarize all of the above by saying that, informally, the external predicate for a given relvar is the intended interpretation for that relvar.  As such, it's important to the user, but not (very) important to the DBMS.  We can also say, again informally, that the external predicate for a given relvar is the criterion for acceptability of updates on the relvar in question--i.e., it dictates,  at least in principle, whether a particular INSERT or DELETE or UPDATE operation on that relvar can be allowed to succeed.  Ideally, therefore, the DBMS would know and understand the external predicate for every relvar, so that it could deal correctly with all possible attempts to update the database.  Unfortunately, however, we've already seen that this goal is unachievable; that is, the DBMS cannot know any given external predicate.  But it does know a good approximation:  It knows the corresponding internal predicate--and that's what it will enforce.  (Thus, the pragmatic "criterion for acceptability of updates," as opposed to the ideal one, is the internal predicate, not the external one.)  Another way of saying the same thing is this: 

 

The DBMS can't enforce truth, only consistency. 

 

That is, the DBMS can't guarantee that the database contains only true propositions--all it can do is guarantee that it doesn't contain anything that causes any integrity constraint to be violated (i.e. it doesn't contain any inconsistencies).  Sadly, truth and consistency aren't the same thing!  Indeed, we can observe that: 

 

·   If the database contains only true propositions, then it's consistent, but the converse isn't necessarily so.

·   If the database is inconsistent, then it contains at least one "false fact," but the converse isn't necessarily so. 

 

More succinctly:  Correct implies consistent (but not the other way around), and inconsistent implies incorrect (but not the other way around).  By my use of the term correct here, of course, I mean to say that the database is correct if and only if it fully reflects the true state of affairs in the real world.  

 

I've said that external predicates are important to the user.  Why?  Well, because (as previously stated) those external predicates represent the intended interpretation for the corresponding relvars.  The point is this:  Clearly, each relvar is supposed to represent some aspect of the real world--in other words, each relvar is supposed to have a meaning--and users need to know those meanings (i.e., those intended interpretations) if they're to be able to use the database effectively and efficiently.  That's all. 

 

There's another point to be made here, though.  The fact is, all this business of external predicates is very closely related to the business of (logical) database design.  Indeed, at a certain rather high level of abstraction (and ignoring the fact that, in practice, the process is highly iterative, as we all know), we can characterize the design process as consisting of the following two steps: 

 

1.      Pin down the external predicates as carefully as possible--albeit informally, of course.

2.      Map the output from the first step into a set of formal relvars and corresponding internal predicates. 

 

Which brings me to a small piece of unfinished business.  I've now discussed external relvar predicates at some length.  What about the overall database?  Well, as I'm sure you've guessed, the external predicate for the overall database--which represents the intended interpretation for that database--is simply the conjunction (i.e., the logical AND) of the external predicates for all of the relvars in that database. 

 

There's another important point that I've been ducking so far but I now need to spell out, as follows:  Basically, everything I've been saying applies to all relvars, not just to what THE THIRD MANIFESTO calls real ones (i.e., what SQL would call "base tables").  In particular, it applies to views ("virtual relvars," as THE THIRD MANIFESTOcalls them).  Thus, views too are subject to constraints, and they have relvar predicates, both internal and external.  For example, suppose we define a view--let's call it SST--by projecting the suppliers relvar over attributes S#, SNAME, and STATUS (thereby effectively removing attribute CITY).  Then the external predicate for view SST looks something like this:

 

There exists some city CITY such that supplier S# is named SNAME, has status STATUS, and is located in CITY.

 

Observe that, as required, this predicate does have three parameters, not four, corresponding to the three attributes of relvar SST (CITY is now no longer a parameter but a bound variable instead, thanks to the fact that it is quantified by the phrase "there exists some city").  Another, perhaps clearer, way of making the same point is to observe that the predicate as stated is logically equivalent to this one: 

 

Supplier S# is named SNAME, has status STATUS, and is located in some city.

 

This version of the predicate very clearly has just three parameters. 

 

What about the internal predicate?  Here again are the six constraints from Part I of this article: 

 

1.      Every supplier status value is in the range 1 to 100 inclusive. 

2.      Every supplier in London has status 20. 

3.      If there are any parts at all, at least one of them is blue. 

4.      No two distinct suppliers have the same supplier number. 

5.      Every shipment involves an existing supplier. 

6.      No supplier with status less than 20 supplies any part in a quantity greater than 500. 

 

Of these six, the third is clearly irrelevant as far as view SST is concerned, since it has to do with parts, not suppliers.  As for the rest, each of them does also apply to view SST, but in a slightly modified form.  Here, for example, is the modified form of Number 5: 

 

FORALL s#*S#,p#*P#,q*QTY

 (IF {S# s#,P# p#,QTY q}*SP

  THEN EXISTS sn*NAME,st*INTEGER

    ({S# s#,SNAME sn,STATUS st}*SST))

 

The changes are in the third and fourth lines:  All references to CITY have been dropped, and the reference to S has been replaced by a reference to SST.  Note that we can regard this constraint for SST as being derived from the corresponding constraint for S, much as relvar SST itself is derived from relvar S.  Note too that (to use SQL terminology) this constraint for SST is effectively a foreign key constraint from a base table to a view! 

 

Analogous remarks apply directly to Numbers 1, 2, 4, and 6.  Number 2 is slightly tricky, though.  Here it is: 

 

FORALL s#*S#,sn*NAME,st*INTEGER

 (IF {S# s#,SNAME sn,STATUS st}*SST

  THEN EXISTS sc * CHAR

    ({S# s#,SNAME sn,STATUS st,CITY sc}*S AND

     (IF sc = 'London'

      THEN st = 20))

 

Again, however, we can regard this constraint as being derived from the corresponding constraint for S. 

 

In Part I of this article, I mentioned that several people have tried to come up with a taxonomy or classification scheme for constraints.  In particular, I indicated that Hugh Darwen and I follow a particular classification scheme of our own in THE THIRD MANIFESTO.  I'd like to sketch that scheme briefly here. In essence, we divide constraints into database, relvar, attribute, and type constraints.  In outline: 

 

·   A database constraint is a constraint on the values a given database is permitted to assume;

·   A relvar constraint is a constraint on the values a given relvar is permitted to assume;

·   An attribute constraint is a constraint on the values a given attribute is permitted to assume;

·   A type constraint is, precisely, a definition of the set of values that constitute a given type. 

 

Earlier parts of this article didn't really discuss type constraints at all.  Such constraints are conceptually very simple, however.  For example, we might have a type constraint for type POINT (meaning points in two-dimensional space) that says in effect that something is a valid point if and only if it can be represented by a pair of numeric values x and y such that x and y are both in the range (say) -100.0 to +100.0.  Type constraints are checked as part of the execution of the corresponding selector.  I should explain that selectors are operators that allow us to "select" (or specify) an arbitrary value of the type in question.  For example, here are a couple of examples of selector invocations for type POINT: 

 

POINT (5.0, 2.5)

/* selects the point with x = 5.0, y = 2.5 */

 

POINT (xxx, yyy )

/* selects the point with x = xxx, y = yyy--*/

/* xxx and yyy are variables of type numeric  */

 

As you can see, selectors--or, more precisely, selector invocations--are a generalization of the familiar concept of a literal.  (All literals are selector invocations, but not all selector invocations are literals.  More precisely, a selector invocation is a literal if and only if all of its operands are literals in turn.) 

 

Second, attribute constraints are basically what I called "a priori constraints" in the first part of this article; in other words, an attribute constraint is basically just a statement to the effect that a specified attribute of a specified relvar is of a specified type.  For example, consider the following relvar definition once again:

 

VAR s RELATION                                               

 {s# s#,                                             

  sname name,                                           

  status INTEGER,                                        

  city CHAR}                                           

  KEY {s#};                                             

                                                                 

In this relvar: 

 

·  Values of attribute S# are constrained to be of type S#;

·  Values of attribute SNAME are constrained to be of type NAME;

·  Values of attribute STATUS are constrained to be of type INTEGER;

·  Values of attribute CITY are constrained to be of type CHAR. 

 

Finally, relvar and database constraints are what I've been concentrating on throughout most of this article.  Just to remind you, a relvar constraint is a constraint that involves exactly one relvar, while a database constraint is a constraint that involves two or more relvars.  As noted in Part I, however, the distinction between these two kinds of constraints isn't very important from a theoretical point of view (though it might sometimes be useful from a pragmatic one).  In particular, relvar and database constraints are both checked "immediately," as I explained in detail in Part II.  Of course, the same is effectively true for type and attribute constraints as well. 

 

Finally, I note that transition constraints are subsumed by the foregoing scheme.  A transition constraint is a constraint on the legal transitions that a given relvar or database can make from one value to another (for example, "amount of tax due can never decrease").  Provided we have a way to refer within a single expression to both (a) the value of the relvar or database in question before any given update and (b) the value of that same relvar or database after that update, then we have the means to formulate any required transition constraint. 

 

I'd like to finish up this article by offering a slightly more formal perspective on some of what I've been saying.  I said in Part II that a database is a collection of true propositions.  In fact, a database, together with the operators that apply to the propositions in that database, is a logical system.  And when I say "logical system" here, I mean a formal system--like euclidean geometry, for example--that has axioms ("given truths") and rules of inference by which we can prove theorems ("derived truths") from those axioms.  Indeed, it was Ted Codd's very great insight, when he first invented the relational model back in 1969, that a database isn't really just a collection of data (despite the name); rather, it's a collection of facts, or what the logicians call true propositions.  Those propositions--the given ones, which is to say the ones in the "base tables"--are the axioms of the logical system under discussion.  And the inference rules are essentially the rules by which new propositions are derived from the given ones; in other words, they're basically the operators of the relational algebra.  Thus, when the DBMS evaluates some relational expression (in particular, when it's responding to some query), it's really deriving new truths from given ones; in effect, it's proving a theorem! 

 

Once we recognize the foregoing, we see that the whole apparatus of formal logic becomes available for use in attacking "the database problem."  In other words, questions such as

 

·   What should the database look like to the user?

·   What should the query language look like?

·   How should results be presented to the user?

·   How can we best implement queries (or, more generally, evaluate database expressions)?

·   How should we design the database in the first place?

 

(and probably others) all become, in effect, questions in logic that are susceptible to logical treatment and can be given logical answers. 

 

Of course, it goes without saying that the relational model supports the foregoing perception of what databases are all about very directly--which is why, in my opinion, the relational model is rock solid, and "right," and will endure.  It's also why, again in my opinion, other "data models" are simply not in the same ball park as the relational model ... Indeed, I seriously question whether those other "models" deserve to be called models at all, in the same sense that the relational model can be called a model.*  Certainly most of them are ad hoc to a degree, instead of being firmly founded, as the relational model is, in set theory and formal logic.  In this connection, I'd like to refer you to two other articles of my own, Why 'The Object Model' Is Not a Data Model and Object Identifiers vs. Relational Keys,and especially to an article by Hugh Darwen, What a Database Really Is: Predicates and Propositions(this last is an excellent and highly readable tutorial on the idea of "database plus operators = a logical system").  All of these articles are included in the book RELATIONAL DATABASE WRITINGS 1994-1997, by Hugh Darwen, David McGoveran, and myself.

 

Finally, given that a database together with the relational operators is indeed a logical system, I'd like to come back to the primary topic of this article and reemphasize the absolutely vital importance of integrity constraints.  If the database is in violation of some integrity constraint, then the logical system we're talking about is inconsistent.  And, as I claimed in Part II of this article, you can get absolutely any answer at all from an inconsistent system.  Let me conclude by demonstrating this fact.  Suppose the system in question is such that it implies that both p and NOT p are true (there's the inconsistency), where p is some proposition.  Now let q be some arbitrary proposition.  Then: 

 

·   From the truth of p, we can infer the truth of p OR q

·   From the truth of p OR q and the truth of NOT p, we can infer the truth of q

 

But q was arbitrary!  It follows that any proposition whatsoever can be shown to be true in an inconsistent system.

 

 

Posted 05/17/02