This three-part article is
intended as a tutorial on the crucial notion of integrity constraints
(which I'll usually abbreviate to just constraints) and the associated and
equally crucial notion of predicates. As I've written elsewhere--see,
e.g., the book FOUNDATION OF
DATABASE MANAGEMENT SYSTEMS: THE THIRD MANIFESTO, by Hugh
Darwen and myself, referred to hereinafter as THE THIRD MANIFESTObook,
or just THE MANIFESTO for short--these notions are absolutely
fundamental, and yet they don't seem to be very well understood in the database
community at large. Indeed, they seem to be surrounded by confusion (not least
in the academic world, as I'll try to show in a moment). My aim in this article is to clear up some
of that confusion, if I can. (In passing, let me acknowledge helpful reviews of
earlier drafts of this article by Hugh Darwen and also by Fabian Pascal.)
Of course, I'm sure most readers
will agree that the integrity constraint concept is easy enough to understand
at an intuitive level; intuitively, a constraint is just a conditional
expression--also known as a boolean, truth-valued,
or logical expression--that's required to evaluate to true. Here
are a few simple examples, all based on the well-known suppliers and parts
database (see later for a definition of that database):
1.
Every supplier status value
is in the range 1 to 100 inclusive.
2.
Every supplier in London has
status 20.
3.
If there are any parts at
all, at least one of them is blue.
4.
No two distinct suppliers
have the same supplier number.
5.
Every shipment involves an
existing supplier.
6.
No supplier with status less
than 20 supplies any part in a quantity greater than 500.
And so on.
Now, these examples are all fairly
straightforward, and I'm sure you didn't have any difficulty in understanding
any of them. If history is anything to go by, however, pinning matters down
more precisely turns out to be a little trickier than might have been
expected. In support of this claim, I
could cite among other things the numerous attempts--most of them not very
successful--to come up with a sensible taxonomy or classification scheme for
constraints. I've made several such
attempts myself!--see, e.g., my RELATIONAL DATABASE
WRITINGS series, especially the 1991-1994 and 1994-1997
volumes and the two editions, coauthored with Hugh Darwen, of THE THIRD MANIFESTO
book mentioned above. Other writers who have also tried to come up with
classification schemes include:
·
Ted Codd (see his book THE RELATIONAL MODEL FOR
DATABASE MANAGEMENT VERSION 2, Addison-Wesley, 1990);
·
Mike Stonebraker (one of the earliest attempts, so far
as I'm aware--see his paper on the subject in the 1975 ACM SIGMOD Conference
Proceedings);
·
Jeff Ullman and Jennifer Widom (see their book A
FIRST COURSE IN DATABASE SYSTEMS, Prentice Hall, 1997);
·
Ralph Kimball (see his articles There Are No
Guarantees and Enforcing the Rules in Intelligent Enterprise
3, Nos. 11 and 12, August 1st and 18th, 2000, respectively);
and several others.
Note: In this connection,
I simply can't resist mentioning in particular the book UNDERSTANDING THE
NEW SQL: A COMPLETE GUIDE, by Jim Melton and Alan R. Simon (Morgan
Kaufmann, 1993). That book has a
chapter entitled "Constraints, Assertions, and Referential
Integrity." What would you think of a book on biology that had a chapter
entitled "Birds, Feathered Bipeds, and Woodpeckers"? The parallel is
exact. In fairness, I should perhaps say that the SQL standard itself might be
partly to blame here, inasmuch as it uses the keyword CONSTRAINT in connection
with some constraints and the keyword ASSERTION in connection with others. (Most constraints can be
specified using
either the CONSTRAINT style or the ASSERTION style, but some have to use
the CONSTRAINT style and some have to use the ASSERTION style. I have no idea why this is.)
I'd like to go further and offer
evidence that suggests that a fair number of database professionals, including
in particular authors of textbooks, don't seem even to appreciate the
fundamental nature and importance of integrity in a database context. A quick and admittedly not
very scientific
survey of a whole shelfload of database textbooks--37 in all, including
essentially all of the best-known ones--reveals the following: (It
wouldn't have been appropriate to include any of my own books in the survey,
and I didn't. I also didn't include Fabian Pascal's book PRACTICAL ISSUES IN
DATABASE MANAGEMENT, since I reviewed and commented on that
book in draft form and also wrote a foreword for it, and might therefore be
accused of bias. Let me simply note for the record that Fabian's book, unlike
most of the others, does include a chapter on integrity.
·
Only one book had an entire chapter devoted to the
topic of integrity--and even there I have severe reservations about the quality
of the treatment. Note: At first glance it looked as if three others
had a whole chapter on the subject, too, but closer examination revealed that
one of those books was using the term to refer to normalization issues only,
while the other two were taking it to refer, not to integrity in its usual
sense at all, but rather to locking and concurrency control issues. Caveat lector!
·
Most of the books examined didn't even mention
integrity in a chapter title at all, and those few that did tended to bundle it
with other topics in what seems to me a very haphazard fashion
("Integrity, Views, Security, and Catalogs" is a typical example).
·
I couldn't find a good explanation or definition of the
concept, let alone the kind of emphasis I think the concept deserves, in any of
the books at all.
With all of the foregoing in mind,
I offer this brief tutorial as an aid to clarification and precise thinking in
this potentially confusing area.
Let me begin by observing that,
logically speaking, a database is a variable--a large and complicated
variable, perhaps, but a variable nonetheless. Certainly it's something that
has different values at different times, and that's more or less the definition
of a variable. Thus, the operation of "updating the database" causes
the current value of the variable that's the database in question to be
replaced by another value. And, of course, the values I'm talking about here
are both database values, and the variable is a database variable.
In other words, the crucial distinction between values and variables that I've
discussed in general terms in this occasional series of articles several times
already-- e.g. Is a Circle an Ellipse? (forthcoming on this site)
applies to databases in particular.
Likewise, the relations in a given
database are also variables: They too have different values at different times.
In fact, in THE THIRD MANIFESTO,
Hugh Darwen and I explicitly introduced the term relation variable,
distinguishing it from the term relation value (relation values are the
kinds of values that relation variables are permitted to have). Historically, of course, we've
typically
used the same term, 'relation' (or, in SQL contexts, 'table'), for both
concepts; but this usage has demonstrably led to a lot of confusion in the
past, and so we decided in THE THIRD MANIFESTO that we really did need
two different terms. For simplicity, however, we also decided to allow those
terms to be abbreviated whenever it seemed safe to do so: relation variable to
relvar, and relation value to just relation (exactly as we
conventionally allow, e.g., the term 'integer value' to be abbreviated to just
'integer').
Note: In the same kind of way, it would make good
logical sense to talk
explicitly of database values and database variables (probably
allowing those terms to be abbreviated to databases and dbvars,
respectively); a database variable would be a container for a collection of
relation variables (only!), and a database value would be the value of a
database variable at some particular time.
Slightly against my better judgment, however, in this article I've
decided to stick to the simple and familiar term 'database' for both
concepts. But I'll definitely reserve
the term 'relation' to mean a relation value specifically, and I'll use the
term 'relvar' whenever I mean a relation variable specifically.
With all of that preamble out of
the way, let's get down to the subject at hand. As usual, I'll base my examples on the suppliers
and parts
database, so let me first give an appropriate definition for that database.
(That definition--which should, I hope, be more or less self-explanatory--is expressed
in a slightly simplified form of Tutorial D, which is a language Hugh
Darwen and I use as a basis for examples in our THIRD MANIFESTO book and
elsewhere. For simplicity, I've omitted
the underlying type (or domain) definitions, showing only the three
relvar definitions--but I'm going to assume, where it makes any difference,
that types INTEGER (integers) and CHAR (character strings of arbitrary length)
are built in or system-defined types, while types S#, NAME, P#, COLOR, WEIGHT,
and QTY are user-defined types.
VAR S
RELATION
{S# S#,
SNAME NAME,
STATUS INTEGER,
CITY CHAR}
KEY {S#};
VAR P RELATION
{P# P#,
PNAME NAME,
COLOR COLOR,
WEIGHT WEIGHT,
CITY CHAR}
KEY {P#};
VAR SP
RELATION
{S# S#,
P# P#,
QTY QTY}
KEY {S#,P#}
FOREIGN KEY {S#} REFERENCES S
FOREIGN KEY {P#} REFERENCES P;
Now we can start to look at
integrity constraints as such. In the
most general terms, an integrity constraint is a constraint on the values some
given variable is permitted to assume.
Thus, the very fact that a given variable is defined to be of some given
type represents an a priori constraint on the variable in question--the
values that can be assumed by that variable obviously have to be values of that
type. And, as a special case of the
foregoing, the very fact that each attribute of a given relvar is defined to be
of some given type represents an a priori constraint on the relvar in question. For example, relvar
S (suppliers) is
constrained to contain values that are relations in which every S# value is a
supplier number (a value of type S#), every SNAME value is a name (a value of
type NAME), and so on.
However, these simple a priori
constraints are certainly not the only possible ones; in fact, none of the
examples given at the beginning of this article was an a priori constraint in
the foregoing sense. Let's take a
closer look at the first of those earlier examples:
1. Every supplier status value is in the range 1 to
100
inclusive.
Here's an alternative way of
saying more or less the same thing: (My
reason for giving this alternative statement should become clear in a few
moments.
If s is a supplier, then s
has a status value in the range 1 to 100 inclusive
Now, the foregoing
natural-language statements are obviously quite informal. A more formal and precise formulation
might
look something like this:
FORALL s#*S#, sn*NAME,st*INTEGER,sc*CHAR
(IF {S# s#,SNAME sn,STATUS
st,CITY sc}*S
THEN st*1 AND st*100)
Before I try to explain this
expression in detail, let me say that--for reasons that are beyond the scope of
this article--the syntax I'll be using in my formal examples is not exactly Tutorial
D syntax as such, though it is close to a relational calculus version
of that syntax. In particular, it makes
use of the truth-valued operators (the technical term is quantifiers)
FORALL and EXISTS. To elaborate
briefly:
·
The expression FORALL x (p) evaluates to true if and
only if the truth-valued expression p evaluates to true for all values of x.
·
The expression EXISTS x (p) evaluates to true if and
only if the truth-valued expression p evaluates to true for at least one value
of x.
Note: The
"x" in both FORALL x (p) and EXISTS x (p)
is an example of what's called a bound variable. I'll have more to say about bound variables
in Parts II and III of this article.
My syntax also makes use of the
truth-valued operator "*", which can be read as 'is a member of' or 'belongs
to', or simply 'in' (it does resemble SQL's IN operator, somewhat).
Now, I think the various formal
examples (the one shown above and all of the others to follow) are all pretty
much self-explanatory; however, if you'd like further explanation of FORALL,
EXISTS, "*", and related matters, then I refer you to my book AN INTRODUCTION TO
DATABASE SYSTEMS.
Back to the example. It follows from all of the above
that the
formal version of the constraint can be read as follows (in rather stilted
English):
For all supplier numbers s#
and all names sn and all integers st and all character strings sc,
if a tuple with S# = s# and SNAME = sn and STATUS = st and
CITY = sc appears in the suppliers relvar, then st is greater
than or equal to 1 and less than or equal to 100.
Perhaps you can now see why I gave
that alternative natural-language statement of the constraint when I originally
began to discuss this example. The fact
is, that alternative natural-language statement, the formal and precise
expression, and the stilted English version all have a certain overall
"shape," as it were, that looks something like this: (I don't mean to
say that all constraints have this particular shape, just that many of
them do (see Example 3, later, for an example of one that doesn't).
IF a certain tuple appears
in a certain relvar, THEN that tuple satisfies a certain condition.
This "shape" is an
example of what's known in logic as an implication (sometimes a logical
or material implication). In
general, an implication takes the form
IF p
THEN q
where p
and q are truth-valued expressions, called the antecedent and the
consequent, respectively.
Like the antecedent and the consequent, the overall implication is
itself a truth-valued expression; it evaluates to false if p is true
and q is false, and to true otherwise (in other words,
it's logically equivalent to the expression (NOT p) OR q).
By the
way, note how the foregoing shape implies the FORALL
quantification--"IF a certain tuple appears" means, implicitly,
"FORALL tuples that do appear."
Suppose,
then, that some given constraint does in fact have the foregoing shape. Then it's specifically the
"certain
condition" that appears in the consequent of the implication that most
people would regard, informally, as the integrity constraint per se. In SQL, for example,
the constraint under
discussion--"If s is a supplier, then s has a status value
in the range 1 to 100 inclusive"--would typically be expressed as
follows:
CHECK (STATUS BETWEEN 1 AND 100)
This
expression would appear as part of the definition of relvar S, or what SQL
would call "base table" S.
And it does strongly suggest that the "condition"--i.e., the
expression between parentheses--is the whole of the constraint (after all, it
does say that's what's to be checked, doesn't it?).
However,
while the foregoing perception (that the "condition" is whole of the
constraint) might perhaps be un-objectionable in informal contexts, and
possibly even desirable for human factors reasons, it shouldn't be allowed to
obscure the fact that there's really a lot more going on. That is, the SQL expression shown above
must
be understood as shorthand for something like the more explicit, and more
complete, formal expression shown earlier.
(As an
aside, I note that SQL doesn't actually include any direct support for logical
implication--i.e., for truth-valued expressions of the form IF p THEN q. This fact
might go some way toward explaining
why the SQL formulation of the constraint takes the form it does.)
To
conclude Part I of this article, let's take a look at the rest of the
introductory examples. I'll show a
precise formulation in each case and offer some comments where I think they're
needed, but I'll leave the "stilted English" and SQL formulations to
you. Please note that I make no claim
that the precise formulations shown are unique, or even that they're as simple
as possible; I claim only that they're correct. Note too that each example does illustrate
at least one new point.
2. Every
supplier in London has status 20.
FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR
(IF {S# s#,SNAME sn,STATUS
st,CITY sc}*S
THEN (IF sc =
'London'
THEN st =
20))
In this example, the consequent of
the implication takes the form of an implication itself.
3. If there are any parts at all,
at least one of them is blue.
IF
EXISTS
p#*P#,pn*NAME,pl*COLOR,pw*WEIGHT,pc*CHAR
({P# p#,PNAME pn,COLOR
pl,WEIGHT pw,CITY pc}*P)
THEN
EXISTS
p#*P#,pn*NAME,pl*COLOR,pw*WEIGHT,pc*CHAR
({P# p#, PNAME pn,COLOR
pl,WEIGHT pw,CITY pc}*P
AND pl = COLOR
('Blue'))
The expression "COLOR
('Blue')" in the last line but one here is a COLOR literal. Note that we can't just say
"at least
one part is blue"--we do have to worry about the case where there aren't
any parts at all.
I remark in passing that SQL tries
to "help" the user with the formulation of constraints of this kind
but (predictably enough) makes a hash of things ... To be specific, if the clause
CHECK(q) is specified as part of the definition of base table BT,
but BT happens to be empty, then SQL regards the constraint as satisfied no
matter what form the truth-valued expression q takes--even if it takes the
form "BT must not be empty"! (or even the form "BT must have
cardinality -5," or the form "1 = 0," come to that).
4. No two distinct suppliers have
the same supplier number.
FORALL x#*S#,xn*NAME,xt*INTEGER,xc*CHAR,
y#*S#,yn*NAME,yt*INTEGER,yc*CHAR
(IF {S# x#,SNAME xn,STATUS
xt,CITY xc}*S AND
{S# y#,SNAME yn,STATUS
yt,CITY yc}*S
THEN (IF x# = y#
THEN xn = yn
AND xt = yt AND xc = yc))
This expression is just a formal
statement of the fact that {S#} is a candidate key for suppliers; thus,
candidate key constraints are just a special case of constraints in general (It
would be more accurate to say that it's a formal statement of the fact that
{S#} is a superkey for suppliers.
A superkey is a superset--not necessarily a proper superset, of
course--of a candidate key; thus, all candidate keys are superkeys, but some
superkeys are not candidate keys. But I
don't want to get too far off on this particular tangent here). The Tutorial
D syntax KEY {S#} might be regarded
as shorthand for the foregoing more long winded expression. Note: In practice, of course,
our list of constraints ought also to
include an analogous one for shipments, saying that no two distinct shipments
have the same supplier number and the same part number, but I've omitted that
one for simplicity.
5. Every shipment involves an
existing supplier.
FORALL s#*S#,p#*P#,q*QTY
(IF {S# s#,P# p#,QTY
q}*SP
THEN EXISTS sn*NAME,st*INTEGER,sc*CHAR
({S# s#,SNAME
sn,STATUS st,CITY sc}*S))
This expression is a formal
statement of the fact that {S#} is a foreign key for shipments, matching the
candidate key {S#} for suppliers; thus, foreign key constraints too are just a
special case of constraints in general.
Note: In practice, of
course, our list of constraints ought really to include one to say that every
shipment involves an existing part as well, but again I've omitted that one for
simplicity.
This example is the first we've
seen to involve two distinct relvars (SP and S, in the example)--all the
previous examples involved just one. In
the THIRD MANIFESTO
book and elsewhere, I've used the term relvar constraint for a
constraint involving exactly one relvar and the term database constraint for
a constraint involving more than one.
However, while that distinction might be useful from a pragmatic
standpoint, it's not very important from a theoretical one, and I won't bother
with it much in the rest of this article.
6. No supplier with status less
than 20 supplies any part in a quantity greater than 500.
FORALL s#*S#,sn*NAME,st*INTEGER,sc*CHAR,
p#*P#,q*QTY
(IF {S# s#,SNAME sn,STATUS
st,CITY sc}*S AND
{S# s#,P# p#,QTY
q}*SP
THEN st*20 OR
QTY*500)
Like the previous example, this
one too involves two distinct relvars (S and SP again, of course), but it's
certainly not a foreign key constraint.
(To be continued in Part 2).
Posted 05/03/02