In Parts 1 and 2 of this article,
I showed among other things that each relvar has a relvar predicate and the
overall database has a database predicate. Now, I hope it's obvious that
the predicates in question are all ones that are "understood by the
DBMS": They're stated formally (they're part of the database
definition, in fact), and of course they're enforced by the DBMS,
too. For such reasons, it's convenient to refer to the predicates in
question, on occasion, as internal predicates specifically--principally
because relvars and databases also have external predicates, which I
want to discuss next.
The first and most significant
point is that, while internal predicates are a formal construct,
external predicates are an informal construct merely. Internal predicates are (loosely)
what
the data means to the DBMS; external predicates, by contrast, are what the
data means to the user. Of
course, users have to understand the internal predicates as well as the
external ones (why?), but, to repeat, the DBMS has to understand--indeed, can
only understand--the internal ones. In
fact, we might say, loosely, that a given internal predicate is the DBMS's
approximation to the corresponding external predicate (see later).
Now let's concentrate on relvars
specifically (I'll come back and deal with the overall database later). As just
suggested, the external predicate for a given relvar is basically what that
relvar means to the user. In the
case of the suppliers relvar S, for example, the external predicate might look
something like this:
The supplier with the specified
supplier number (S#) has the specified name (SNAME) and the specified status
(STATUS) and is located in the specified city (CITY). Moreover, the status value is in the range 1
to 100 inclusive,
and must be 20 if the city is London.
Also, no two distinct suppliers have the same supplier number.
For the sake of the discussion
that follows, however, I'm going to replace this predicate by the following
simpler one:
Supplier S# is named SNAME, has
status STATUS, and is located in CITY.
After all, the external predicate
is (as already stated) only informal, so we're at liberty to make it as simple
or as complex as we like.
Now, note that the foregoing
statement is indeed a predicate (i.e., it's a truth-valued function): It has four parameters,
viz., S#,
SNAME, STATUS, and CITY, corresponding to the four attributes of the relvar,
and when arguments of the appropriate types are substituted for those
parameters, we obtain a proposition, or something that's categorically either true
or false.
Note: You might notice that I'm
using the term 'parameter' now in a sense slightly different from--though not
violently different from--the sense in which I was using it in Part II of this
article. In Part II, parameters (and
corresponding arguments) denoted whole relations; now, they denote individual
attribute values. In fact, the
parameters I'm talking about now correspond very closely to those bound
variables that appeared in the formal constraints I showed in Part I of
this article.
Each tuple appearing in relvar S
at any given time can now be regarded as denoting a certain proposition,
obtained by instantiating the foregoing external predicate. As already noted, the instantiation
involves
substituting arguments of the appropriate type for the parameters of the
external predicate. For example, if we
substitute the S# value S1, the NAME value Smith, the INTEGER value 20, and the
CHAR value London for the relevant parameters, we obtain the proposition
Supplier S1 is
named Smith, has status 20, and is located in London.
Likewise, if we substitute the S#
value S2, the NAME value Jones, the INTEGER value 10, and the CHAR value Paris,
we obtain
Supplier S2 is
named Jones, has status 10, and is located in Paris.
And so on.
To repeat, tuples of the relvar
correspond to instantiations of the external predicate. And--very important!--those particular
instantiations (i.e., those particular propositions) are ones that are
understood by convention to be ones that evaluate to true. For example, if the tuple
{S# S#('S1'),SNAME NAME('Smith'),STATUS 20,CITY 'London'}
indeed does appear in relvar S at
some time, then we're to understand that it's a "true fact" that
there does exist at that time a supplier with supplier number S1, named Smith,
with status 20, and located in London (Yes, I know that facts are true by
definition and "true fact" is a pleonasm, but--like Edward Abbey--I
like the emphasis). More generally:
IF (s * S) = true
THEN XPS (s) = true
Here:
·
s is a tuple of the form { S# s#, SNAME sn,
STATUS st, CITY sc }, where s# is a value of type S#, sn
is a value of type NAME, st is a value of type INTEGER, and sc is
a value of type CITY);
·
XPS is the external predicate for suppliers;
·
XPS(s) is the proposition obtained by
instantiating XPS with the argument values S# = s#, SNAME = sn,
STATUS = st, and CITY = sc.
However, we go a step further with
external predicates than we do with internal ones. To be specific, we adopt the Closed World
Assumption,
which says that if an otherwise valid tuple does not appear in the
relvar at some time, then the corresponding proposition is understood by
convention to be one that evaluates to false at that time. For example, if the tuple
{S# S#('S6'),SNAME NAME('Lopez'),STATUS 30,CITY 'Madrid'}
does not appear in relvar S
at some particular time, then we're to understand that it's not the case that
there exists at that time a supplier with supplier number S6, named Lopez, with
status 30, and located in Madrid. More
generally:
IF NOT ((s
* S) = true) THEN NOT (XPS (s) = true)
Or more succinctly:
IF NOT (s
* S) THEN NOT XPS (s)
Putting the foregoing together, we
have simply:
s * S * XPS
(s)
In words: A given tuple appears in a given relvar at a
given time if and only if that tuple causes that relvar's external
predicate to evaluate to true at that time (the symbol "*" can
be read as "if and only if" or "is equivalent to"). It follows that a given
relvar contains all
and only the tuples that correspond to true instantiations of
that relvar's external predicate (at the particular time in question).
I've said that external
predicates, and the propositions obtained by instantiating such predicates,
aren't--and in fact can't be--"understood by the DBMS." There's no way, for example, that
the DBMS
can know what it means for a "supplier" to "be located"
somewhere, or what it means for a "supplier" to "have a status"
(etc.). All such issues are matters of interpretation;
they're matters that make sense to the user but not to the system. For example, if the supplier
number S1 and
the city name London happen to appear together in the same tuple, then the user
can interpret that fact to mean that supplier S1 is located in London, but (to
repeat) there's no way the system can do anything analogous.
What's more, even if the DBMS could
know what it means for a supplier to be located somewhere, it still couldn't
know a priori whether what the user tells it is true! If the user tries to assert to the
system
(e.g., by executing an INSERT operation) that supplier S1 is located in London,
there's no way for the DBMS to know whether that assertion is true. All the DBMS can do is make
sure that the
user's assertion doesn't lead to any constraint violations (i.e., it
doesn't cause any internal predicate to evaluate to false). Assuming it doesn't, then the
DBMS must
accept the user's assertion and treat it as true from that point forward
(or, rather, until the user tells the DBMS--typically via a corresponding
DELETE operation--that it isn't true any more, of course).
By the way, the foregoing
paragraph shows clearly why the Closed World Assumption doesn't apply to
internal predicates. To be specific, a
tuple might satisfy the internal predicate for a given relvar and yet validly
not appear in that relvar, because it doesn't correspond to a true proposition
in the real world.
We can summarize all of the above
by saying that, informally, the external predicate for a given relvar is the
intended interpretation for that relvar.
As such, it's important to the user, but not (very) important to the
DBMS. We can also say, again
informally, that the external predicate for a given relvar is the criterion
for acceptability of updates on the relvar in question--i.e., it dictates, at least in
principle, whether a
particular INSERT or DELETE or UPDATE operation on that relvar can be allowed
to succeed. Ideally, therefore, the
DBMS would know and understand the external predicate for every relvar, so that
it could deal correctly with all possible attempts to update the database. Unfortunately, however,
we've already seen
that this goal is unachievable; that is, the DBMS cannot know any given
external predicate. But it does know
a good approximation: It knows the
corresponding internal predicate--and that's what it will enforce. (Thus, the
pragmatic "criterion
for acceptability of updates," as opposed to the ideal one, is the internal
predicate, not the external one.)
Another way of saying the same thing is this:
The DBMS can't enforce truth,
only consistency.
That is, the DBMS can't guarantee
that the database contains only true propositions--all it can do is guarantee
that it doesn't contain anything that causes any integrity constraint to be
violated (i.e. it doesn't contain any inconsistencies). Sadly, truth and consistency aren't the
same
thing! Indeed, we can observe
that:
·
If the database contains only true propositions, then
it's consistent, but the converse isn't necessarily so.
·
If the database is inconsistent, then it contains at
least one "false fact," but the converse isn't necessarily so.
More succinctly: Correct implies consistent
(but
not the other way around), and inconsistent implies incorrect
(but not the other way around). By my
use of the term correct here, of course, I mean to say that the database
is correct if and only if it fully reflects the true state of affairs in the
real world.
I've said that external predicates
are important to the user. Why? Well, because (as previously stated) those
external predicates represent the intended interpretation for the corresponding
relvars. The point is this: Clearly, each relvar is supposed to represent
some aspect of the real world--in other words, each relvar is supposed to have
a meaning--and users need to know those meanings (i.e., those intended
interpretations) if they're to be able to use the database effectively and
efficiently. That's all.
There's another point to be made
here, though. The fact is, all this
business of external predicates is very closely related to the business of
(logical) database design.
Indeed, at a certain rather high level of abstraction (and ignoring the
fact that, in practice, the process is highly iterative, as we all know), we
can characterize the design process as consisting of the following two
steps:
1.
Pin down the external predicates as carefully as
possible--albeit informally, of course.
2.
Map the output from the first step into a set of formal
relvars and corresponding internal predicates.
Which brings me to a small piece
of unfinished business. I've now
discussed external relvar predicates at some length. What about the overall database? Well,
as I'm sure you've guessed, the
external predicate for the overall database--which represents the intended
interpretation for that database--is simply the conjunction (i.e., the logical
AND) of the external predicates for all of the relvars in that database.
There's another important point
that I've been ducking so far but I now need to spell out, as follows: Basically, everything I've
been saying
applies to all relvars, not just to what THE THIRD MANIFESTO
calls real ones (i.e., what SQL would call "base
tables"). In particular, it
applies to views ("virtual relvars," as THE THIRD MANIFESTOcalls them).
Thus, views too are
subject to constraints, and they have relvar predicates, both internal and
external. For example, suppose we
define a view--let's call it SST--by projecting the suppliers relvar over
attributes S#, SNAME, and STATUS (thereby effectively removing attribute CITY). Then the
external predicate for view
SST looks something like this:
There exists some city CITY such
that supplier S# is named SNAME, has status STATUS, and is located in CITY.
Observe that, as required, this
predicate does have three parameters, not four, corresponding to the three
attributes of relvar SST (CITY is now no longer a parameter but a bound
variable instead, thanks to the fact that it is quantified by the phrase
"there exists some city").
Another, perhaps clearer, way of making the same point is to observe
that the predicate as stated is logically equivalent to this one:
Supplier S# is
named SNAME, has status STATUS, and is located in some city.
This version of the predicate very
clearly has just three parameters.
What about the internal
predicate? Here again are the six
constraints from Part I of this article:
1.
Every supplier status value is in the range 1 to 100
inclusive.
2.
Every supplier in London has status 20.
3.
If there are any parts at all, at least one of them is
blue.
4.
No two distinct suppliers have the same supplier number.
5.
Every shipment involves an existing supplier.
6.
No supplier with status less than 20 supplies any part in a
quantity greater than 500.
Of these six, the third is clearly
irrelevant as far as view SST is concerned, since it has to do with parts, not
suppliers. As for the rest, each of
them does also apply to view SST, but in a slightly modified form. Here, for example, is the
modified form of
Number 5:
FORALL s#*S#,p#*P#,q*QTY
(IF {S# s#,P# p#,QTY
q}*SP
THEN EXISTS sn*NAME,st*INTEGER
({S# s#,SNAME sn,STATUS
st}*SST))
The changes are in the third and
fourth lines: All references to CITY
have been dropped, and the reference to S has been replaced by a reference to
SST. Note that we can regard this
constraint for SST as being derived from the corresponding constraint
for S, much as relvar SST itself is derived from relvar S. Note too that (to use SQL terminology)
this
constraint for SST is effectively a foreign key constraint from a base table to
a view!
Analogous remarks apply directly
to Numbers 1, 2, 4, and 6. Number 2 is
slightly tricky, though. Here it
is:
FORALL
s#*S#,sn*NAME,st*INTEGER
(IF {S# s#,SNAME sn,STATUS st}*SST
THEN EXISTS sc * CHAR
({S# s#,SNAME sn,STATUS st,CITY
sc}*S AND
(IF sc = 'London'
THEN st = 20))
Again, however, we can regard this
constraint as being derived from the corresponding constraint for S.
In Part I of this article, I
mentioned that several people have tried to come up with a taxonomy or
classification scheme for constraints.
In particular, I indicated that Hugh Darwen and I follow a particular
classification scheme of our own in THE THIRD MANIFESTO.
I'd like to sketch that scheme briefly here.
In essence, we divide constraints into database, relvar, attribute, and type
constraints. In outline:
·
A database constraint is a constraint on the values a
given database is permitted to assume;
·
A relvar constraint is a constraint on the values a
given relvar is permitted to assume;
·
An attribute constraint is a constraint on the values a
given attribute is permitted to assume;
·
A type constraint is, precisely, a definition of the
set of values that constitute a given type.
Earlier parts of this article
didn't really discuss type constraints at all. Such constraints are conceptually very
simple, however. For example, we might have a type constraint
for type POINT (meaning points in two-dimensional space) that says in effect
that something is a valid point if and only if it can be represented by a pair
of numeric values x and y such that x and y are
both in the range (say) -100.0 to +100.0.
Type constraints are checked as part of the execution of the
corresponding selector. I should
explain that selectors are operators that allow us to "select" (or
specify) an arbitrary value of the type in question. For example, here are a couple of examples of
selector
invocations for type POINT:
POINT (5.0, 2.5)
/* selects the point with x = 5.0, y = 2.5 */
POINT (xxx, yyy )
/* selects the point with x = xxx, y = yyy--*/
/* xxx and yyy are variables of type numeric */
As you can see, selectors--or,
more precisely, selector invocations--are a generalization of the
familiar concept of a literal. (All
literals are selector invocations, but not all selector invocations are
literals. More precisely, a selector
invocation is a literal if and only if all of its operands are literals in
turn.)
Second, attribute constraints
are basically what I called "a priori constraints" in the first
part of this article; in other words, an attribute constraint is basically just
a statement to the effect that a specified attribute of a specified relvar is
of a specified type. For example,
consider the following relvar definition once again:
VAR s RELATION
{s# s#,
sname name,
status INTEGER,
city CHAR}
KEY {s#};
In this relvar:
· Values
of attribute S# are constrained to be of type S#;
· Values
of attribute SNAME are constrained to be of type NAME;
· Values
of attribute STATUS are constrained to be of type INTEGER;
· Values
of attribute CITY are constrained to be of type CHAR.
Finally, relvar and database
constraints are what I've been concentrating on throughout most of this
article. Just to remind you, a relvar
constraint is a constraint that involves exactly one relvar, while a database
constraint is a constraint that involves two or more relvars. As noted in Part I, however, the
distinction
between these two kinds of constraints isn't very important from a theoretical
point of view (though it might sometimes be useful from a pragmatic one). In particular, relvar and
database
constraints are both checked "immediately," as I explained in detail
in Part II. Of course, the same is
effectively true for type and attribute constraints as well.
Finally, I note that transition
constraints are subsumed by the foregoing scheme. A transition constraint is a constraint on
the legal transitions
that a given relvar or database can make from one value to another (for
example, "amount of tax due can never decrease"). Provided we have a way to refer within
a
single expression to both (a) the value of the relvar or database in question before
any given update and (b) the value of that same relvar or database after
that update, then we have the means to formulate any required transition
constraint.
I'd like to finish up this article
by offering a slightly more formal perspective on some of what I've been
saying. I said in Part II that a
database is a collection of true propositions.
In fact, a database, together with the operators that apply to the
propositions in that database, is a logical system. And when I say "logical
system"
here, I mean a formal system--like euclidean geometry, for example--that
has axioms ("given truths") and rules of inference by
which we can prove theorems ("derived truths") from those
axioms. Indeed, it was Ted Codd's very
great insight, when he first invented the relational model back in 1969, that a
database isn't really just a collection of data (despite the name);
rather, it's a collection of facts, or what the logicians call true
propositions. Those propositions--the
given ones, which is to say the ones in the "base tables"--are the axioms
of the logical system under discussion.
And the inference rules are essentially the rules by which new
propositions are derived from the given ones; in other words, they're basically
the operators of the relational algebra. Thus, when the DBMS evaluates some relational
expression (in
particular, when it's responding to some query), it's really deriving new
truths from given ones; in effect, it's proving a theorem!
Once we recognize the foregoing,
we see that the whole apparatus of formal logic becomes available for use in
attacking "the database problem."
In other words, questions such as
·
What should the database look like to the user?
·
What should the query language look like?
·
How should results be presented to the user?
·
How can we best implement queries (or, more generally,
evaluate database expressions)?
·
How should we design the database in the first place?
(and probably others) all become,
in effect, questions in logic that are susceptible to logical treatment and can
be given logical answers.
Of course, it goes without saying
that the relational model supports the foregoing perception of what databases
are all about very directly--which is why, in my opinion, the relational model
is rock solid, and "right," and will endure. It's also why, again in my opinion, other
"data models" are simply not in the same ball park as the relational
model ... Indeed, I seriously question whether those other
"models" deserve to be called models at all, in the same sense that
the relational model can be called a model.*
Certainly most of them are ad hoc to a degree, instead of being
firmly founded, as the relational model is, in set theory and formal
logic. In this connection, I'd like to
refer you to two other articles of my own, Why 'The Object Model' Is Not a
Data Model and Object Identifiers vs. Relational Keys,and
especially to an article by Hugh Darwen, What a Database Really Is:
Predicates and Propositions(this last is an excellent and highly
readable tutorial on the idea of "database plus operators = a logical
system"). All of these articles
are included in the book RELATIONAL DATABASE
WRITINGS 1994-1997, by Hugh Darwen, David McGoveran, and
myself.
Finally, given that a database
together with the relational operators is indeed a logical system, I'd like to
come back to the primary topic of this article and reemphasize the absolutely
vital importance of integrity constraints.
If the database is in violation of some integrity constraint, then the
logical system we're talking about is inconsistent. And, as I claimed in Part II of this article,
you can get absolutely
any answer at all from an inconsistent system. Let me conclude by demonstrating this fact.
Suppose the system in question is such that
it implies that both p and NOT p are true (there's the
inconsistency), where p is some proposition. Now let q be some arbitrary
proposition. Then:
·
From the truth of p, we can infer the truth of p
OR q.
·
From the truth of p OR q and the truth of
NOT p, we can infer the truth of q.
But q was arbitrary! It follows that any
proposition whatsoever
can be shown to be true in an inconsistent system.
Posted 05/17/02