As tedious as a
twice-told tale.
--William
Shakespeare
Preliminaries
What follows is far from new; I just feel, sadly, that it
needs to be aired still one more time... The immediate cause for my feeling
this way was a letter I received recently from a colleague in Finland, Lauri
Pietarinen, which read in part as follows:
"I just came back from the [September 2001] VLDB
Conference in Rome ... I was talking to <name suppressed to protect
the guilty> about why we should stick to "real" relations
instead of the current tuple-bag implementations ... It was his opinion that
from the optimization point of view we have nothing to gain from using
relations as compared to the current situation. He referred to the book DATABASE
SYSTEM IMPLEMENTATION by Hector Garcia-Molina, Jeffrey D. Ullman, and
Jennifer Widom, where he said a tuple-bag algebra had been presented ... Could
you please summarize the reasons for using relations ... I know you have
addressed this matter in numerous writings, but just to summarize the issue (?)
... The point of view mainly of interest was that of optimization (that is, can
we do more optimization with relations?)."
Well, it's certainly true that I've "addressed this
matter in numerous writings" -- and I really don't think I can do better
to "summarize the issue" than repeat some material that originally
appeared in my regular column for Database Programming & Design, in
Installments Nos. 17 (Toil and Trouble DBP&D 7, No. 1, January 1994)
and 34 (And Cauldron Bubble" DBP&D 8, No. 6, June 1995).
(These two installments were republished-- the second under its original title The
Department of Redundancy Department--by permission of Miller Freeman Inc.
in my books RELATIONAL DATABASE
WRITINGS 1991-1994 and RELATIONAL
DATABASE WRITINGS 1994-1997, respectively.) The two-part article
that follows, therefore, is for the most part just a lightly edited version of
those two installments. However, it does also contain some new material,
including in particular an appendix to the second part that discusses the
tuple-bag algebra of Garcia-Molina et al. in some detail. As requested, that
appendix does consider the implications of that algebra for optimization.
Note:As we'll see from
that discussion--as well as from material in Part 1 of the article--the answer
to Lauri Pietarinen's question ("Can we do more optimization with
relations?") is very definitely yes.
Introduction
I've stated in many places and on many occasions that
duplicate rows (hereinafter usually abbreviated to just duplicates) are, and
always were, a mistake in SQL. In fact, it's my position that (a) duplicate
rows should never have been permitted in the first place, but (b) given that
they were permitted, they ought to be avoided in practice. In what
follows, I give some solid reasons for this position.
The Cat Food Example
The following extract from an article by David Beech (New
Life for SQL, Datamation, February 1st, 1989) is typical of the
arguments that are advanced by duplicate-row advocates.
“For example, the row "cat food 0.39" could appear
three times [on a supermarket checkout receipt] with a significance that would
not escape many shoppers ... At the level of abstraction at which it is useful
to record the information, there are no value components that distinguish the
objects. What the relational model does is force people to lower the level of
abstraction, often inventing meaningless values to be inserted in an extra
column whose purpose is to show what we knew already, that the cans of cat food
are distinct.”
Apart from the remark regarding "lowering the level of
abstraction," which I think is just arm-waving, this seems to me to be
exactly the straw-man argument I gave in my article Why Duplicate Rows Are
Prohibited (in RELATIONAL DATABASE
WRITINGS 1985-1989, under the heading Why Duplicates Are
Good (?) -- to wit:
1. Duplicates occur naturally in practice.
2. Given that this is so, it's a burden on the user to have
to invent some artificial identifier in order to distinguish between them.
Note: If anything, in fact, the relational representation raises
the level of abstraction, because it eliminates irrelevant details--details,
that is, that have to do merely with the way data is presented to the
end user. After all, the checkout
receipt is really just a report (it certainly isn't part of the
database!), and there's no particular reason why we should have to represent
data in the same way in the database and in a report; in fact, there are good
reasons not to. (Thanks to Fabian Pascal for these observations, and more
generally for his review of--and helpful comments on--an earlier draft of this
article.)
In a subsequent section of that same article, entitled Why
Duplicates Are Bad: The Fundamental Issue, I went on to refute this
argument as follows:
1.
Individual objects must be identifiable (that is, distinguishable
from all other objects) -- for if an object is not identifiable, then it's
impossible even to talk about it, let alone perform any kind of operation upon
it or use it for any sensible purpose. In other words, objects must have
identity. (Here and throughout this article, I use the term object in
its ordinary English sense, not in the special rather loaded sense in which
it's used in the OO world.)
2.
In a collection of objects in which there are no duplicates
(in other words, a mathematical set), objects obviously do have identity,
because they are in fact self-identifying. For example, in the set of
integers {3,6,8,11}, there's no ambiguity as to which element of the set is
"6" and which is "8" (etc.). However, in the collection
(3,6,6,8,8,8,11), which is certainly not a mathematical set (it is a multiset
or bag instead), we cannot make an analogous statement; both "6" and
"8" are now ambiguous.
3.
So what is the identification mechanism in a collection
that permits duplicates (in other words, a bag)? For example, in the bag just
shown, how can we distinguish the two "6"s from one another? Note
that there must still be an identification mechanism; if we cannot
distinguish the two "6"s from one another somehow, we cannot even
tell there are two of them (an illustration of what I believe is known in
the world of philosophy as the Principle of Identity of Indiscernibles). In
other words, we would not even know there were any duplicates in the
first place!
Now, a common reaction to this argument is "But I really
don't need to distinguish among the duplicates, all I want to do is be
able to count them." The point I'm trying to make is that you do need to distinguish
them, even just to count them (for otherwise how do you know which ones
you've already counted?). This point is crucial, of course, and I really don't
know how to make it any more strongly than I already have.
How then do we distinguish duplicates such as the two
"6"s in the bag shown above? The answer, of course, is that we do so
by their relative position; we say something like "this 6 is here
and that 6 is there," or "this one is the first 6
and that one is the second."
And so we have now introduced a totally new concept, one that
is quite deliberately omitted from the relational model: positional
addressing. Which means that we are now quite beyond the pale!--that is, we
have moved quite outside the framework of relational theory. Which means that
there is no guarantee whatsoever that any results that hold within that
framework still apply. For example, does JOIN still work? (As a matter of fact,
it doesn't.) What about UNION? EXTEND? SUMMARIZE? Are the theorems of
normalization still valid? What about the quantifiers EXISTS and FORALL? What
about the rules for functional dependency inference? What about expression
transformation and optimization? Etc., etc., etc.
Furthermore, we now definitely need certain additional
operators, such as "retrieve the nth row" or "insert this
new row here" or "move this row from here to there."
In my opinion--pace Beech--these operators constitute a much greater burden on
the user than does the occasional need to invent an artificial identifier.
The relational model, by contrast, adopts the position that,
since objects do have to be identifiable somehow, then we might as well
represent their identity in exactly the same way as everything else:
namely, by values in columns. (Especially as there will often be a
"natural" identifier that is usable as such a column value anyway,
which means that the problem of having to invent an artificial value might not
arise all that often in practice.) In this way we can stay securely within the
context of relational theory, and all of the desirable properties of that
theory will thus still be directly applicable.
To return for a moment to the cat-food argument, Beech goes
on to say:
“We are not being less than respectable mathematically if we
consider collections containing duplicates, because mathematicians deal with
such collections, called multisets or ... bags.”
The point here seems to be that the usual advantage claimed
for the relational model, to the effect that the model is at least
"mathematically respectable," can be claimed by the duplicate-row
advocates for their "model" too. But all of the mathematical
"bag theory" treatments I've ever seen start off by assuming there's
a way to count duplicates! And that assumption, I contend, effectively means
that bags are defined in terms of sets--each bag element really has a
hidden identifying tag that distinguishes it somehow, and the bag is really a set
of tag/element pairs. I see no advantage, and definite disadvantages, in
introducing this extra level of complexity.
Expression Transformation
The foregoing argument should be sufficient (I hope) to show
why I think it was a mistake for both theoretical and practical reasons--to
include duplicate support in SQL in the first place. Now I'd like to go on to
argue that, even though duplicates are supported, you should take care
to avoid them in practice. (The following argument is based on an example
originally due to Nat Goodman.)
The fundamental point is that--as I hinted in the previous
section--expression transformations, and hence optimizations, that are valid in
a relational context are not necessarily valid in the presence of
duplicates. Here's an example. Consider the database shown in Fig. 1.
Table 'P'
|
P#
|
PNAME
|
|
P1
|
Screw
|
|
P1
|
Screw
|
|
P1
|
Screw
|
|
P2
|
Screw
|
Table 'SP':
Fig. 1: A
database with duplicates
Perhaps I should begin by asking the question: What does it
mean to have three "(P1,Screw)" rows in table P and not two, or four,
or seventeen? It must mean something, for if it means nothing, then why are the
duplicates there in the first place? To paraphrase a point first nicely made by
Codd (in a live presentation): If something is true, saying it twice doesn't
make it any truer.
So let's assume that there is indeed some meaning attached to
the existence of duplicates, even though that meaning, whatever it is, is
hardly very explicit. (I note in passing, therefore, that duplicates contravene
another of the objectives of the relational model, namely the objective of
explicitness: that is, the meaning of the data should be as explicit as
possible. The presence of duplicates implies that part of the meaning is
hidden.) In other words, given that duplicates do have some meaning, there are
presumably going to be business decisions made on the basis of the fact that,
for example, there are three "(P1,Screw)" rows in table P and not two
or four or seventeen. (For if not, then why are the duplicates there in the
first place?)
Now consider the following query: "List part numbers for
parts that either are screws or are supplied by supplier S1, or both."
Here are some candidate SQL formulations for this query, together with the
output produced in each case.
Note: Thanks to Jim Panttaja for
checking these results for me, using Microsoft SQL Server Release 4.2a running
on OS/2. See also Chapter 4, Section 4b, of Fabian Pascal' s book PRACTICAL ISSUES IN DATABASE MANAGEMENT:
A
REFERENCE FOR THE THINKING PRACTITIONER, which reports on the
results of running a similar experiment. In the interests of accuracy, I should
add that a couple of the "candidate formulations" (which?) are
perhaps not true candidates anyway, inasmuch as they assume that every part
that is a screw is supplied by at least one supplier (thanks to Hugh Darwen for
this observation).
1.
SELECT p#
FROM p
WHERE pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1');
Result: P1*3,P2*1.
2.
SELECT p#
FROM sp
WHERE s# = 's1'
OR p# IN (SELECT p#
FROM p
WHERE pname = 'Screw');
Result:P1*2,P2*1.
3.
SELECT p.p#
FROM p,sp
WHERE (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';
Result:P1*9,P2*3.
4.
SELECT sp.p#
FROM p,sp
WHERE (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';
Result:P1*8,P2*4.
5.
SELECT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT p#
FROM sp
WHERE s# = 's1';
Result: P1*5,P2*2.
6.
SELECT DISTINCT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT p#
FROM sp
WHERE s# = 's1';
Result:P1*3,P2*2.
7.
SELECT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT DISTINCT p#
FROM sp
WHERE s# = 's1';
Result: P1*4,P2*2.
8.
SELECT DISTINCT p#
FROM p
WHERE pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1');
Result: P1*1,P2*1.
9.
SELECT DISTINCT p#
FROM sp
WHERE s# = 's1'
OR p# IN (SELECT p#
FROM p
WHERE pname = 'Screw');
Result: P1*1,P2*1.
10.
SELECT p#
FROM p
GROUP BY p#,pname
HAVING pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1');
Result: P1*1,P2*1.
11.
SELECT p.p#
FROM p,sp
GROUP BY p.p#,pname,s#,sp.p#
HAVING (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';
Result: P1 * 2, P2 * 2.
12.S
ELECT p#
FROM p
WHERE pname = 'Screw'
UNION
SELECT p#
FROM sp
WHERE s# = 's1';
Result: P1*1,P2*1.
The obvious first point to make is that the twelve different formulations
produce nine different results!--different, that is, with respect to
their degree of duplication. (I make no claim, incidentally, that either
the twelve different formulations or the nine different results are the only
ones possible; indeed, they aren't, in general.) Thus, if the user really cares
about duplicates, then he or she needs to be extremely careful in
formulating the query appropriately in order to obtain exactly the desired
result.
Furthermore, of course, analogous remarks apply to the system
itself: Because different formulations can produce different results, the
system optimizer too has to be extremely careful in its task of
expression transformation (that is, transforming one formulation into another).
In other words, duplicate rows act as a significant optimization inhibitor
(see Part 2 of my article Expression Transformationin RELATIONAL DATABASE
WRITINGS 1989-1991). Here are some implications of this point:
·
First, the optimizer code itself is harder to write,
harder to maintain, and probably more buggy--all of which combines to make the
product simultaneously more expensive and less reliable, as well as late in
delivery in the marketplace.
·
Second, system performance is likely to be worse than
it might otherwise be.
·
Third, the user is going to have to get involved in
performance issues; for instance, the user might have to spend time and effort
on figuring out the best way to express a given query (a state of
affairs, incidentally, that the relational model was explicitly designed to
avoid).
Note: Just as an aside, you might want to try out the
twelve formulations--and any others you can think of--on your own DBMS. You
might discover some interesting things about your optimizer! For my part, I've
certainly encountered products that do not handle the degree of duplication
correctly in all cases--presumably because they are making some expression
transformations that are technically incorrect. In this connection, let me also
draw your attention to another article of mine, Fifty Ways to Quote Your
Querypublished on the website www.dbpd.com in July 1998.
To get back to the main thread of my argument: The foregoing
state of affairs (regarding the fact that duplicates serve as an optimization
inhibitor) is particularly frustrating in view of the fact that, in most cases,
the user probably does not really care how many duplicates appear in the
result. In other words:
a.
Different formulations produce different results, as
demonstrated above; however,
b.
The differences are probably irrelevant from the user's
point of view; but
c.
The optimizer is not aware of this latter fact and is
therefore prevented–-unnecessarily--from performing the transformations it
would like to perform.
On the basis of examples like the foregoing, I would conclude
among other things that users should always ensure that query results
contain no duplicates--for example, by always specifying DISTINCT at
appropriate points in the query--and thus simply forget about the whole
problem. (And if this advice is followed, of course, then there can be no good
reason for allowing duplicates in the database in the first place.)
Note: The alternative in SQL to
SELECT DISTINCT is SELECT ALL (and SELECT ALL is unfortunately the default).
The discussion of the foregoing sections suggests that a more apt alternative
might have been SELECT in DISTINCT ... On a more serious note: The
trouble is, of course, that SELECT DISTINCT takes longer to execute than SELECT
ALL, in general, even if the DISTINCT is effectively a "no-op." But
this problem arises because SQL systems are typically unable to optimize
properly over duplicate elimination, owing to their lack of knowledge of key
inheritance (see my article The Power of the Keys in RELATIONAL DATABASE WRITINGS
1989-1991).
And even if duplicate elimination does sometimes give rise to some performance
overhead, I would still argue that such overhead is a very minor matter when
regarded from the point of view of "the big picture."
A couple of further points to close this section: First, one
reviewer of an earlier draft objected that users don't really have
duplicates in base tables, and the example discussed above thus intuitively
fails. Well, OK; but the trouble is, SQL can generate duplicates in the
results of queries! Indeed, different formulations of "the same"
query can produce results with different degrees of duplication, even if the
input tables themselves don't have any duplicates. By way of example, consider
the following formulations of the query "Get supplier numbers for
suppliers who supply at least one part" on the usual suppliers-and-parts
database:
SELECT s.s#
FROM s
WHERE s.s# IN (SELECT sp.s#
FROM sp);
vs.
SELECT s.s#
FROM s,sp
WHERE s.s# = sp.s#
So if you don't want to think of the tables in Fig. 1 as base tables
specifically, that's fine; just take them to be the output from some previous
queries, and the rest of the analysis goes through unchanged.
The second point is this. Suppose a given table T does permit
duplicates. Then we can't tell the difference between "genuine"
duplicates in T and duplicates that arise from errors in data entry operations
on T! For example, what happens if the person responsible for data entry
unintentionally--that is, by mistake--enters the very same row into T
twice? (Thanks to Fabian Pascalagain
for drawing my attention to this problem.)
Rows Represent Propositions
There's yet another relevant point to be made on this
subject. As I've explained in many places and on many occasions, any given
relational table denotes a certain predicate, and the rows of that table
denote certain true propositions, obtained from the predicate by
substituting certain arguments--that is, values of the appropriate type--for
the placeholders or parameters of that predicate
("instantiating the predicate"). For example, consider the relational
table EMP {EMP#, ENAME, DEPT#, SALARY}. The predicate here looks something like
this:
§
Employee EMP# is named ENAME, works in department
DEPT#, and earns salary SALARY
(the parameters are EMP#, ENAME, DEPT#, and SALARY, corresponding of course to
the four EMP columns). And the corresponding true propositions might look like
this:
§ Employee
E1 is named Lopez, works in department D1, and earns salary 40K
§ Employee
E2 is named Cheng, works in department D1, and earns salary 42K
and so on.
Now, the logical operator AND is idempotent. Among
other things, what this means is that if P is a proposition, then P AND P is
equivalent to just P. For example, if I say "the sun is shining here
today" and "the sun is shining here today," I'm simply telling
you the sun is shining here today! And from this perspective, the notion of
duplicate rows--as that notion is usually understood--obviously makes no
sense at all.
Conclusion
Duplicate row support should be dropped. A strategy for doing
so gracefully was outlined by Codd in his book THE RELATIONAL MODEL FOR
DATABASE MANAGEMENT VERSION 2 (Addison-Wesley, 1990):
1.
Implement an installation-time switch in the DBMS so that the
DBA can specify whether duplicates are to be eliminated (a) in all cases--in
other words, automatically--or (b) only on user request;
2.
Announce that support for Case (b) will be dropped in (say)
two or three years' time;
3.
Drop that support at the appropriate time, simultaneously
upgrading the optimizer to take advantage of the now guaranteed lack of
duplicates.
Technical Correspondence
After the foregoing installment first appeared in my original
DBP&D series, I received a long letter from Robert A. Alps of
Evanston, Illinois. Mr. Alps made so many points in his letter that I decided
to devote a complete installment to replying to him; that installment appears
as Part 2 of the present article.
I also received a letter from Chuck Reinke of Concord,
California, who felt that I had not responded satisfactorily to the cat-food
example. In an attempt to clarify his objections, he suggested the following
example:
“Let's consider rats and ratlets. Whenever there's
a new litter, I want to create an entry for each new ratlet. When just born
they are indistinguishable ... Yet, as they grow, I can distinguish certain
ones by color or behavior. it's time to assign them a unique key.
As Date would have it, prior to this stage ratlets should be
banned from relational representation. Assigning each ratlet an arbitrary
unique key implies nonexistent information, that the ratlets are
distinguishable ... The inadequacy of SQL is a poor argument for prohibiting
duplicates in relational design. I want to keep track of practical real-world
information, not create mathematical abstractions of Aristotelian
purity. “
Space constraints at the time required me to keep my response
short, so I refrained from pointing out that (a) the arguments against
duplicates have nothing to do with the inadequacies of SQL per se, (b)
the distinguishability of the ratlets is a fact, not "nonexistent
information," and (c) "keeping track of practical real-world
information" in a database necessarily involves creating some kind
of "mathematical abstraction"! Instead, I simply replied as follows.
The obvious design for the ratlets problem is:
LITTERS (LITTER_ID,#_OF_RATLETS)
PRIMARY KEY (LITTER_ID)
RATLETS (RATLET_ID,LITTER_ID)
PRIMARY KEY (RATLET_ID)
FOREIGN KEY (LITTER_ID)
REFERENCES LITTERS
When there's a new litter, we make the obvious entry in LITTERS. When an
individual ratlet becomes "interesting" (unlike Mr. Reinke, I do not
say "distinguishable," because distinguishability presupposes
identity), we make the obvious entry in RATLETS.
Posted
03/18/02
[ABOUT]
[QUOTES]
[LINKS]