DOUBLE TROUBLE, DOUBLE TROUBLE PART 1
by Chris Date

 

 

 

As tedious as a twice-told tale.

--William Shakespeare

 

 

Preliminaries

 

What follows is far from new; I just feel, sadly, that it needs to be aired still one more time... The immediate cause for my feeling this way was a letter I received recently from a colleague in Finland, Lauri Pietarinen, which read in part as follows:

 

"I just came back from the [September 2001] VLDB Conference in Rome ... I was talking to <name suppressed to protect the guilty> about why we should stick to "real" relations instead of the current tuple-bag implementations ... It was his opinion that from the optimization point of view we have nothing to gain from using relations as compared to the current situation. He referred to the book DATABASE SYSTEM IMPLEMENTATION by Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom, where he said a tuple-bag algebra had been presented ... Could you please summarize the reasons for using relations ... I know you have addressed this matter in numerous writings, but just to summarize the issue (?) ... The point of view mainly of interest was that of optimization (that is, can we do more optimization with relations?)."

 

Well, it's certainly true that I've "addressed this matter in numerous writings" -- and I really don't think I can do better to "summarize the issue" than repeat some material that originally appeared in my regular column for Database Programming & Design, in Installments Nos. 17 (Toil and Trouble DBP&D 7, No. 1, January 1994) and 34 (And Cauldron Bubble" DBP&D 8, No. 6, June 1995). (These two installments were republished-- the second under its original title The Department of Redundancy Department--by permission of Miller Freeman Inc. in my books RELATIONAL DATABASE WRITINGS 1991-1994 and RELATIONAL DATABASE WRITINGS 1994-1997, respectively.) The two-part article that follows, therefore, is for the most part just a lightly edited version of those two installments. However, it does also contain some new material, including in particular an appendix to the second part that discusses the tuple-bag algebra of Garcia-Molina et al. in some detail. As requested, that appendix does consider the implications of that algebra for optimization.

 

Note:As we'll see from that discussion--as well as from material in Part 1 of the article--the answer to Lauri Pietarinen's question ("Can we do more optimization with relations?") is very definitely yes
 

 

Introduction

 

I've stated in many places and on many occasions that duplicate rows (hereinafter usually abbreviated to just duplicates) are, and always were, a mistake in SQL. In fact, it's my position that (a) duplicate rows should never have been permitted in the first place, but (b) given that they were permitted, they ought to be avoided in practice. In what follows, I give some solid reasons for this position. 
 

The Cat Food Example

 

The following extract from an article by David Beech (New Life for SQL, Datamation, February 1st, 1989) is typical of the arguments that are advanced by duplicate-row advocates.

 

“For example, the row "cat food 0.39" could appear three times [on a supermarket checkout receipt] with a significance that would not escape many shoppers ... At the level of abstraction at which it is useful to record the information, there are no value components that distinguish the objects. What the relational model does is force people to lower the level of abstraction, often inventing meaningless values to be inserted in an extra column whose purpose is to show what we knew already, that the cans of cat food are distinct.”

 

Apart from the remark regarding "lowering the level of abstraction," which I think is just arm-waving, this seems to me to be exactly the straw-man argument I gave in my article Why Duplicate Rows Are Prohibited (in RELATIONAL DATABASE WRITINGS 1985-1989, under the heading Why Duplicates Are Good (?) -- to wit:

 

1. Duplicates occur naturally in practice. 

2. Given that this is so, it's a burden on the user to have to invent some artificial identifier in order to distinguish between them.

 

Note: If anything, in fact, the relational representation raises the level of abstraction, because it eliminates irrelevant details--details, that is, that have to do merely with the way data is presented to the end user.  After all, the checkout receipt is really just a report (it certainly isn't part of the database!), and there's no particular reason why we should have to represent data in the same way in the database and in a report; in fact, there are good reasons not to. (Thanks to Fabian Pascal for these observations, and more generally for his review of--and helpful comments on--an earlier draft of this article.)

 

In a subsequent section of that same article, entitled Why Duplicates Are Bad: The Fundamental Issue, I went on to refute this argument as follows:

 

1.      Individual objects must be identifiable (that is, distinguishable from all other objects) -- for if an object is not identifiable, then it's impossible even to talk about it, let alone perform any kind of operation upon it or use it for any sensible purpose. In other words, objects must have identity. (Here and throughout this article, I use the term object in its ordinary English sense, not in the special rather loaded sense in which it's used in the OO world.)

2.      In a collection of objects in which there are no duplicates (in other words, a mathematical set), objects obviously do have identity, because they are in fact self-identifying. For example, in the set of integers {3,6,8,11}, there's no ambiguity as to which element of the set is "6" and which is "8" (etc.). However, in the collection (3,6,6,8,8,8,11), which is certainly not a mathematical set (it is a multiset or bag instead), we cannot make an analogous statement; both "6" and "8" are now ambiguous.

3.      So what is the identification mechanism in a collection that permits duplicates (in other words, a bag)? For example, in the bag just shown, how can we distinguish the two "6"s from one another? Note that there must still be an identification mechanism; if we cannot distinguish the two "6"s from one another somehow, we cannot even tell there are two of them (an illustration of what I believe is known in the world of philosophy as the Principle of Identity of Indiscernibles). In other words, we would not even know there were any duplicates in the first place!

 

Now, a common reaction to this argument is "But I really don't need to distinguish among the duplicates, all I want to do is be able to count them." The point I'm trying to make is that you do need to distinguish them, even just to count them (for otherwise how do you know which ones you've already counted?). This point is crucial, of course, and I really don't know how to make it any more strongly than I already have.

 

How then do we distinguish duplicates such as the two "6"s in the bag shown above? The answer, of course, is that we do so by their relative position; we say something like "this 6 is here and that 6 is there," or "this one is the first 6 and that one is the second."

 

And so we have now introduced a totally new concept, one that is quite deliberately omitted from the relational model: positional addressing. Which means that we are now quite beyond the pale!--that is, we have moved quite outside the framework of relational theory. Which means that there is no guarantee whatsoever that any results that hold within that framework still apply. For example, does JOIN still work? (As a matter of fact, it doesn't.) What about UNION? EXTEND? SUMMARIZE? Are the theorems of normalization still valid? What about the quantifiers EXISTS and FORALL? What about the rules for functional dependency inference? What about expression transformation and optimization? Etc., etc., etc.

 

Furthermore, we now definitely need certain additional operators, such as "retrieve the nth row" or "insert this new row here" or "move this row from here to there." In my opinion--pace Beech--these operators constitute a much greater burden on the user than does the occasional need to invent an artificial identifier.

 

The relational model, by contrast, adopts the position that, since objects do have to be identifiable somehow, then we might as well represent their identity in exactly the same way as everything else: namely, by values in columns. (Especially as there will often be a "natural" identifier that is usable as such a column value anyway, which means that the problem of having to invent an artificial value might not arise all that often in practice.) In this way we can stay securely within the context of relational theory, and all of the desirable properties of that theory will thus still be directly applicable.

 

To return for a moment to the cat-food argument, Beech goes on to say:

 

“We are not being less than respectable mathematically if we consider collections containing duplicates, because mathematicians deal with such collections, called multisets or ... bags.”

 

The point here seems to be that the usual advantage claimed for the relational model, to the effect that the model is at least "mathematically respectable," can be claimed by the duplicate-row advocates for their "model" too. But all of the mathematical "bag theory" treatments I've ever seen start off by assuming there's a way to count duplicates! And that assumption, I contend, effectively means that bags are defined in terms of sets--each bag element really has a hidden identifying tag that distinguishes it somehow, and the bag is really a set of tag/element pairs. I see no advantage, and definite disadvantages, in introducing this extra level of complexity. 


 

Expression Transformation

 

The foregoing argument should be sufficient (I hope) to show why I think it was a mistake for both theoretical and practical reasons--to include duplicate support in SQL in the first place. Now I'd like to go on to argue that, even though duplicates are supported, you should take care to avoid them in practice. (The following argument is based on an example originally due to Nat Goodman.)

 

The fundamental point is that--as I hinted in the previous section--expression transformations, and hence optimizations, that are valid in a relational context are not necessarily valid in the presence of duplicates. Here's an example. Consider the database shown in Fig. 1.

 

Table 'P'

P#

PNAME

P1

Screw

P1

Screw

P1

Screw

P2

Screw

 

Table 'SP':

S#

P#

S1

P1

S1

P1

S1

P2

                         

                      Fig. 1: A database with duplicates

 

Perhaps I should begin by asking the question: What does it mean to have three "(P1,Screw)" rows in table P and not two, or four, or seventeen? It must mean something, for if it means nothing, then why are the duplicates there in the first place? To paraphrase a point first nicely made by Codd (in a live presentation): If something is true, saying it twice doesn't make it any truer.

 

So let's assume that there is indeed some meaning attached to the existence of duplicates, even though that meaning, whatever it is, is hardly very explicit. (I note in passing, therefore, that duplicates contravene another of the objectives of the relational model, namely the objective of explicitness: that is, the meaning of the data should be as explicit as possible. The presence of duplicates implies that part of the meaning is hidden.) In other words, given that duplicates do have some meaning, there are presumably going to be business decisions made on the basis of the fact that, for example, there are three "(P1,Screw)" rows in table P and not two or four or seventeen. (For if not, then why are the duplicates there in the first place?)

 

Now consider the following query: "List part numbers for parts that either are screws or are supplied by supplier S1, or both." Here are some candidate SQL formulations for this query, together with the output produced in each case.

 

Note: Thanks to Jim Panttaja for checking these results for me, using Microsoft SQL Server Release 4.2a running on OS/2. See also Chapter 4, Section 4b, of Fabian Pascal' s book PRACTICAL ISSUES IN DATABASE MANAGEMENT: A REFERENCE FOR THE THINKING PRACTITIONER, which reports on the results of running a similar experiment. In the interests of accuracy, I should add that a couple of the "candidate formulations" (which?) are perhaps not true candidates anyway, inasmuch as they assume that every part that is a screw is supplied by at least one supplier (thanks to Hugh Darwen for this observation). 
 

1.

SELECT p#
FROM p
WHERE pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1'); 

 

Result: P1*3,P2*1.

 

2.

SELECT p#
FROM sp
WHERE s# = 's1'
OR p# IN (SELECT p#
FROM p
WHERE pname = 'Screw');

 

Result:P1*2,P2*1.

 

3.

SELECT p.p#
FROM p,sp
WHERE (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';

 

Result:P1*9,P2*3.

 

4.

SELECT sp.p#
FROM p,sp
WHERE (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';

 

Result:P1*8,P2*4.

 

5.

SELECT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT p#
FROM sp
WHERE s# = 's1';

 

Result: P1*5,P2*2.

 

6.

SELECT DISTINCT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT p#
FROM sp
WHERE s# = 's1';

 

Result:P1*3,P2*2.

 

7.

SELECT p#
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT DISTINCT p#
FROM sp
WHERE s# = 's1';

 

Result: P1*4,P2*2.

 

8.

SELECT DISTINCT p#
FROM p
WHERE pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1');

 

Result: P1*1,P2*1.

 

9.

SELECT DISTINCT p#
FROM sp
WHERE s# = 's1'
OR p# IN (SELECT p#
FROM p
WHERE pname = 'Screw'); 

 

Result: P1*1,P2*1.

 

10.

SELECT p#
FROM p
GROUP BY p#,pname
HAVING pname = 'Screw'
OR p# IN (SELECT p#
FROM sp
WHERE s# = 's1');

 

Result: P1*1,P2*1.

 

11.

SELECT p.p#
FROM p,sp
GROUP BY p.p#,pname,s#,sp.p#
HAVING (s# = 's1' AND p.p# = sp.p#)
OR pname = 'Screw';

 

Result: P1 * 2, P2 * 2.

 

12.S

ELECT p#
FROM p
WHERE pname = 'Screw'
UNION
SELECT p#
FROM sp
WHERE s# = 's1';

 

Result: P1*1,P2*1. 


The obvious first point to make is that the twelve different formulations produce nine different results!--different, that is, with respect to their degree of duplication. (I make no claim, incidentally, that either the twelve different formulations or the nine different results are the only ones possible; indeed, they aren't, in general.) Thus, if the user really cares about duplicates, then he or she needs to be extremely careful in formulating the query appropriately in order to obtain exactly the desired result.

 

Furthermore, of course, analogous remarks apply to the system itself: Because different formulations can produce different results, the system optimizer too has to be extremely careful in its task of expression transformation (that is, transforming one formulation into another). In other words, duplicate rows act as a significant optimization inhibitor (see Part 2 of my article Expression Transformationin RELATIONAL DATABASE WRITINGS 1989-1991). Here are some implications of this point:

 

·   First, the optimizer code itself is harder to write, harder to maintain, and probably more buggy--all of which combines to make the product simultaneously more expensive and less reliable, as well as late in delivery in the marketplace. 

·   Second, system performance is likely to be worse than it might otherwise be. 

·   Third, the user is going to have to get involved in performance issues; for instance, the user might have to spend time and effort on figuring out the best way to express a given query (a state of affairs, incidentally, that the relational model was explicitly designed to avoid).

 

Note: Just as an aside, you might want to try out the twelve formulations--and any others you can think of--on your own DBMS. You might discover some interesting things about your optimizer! For my part, I've certainly encountered products that do not handle the degree of duplication correctly in all cases--presumably because they are making some expression transformations that are technically incorrect. In this connection, let me also draw your attention to another article of mine, Fifty Ways to Quote Your Querypublished on the website www.dbpd.com in July 1998. 

 

To get back to the main thread of my argument: The foregoing state of affairs (regarding the fact that duplicates serve as an optimization inhibitor) is particularly frustrating in view of the fact that, in most cases, the user probably does not really care how many duplicates appear in the result. In other words:

 

a.      Different formulations produce different results, as demonstrated above; however,

b.      The differences are probably irrelevant from the user's point of view; but

c.      The optimizer is not aware of this latter fact and is therefore prevented–-unnecessarily--from performing the transformations it would like to perform.

 

On the basis of examples like the foregoing, I would conclude among other things that users should always ensure that query results contain no duplicates--for example, by always specifying DISTINCT at appropriate points in the query--and thus simply forget about the whole problem. (And if this advice is followed, of course, then there can be no good reason for allowing duplicates in the database in the first place.)

 

Note: The alternative in SQL to SELECT DISTINCT is SELECT ALL (and SELECT ALL is unfortunately the default). The discussion of the foregoing sections suggests that a more apt alternative might have been SELECT in DISTINCT ... On a more serious note: The trouble is, of course, that SELECT DISTINCT takes longer to execute than SELECT ALL, in general, even if the DISTINCT is effectively a "no-op." But this problem arises because SQL systems are typically unable to optimize properly over duplicate elimination, owing to their lack of knowledge of key inheritance (see my article The Power of the Keys in RELATIONAL DATABASE WRITINGS 1989-1991). And even if duplicate elimination does sometimes give rise to some performance overhead, I would still argue that such overhead is a very minor matter when regarded from the point of view of "the big picture."

 

A couple of further points to close this section: First, one reviewer of an earlier draft objected that users don't really have duplicates in base tables, and the example discussed above thus intuitively fails. Well, OK; but the trouble is, SQL can generate duplicates in the results of queries! Indeed, different formulations of "the same" query can produce results with different degrees of duplication, even if the input tables themselves don't have any duplicates. By way of example, consider the following formulations of the query "Get supplier numbers for suppliers who supply at least one part" on the usual suppliers-and-parts database: 

 
SELECT s.s# 
FROM s
WHERE s.s# IN (SELECT sp.s#
FROM sp);


vs.
 

SELECT s.s#
FROM s,sp
WHERE s.s# = sp.s#


So if you don't want to think of the tables in Fig. 1 as base tables specifically, that's fine; just take them to be the output from some previous queries, and the rest of the analysis goes through unchanged.

 

The second point is this. Suppose a given table T does permit duplicates. Then we can't tell the difference between "genuine" duplicates in T and duplicates that arise from errors in data entry operations on T! For example, what happens if the person responsible for data entry unintentionally--that is, by mistake--enters the very same row into T twice? (Thanks to Fabian Pascalagain for drawing my attention to this problem.)


 

Rows Represent Propositions

 

There's yet another relevant point to be made on this subject. As I've explained in many places and on many occasions, any given relational table denotes a certain predicate, and the rows of that table denote certain true propositions, obtained from the predicate by substituting certain arguments--that is, values of the appropriate type--for the placeholders or parameters of that predicate ("instantiating the predicate"). For example, consider the relational table EMP {EMP#, ENAME, DEPT#, SALARY}. The predicate here looks something like this: 
 

§   Employee EMP# is named ENAME, works in department DEPT#, and earns salary SALARY 


(the parameters are EMP#, ENAME, DEPT#, and SALARY, corresponding of course to the four EMP columns). And the corresponding true propositions might look like this: 
 

§   Employee E1 is named Lopez, works in department D1, and earns salary 40K

§   Employee E2 is named Cheng, works in department D1, and earns salary 42K 


and so on.

 

Now, the logical operator AND is idempotent. Among other things, what this means is that if P is a proposition, then P AND P is equivalent to just P. For example, if I say "the sun is shining here today" and "the sun is shining here today," I'm simply telling you the sun is shining here today! And from this perspective, the notion of duplicate rows--as that notion is usually understood--obviously makes no sense at all. 
 
 

Conclusion

 

Duplicate row support should be dropped. A strategy for doing so gracefully was outlined by Codd in his book THE RELATIONAL MODEL FOR DATABASE MANAGEMENT VERSION 2 (Addison-Wesley, 1990): 

 

1.      Implement an installation-time switch in the DBMS so that the DBA can specify whether duplicates are to be eliminated (a) in all cases--in other words, automatically--or (b) only on user request;

2.      Announce that support for Case (b) will be dropped in (say) two or three years' time;

3.      Drop that support at the appropriate time, simultaneously upgrading the optimizer to take advantage of the now guaranteed lack of duplicates.

 

Technical Correspondence

 

After the foregoing installment first appeared in my original DBP&D series, I received a long letter from Robert A. Alps of Evanston, Illinois. Mr. Alps made so many points in his letter that I decided to devote a complete installment to replying to him; that installment appears as Part 2 of the present article.

 

I also received a letter from Chuck Reinke of Concord, California, who felt that I had not responded satisfactorily to the cat-food example. In an attempt to clarify his objections, he suggested the following example:

 

“Let's consider rats and ratlets. Whenever there's a new litter, I want to create an entry for each new ratlet. When just born they are indistinguishable ... Yet, as they grow, I can distinguish certain ones by color or behavior. it's time to assign them a unique key.

 

As Date would have it, prior to this stage ratlets should be banned from relational representation. Assigning each ratlet an arbitrary unique key implies nonexistent information, that the ratlets are distinguishable ... The inadequacy of SQL is a poor argument for prohibiting duplicates in relational design. I want to keep track of practical real-world information, not create mathematical abstractions of Aristotelian purity. “

 

Space constraints at the time required me to keep my response short, so I refrained from pointing out that (a) the arguments against duplicates have nothing to do with the inadequacies of SQL per se, (b) the distinguishability of the ratlets is a fact, not "nonexistent information," and (c) "keeping track of practical real-world information" in a database necessarily involves creating some kind of "mathematical abstraction"! Instead, I simply replied as follows.

 

The obvious design for the ratlets problem is: 
 

LITTERS (LITTER_ID,#_OF_RATLETS)
PRIMARY KEY (LITTER_ID)
RATLETS (RATLET_ID,LITTER_ID)
PRIMARY KEY (RATLET_ID)
FOREIGN KEY (LITTER_ID)
REFERENCES LITTERS


When there's a new litter, we make the obvious entry in LITTERS. When an individual ratlet becomes "interesting" (unlike Mr. Reinke, I do not say "distinguishable," because distinguishability presupposes identity), we make the obvious entry in RATLETS.

 

 

Posted 03/18/02

 

 

 

[ABOUT] [QUOTES] [LINKS]