(Thanks to Chris Date for a review of and helpful comments on
a draft of this article.)
A reader has recently forwarded an interview with Microsoft's
Jim Gray about the future of SQL Server 2000 and "extreme scaling from
databases" (whatever that means), published on February 1, 2002. For those
who don't know, "[Gray] founded and runs Microsoft's Bay Area Research
Center (BARC), a research lab that focuses on scalable computing by building
superservers and workgroup systems from commodity software and hardware. Jim
Gray was a key player in building the TerraServer, a 12-terabyte spatial data
warehouse using Microsoft SQL Server 2000 as its building blocks."
Jim Gray is one of the recipients of the industry's top
Turing Award, another recipient being Ted Codd, the inventor of the relational
model. And it is in the context of the award that I was quite disappointed with
the interview. I have yet to encounter one instance of association of technical
expertise with a commercial interest, where the latter has not had some
deleterious effect--be it significant or mild, obvious or subtle, conscious or
not--on the former (Codd resigned from IBM to avoid it). The Gray interview
does nothing to undermine my discomfort with such associations.
Now, from personal experience I am very much aware that the
trade media tends to distort articles or interviews, rendering them
incomprehensible, or even reversing intended meaning. This is one reason I
refrain from writing for, or giving interviews to the mainstream media -- too
often I would not recognize my material when it was published--and why I take
the trouble to maintain this web site. I have to assume that Gray's interview
is plagued at least to some degree by this or, at the very least, I would like
to hope this is the case. Indeed, many of the questions make no sense and in
the main I ignore those parts. Unfortunately, the average reader will be misled
whether the published interview accurately reflects Gray's pronouncements or
not. I would like, therefore, to emphasize that the following critique is of
the interview as published.
Let's consider the parts of the interview that merit comments
(ignoring those that make no sense, or have too obvious commercial tones).
Q: How will the upcoming integration of .NET into SQL
Server affect performance, use of SQL Server as a data source for Web services,
and so on. And how will it work? Will SQL Server essentially become a .NET
host? Or just hooks with .NET Framework namespaces and classes?
JG:The
integration of SQL Server with .NET is a major advance. It's the next step
after OLE DB-integrating objects with databases. SQL Server becomes the data
integrator, the way to deal with sets of objects. We have been chasing the
Object-Relational rainbow and we are near the pot of gold-an extensible
database. Now you will be able to store objects in the database, and you will
be able to treat databases as objects. Performance should be even better with
compiled languages than with interpreted T-SQL. And now stored procedures will
be first-class objects.
Object orientation has fundamental flaws for database
management. Essentially, it is a set of guidelines for "good"
programming, pertinent at the application level, and extending it to
data management is an unintended use and a highly questionable proposition at
best, not a pot of gold. As I explain in PRACTICAL ISSUES
IN DATABASE MANAGEMENT, object concepts and terminology are fuzzy,
there is no agreement on them, no sound theoretical foundation and no provision
of a specific data model analogous to the relational model.
"Store objects in the database, and treat databases as objects" are
exactly the kind of statement that object thinking induces and it is unclear
what calling things "objects" contributes to the discussion To quote
from Chris Date's and Hugh Darwen's THIRD MANIFESTO:
"... we acknowledge the desirability of supporting
certain features that are commonly regarded as aspects of object orientation.
However, we believe that the features in question are orthogonal to (i.e.,
independent of) the relational model, and hence the relational model needs no extension,
no correction, no subsumption--and above all, no perversion!--in
order to support [those features]."[emphasis theirs]
Q: SQL Server appears to be taking on more and more
roles for any and all types of static data stores. What types of data stores
can you foresee SQL Server replacing? The Registry? The file services? E-mail
storage?
JG:Yes.
Oracle, DB2, and SQL Server are each on a path to store "files":
photos, videos, voicemail, e-mail, documents, and everything else. The
integration of SQL Server with an object system (.NET) makes this attractive.
Microsoft has been shooting for this for a long time. To the database folks it
seems a natural evolution. To others it seems unnatural. There is a competition
happening-each camp proposing a way to solve the problems. No doubt the end
result will be a synthesis of all these ideas. But I think we all agree that
storage management has to be more automatic and easier. We all have a common
goal. The only debate is how to get there.
"We"
don't all have a common goal. Anyway, files are an implementation, not
a model, aspect and focus on them raises suspicions of a logical-physical
confusion that is prevalent in the industry. The real issue is whether the DBMS
supports user-defined data types of arbitrary complexity (a model
aspect), namely: what representations are declared to users (not
what files are physically stored) and what constraints and operators
are provided for them. As I explain in Chapter 1 of my book,
support of such types raises some nontrivial complications, to put it mildly,
for any DBMS, ODBMS included. Gray is silent on those issues.
Q: How can modern DBMSs deal with the impedance
mismatch between relational databases and nonrelational data (e.g., XML)? What
innovations are on the horizon to make access to hierarchical and nonrelational
data transparent? What's new in terms of performance and scalability with
nonrelational data?
JG: Much has been
made of the impedance
mismatch. SQL is set-oriented and Cobol-C-Java-C# are iterator-oriented (one
record at a time, please). The languages have rich type systems, while
databases have poor type systems. This is about to change with .NET. The
database is a Web service. The type system is universal. You can add databases
to types; you can add types to databases. There is no longer an inside and an
outside-the database is both above and below the type system. XQuery extends
SQL to operate on trees and graphs (unifies SQL and XPath). That story is not
over yet, but XQuery is our best answer so far. We are still exploring how to
handle XML data: Do we shred it all, do we BLOb it, or something in between?
Between is the obvious answer.
How about, Chris Date asks, a reasoned assessment as
to whether XQuery is a good idea (people who understand (truly understand)
the relational model, should say no!) Support of "complex"
data types requires neither XML, nor XQuery.
According to its proponents, XML was invented to serve as a
standard format for data exchange. As such, it does not require--and it
did not initially provide--a data model, namely: data types, organization,
integrity and manipulation. Like object orientation, XML is now
being extended to data management, for which it was explicitly never intended.
And, of course, this is impossible without some data model. That's why
types and querying must be added, so the industry is now reinventing the
database wheel. The problem is, of course, that the XML structure is hierarchic
and, thus, nonrelational, so the wheel being reinvented is a "square"
one, one that we discarded decades ago (see my articles on XML in the Against
the Grain series. Thus, not only is SQL a poor implementation of the
relational model, but it is now further bastardized to support obsolete
nonrelational databases (not to mention pointers!!!) And this is
considered progress?
Q: What are your predictions about the future of SQL
Server Analysis Services in the business world, and the fate of OLAP, data
mining, and business intelligence in general? Is Microsoft doing enough to
evangelize the usefulness of these technologies in industry? What prevents
these technologies from becoming more mainstream and commonplace? Will
Microsoft's new Data Analyzer product help here?
JG: Data Analyzer
will definitely help Analysis
Services both in terms of OLAP and data mining. One aspect of SQL Server is
that everyone gets everything. There is no separate charge for replication or
for OLAP or for data mining or English Query. So the fact that these features
are widely used is not so obvious. I am told that each of these features is
very widely used. Certainly, when I go to the users-group meetings and when I
lurk in the discussion groups, there is a huge amount of interest and
innovation in these areas. The basic evolution seems to be that people build an
operational system using SQL Server, and then they build some ancillary
decision-support systems using that data. Most of those ancillary systems are
built with Analysis Services. This might not seem mainstream, but the
integrated system and tools make it easy to build these analysis systems. This
is a reasonable way to operate.
If "everybody gets everything" is only one
aspect of SQL Server, I wonder what the others are. This is a good example of
the discomfort I was referring to: such claims--not exactly technical--are not
likely to be made except in a commercial context.
From a technical standpoint it is important to distinguish
between industry practices as they are, and what they ought to be. OLAP
is what people are doing, but is it an appropriate solution to the problem
which it purports to solve? The fact is that, like object orientation and XML,
the emergence of OLAP and so-called "analytical services" is yet
another example of the industry ignoring data fundamentals (see On Business
Rules and Integrity Constraints: A Reply to Ralph Kimball. It is yet
another regression to the bad old days of application-biased files,
which were discarded in favor of centralized, application-neutral databases.
It is because the industry chose SQL instead of a true and full implementation
of the relational model, that commercial products are inhibited in
functionality and performance, forcing users to resort to obsolete,
unproductive workarounds, which had already been discarded in the past. A true
implementation of relational technology would have obviated the need OLAP (and,
as I alluded in my writings, there is now an implementation technology which
has the potential to facilitate truly relational products).
Q: Do you see Microsoft technologies gaining greater
acceptance among academic computer scientists? Do you think the ubiquity of
Unix in the academic world is a hindrance to Microsoft's success in the
business world?
JG: All my
academic friends have a Windows
system for their PowerPoint presentations. But they often have a Solaris or
Linux box where they do their programming. Certainly it would be better for
Microsoft if everyone used only our stuff, but that will never happen. In the
SkyServer project the heterogeneity costs us a lot of energy. We have to
support Netscape 4-6 and Opera. All the client tools have to work on Linux.
That makes it much more complex to deploy apps. But that's the reality. As for
Microsoft making progress in academia, that is certainly happening, and it is
mostly being driven by our tools people. Academics are working with languages
like Eiffel to make them compatible with our .NET technologies so that Visual
Studio .NET is a great environment for programming with any language. The
benefit to the academics and their students is that they get a robust, highly
usable environment: a debugger, integrated help, rich editing capabilities, and
so on.
Shouldn't academia be about education, rather than products?
Turns out that it is not less prone to succumb to industry hype than the
business world (see my two editorialsand Denormalization
for Performance – Et Tu Academia?).
Q: What do you see for these kinds of databases in the
next decade? What will we be doing with data then?
JG: When data
volumes get large, you can't grep
them any more, you have to use an index. That's when you need some kind of
database. My group has been trying to figure out how much you should put in the
database-everything or just the indices. As I get into it, I find it is a lot
easier to put everything in the database: photos, videos, images, and of
course, text. Sometimes it is just BLOb data, but often you want to
"parse" the data and extract as much meta information as possible.
For example, we built a database that has all our personal photos. The classic
design would have these photos in a file system and just the metadata in the
database. But it is so much easier to manage when everything is in the
database. There is only one thing to back up and to secure-everything is just
simpler. There are fewer design decisions to make. My prediction is that
eventually all storage systems will evolve to be database systems.
The logical-physical confusion raises its ugly head again. It
is hard to believe that the justification for databases is being reduced to a
need for indexes. Was the migration from application programs and files to
DBMSs and databases mainly for improved performance via indexes? Weren't ISAM
files indexed? Interestingly, the current generation of Java programmers, who
lack knowledge of data fundamentals, explicitly dismiss the use of databases
and DBMSs for precisely the opposite reason: they claim that databases are the
"performance bottleneck" and should be avoided at all costs. For
example:
"With Oracle9i anything you can model in Java, you can
model in Oracle. If just a few good programmers see the value of using this
existing reliable infrastructure rather than recreating it, we can get on with
the next paradigm shift, which is to model our business problems as objects
instead of relationally mapped to Java objects. We're still stuck in the
relational rut just as the hierarchical and network databasers of the past were
stuck in a rut when relational technology first came out."
"The only rules that should reside in a database are
referential integrity (and sometimes that isn't really necessary). It is also
best to keep rules out of your data access code (hardcoding WHERE values)
Business rules should be centralized in Java business objects for better
manageability, scalability, etc. Don't let pushy DBAs tell you otherwise. Rules
in a database slow down development as well as data access time."
What is even more ironic is that indexes were made necessary
by the lack of relational fidelity of SQL and its commercial implementations:
their support of physical data independence is rather poor. SQL DBMSs
maintain a 1:1 relationship between a logical row and a physical record, which
imposes specific physical orderings on rows and on columns, robbing SQL tables
of set properties and constraining optimization. The new implementation
technology I alluded to should allow the development of RDBMSs that are not so
constrained. Aside from being much more optimizable, true RDBMSs may actually
not need to rely on indexes, and may obviate the need for various data
administration complexities of SQL products.
As to the notion that there are fewer design decisions if
multimedia files are stored in the database, it is an illusion-–the opposite is
actually true (see Chapter 1 of PRACTICAL ISSUES
IN DATABASE MANAGEMENT). As I already explained, files are not at the
model level and do not do away with logical design—nothing does. But I
understand the attraction of such illusions: design requires thinking and
knowledge, and that is hard.
Posted
03/22/02
[ABOUT]
[QUOTES]
[LINKS]