I am neither a big fan of XML nor an expert. Over the years I have seen XML used and abused yet another silver bullet to solve all of our problems. I have seen XML fail on many projects because it is too verbose - consuming too much bandwidth and processing power where good ol' fixed-length records would have been fine. I have also seen XML as a good solution to interfacing problems and I (almost) unknowingly use it every day interfacing with web services and configuring systems.

Although XML is interesting, I never found the motivation or need to go out and buy the biggest book on XML that I can find in order to master what should really be a simple markup language. XML gets complicated fast and before you know it you are either confused or an XML bigot (where your confusion is hidden behind a veneer of expertise).[@more@]

I wouldn't call myself a DBA, but I have done enough late night babysitting of production databases to have an idea how my design and architectural decisions impact the database the least scalable of any part of the architecture. So in my opinion XML - as a bloated, misunderstood, misinterpreted, over-hyped and often badly implemented technology should be nowhere near the database. End of discussion no budging. Holding such a view won't get me fired and will win the welcome support of the production DBAs who could always point fingers at my bad SQL syntax and data structures as a reason for poor performance.

I confess that I have used XML in the database in my latest project.

There, I have said it and in the company of expert DBAs who have had enough of people like me putting all sorts of crap into the database. Amongst such company I had better come up with an explanation and fast. Allow me to do so.

There are two situations where XML in SQL 2005 comes in quite handy. They are when implementing classifications and temporal data designs.

SQL 2005 XML for Classifications

One thing that is missing in SQL databases is a construct to handle classifications. It is a problem that designers always encounter and the mechanisms for the physical implementations vary extensively.

What do I mean by a classification? Consider a system that has an order and a customer table, where the key from the customer is put into the order table simple normalization. Now what if your customer can be a person, with a first name, surname and date of birth or it can be an organization, with a name and company registration number? The problem is that it doesn't feel right to create a single table with all of the attributes (breaking some relational model rules in the process) and it is difficult to implement separate [Person] and [Organization] tables.

There are two basic approaches, either a roll-up or a roll-down approach.

In the roll-up approach a single table is created with all of the fields and discriminator or classification attributes to distinguish between the classes. The individual classes can then be implemented as views as follows:

CREATE TABLE 
	Stakeholder(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime,
Name varchar(100),
RegistrationNo varchar(20),
StakeholderClass int,
isPerson bit,
isOrganization bit
)
GO
CREATE VIEW Person
AS
SELECT 
	StakeholderId, Firstname, Surname, DateOfBirth
FROM Stakeholder
WHERE isPerson=1
GO
CREATE VIEW 
	Organization
AS
SELECT 
	StakeholderId, Name, RegistrationNo
FROM Stakeholder
WHERE 
	isOrganization=1

In the roll-down approach multiple tables are created with their correct attributes and the 'leaf' tables are unioned together for an index table as follows:

CREATE TABLE 
	Person(
StakeholderId int,
Firstname varchar(50),
Surname varchar(100),
DateOfBirth datetime
)
 
CREATE TABLE 
	Organization(
StakeholderId int,
Name varchar(100),
RegistrationNo varchar(20),
)
 
CREATE VIEW 
	Stakeholder
AS
SELECT 
	StakeholderId, Firstname+' '+Surname
	AS Name, CAST(1
	AS bit) AS
	isPerson, CAST(0
	AS bit) AS
	isOrganization
FROM Person
UNION
SELECT 
	StakeholderId, Name,
	CAST(0 AS bit)
	AS isPerson, 
	CAST(1 AS bit)
	AS isOrganization
FROM Organization

Combinations of the two approaches also exist where some fields are created on an index table and other fields exist only on the leaf tables. Things get a bit complicated when implementing complex classification structures. What if some persons are employees (add an employee number) and what if some employees are contractors (add contract period)? Not only do complex structures become difficult to implement but CRUD across index tables and views becomes a pain.

I have implemented such mechanisms on large databases quite successfully - such as a bank that had 14 million customers. But it can become quite complex and I was looking for a simple classification mechanism that would be reasonably easy to implement, not contain excessive tables or attributes and would be able to be extended or replaced. Enter the XML temptress

I create a simple index table with a column to store all the class-specific attributes as follows:

CREATE TABLE 
	Stakeholder(
StakeholderId int 
	IDENTITY(1,1),
Name varchar(100),
ClassAttributes xml,
isPerson bit,
isOrganization bit
)

Since my application data access layer uses only sprocs to access the database, some insert and update sprocs for the person and the organization need to be written. For example with person:

CREATE PROCEDURE 
	InsertPerson
@Firstname varchar(50),
@Surname varchar(100),
@DateOfBirth datetime
AS
DECLARE @Extended 
	xml
 
SET @Extended=''+@FirstName+''+@Surname+''
IF (@DateOfBirth
	IS NOT NULL) 
  BEGIN
    DECLARE @Dob 
	nvarchar(10)
    SET @Dob=CONVERT(nvarchar(10),@DateOfBirth,20)
    SET @Extended.modify('insert 
	{sql:variable("@Dob")} as last into (/person)[1]')
  END
 
INSERT INTO 
	Stakeholder(Name,ClassAttributes,isPerson)
VALUES(@FirstName+' 
	'+@Surname,@Extended,1)

Executing the above sproc like this:

EXEC InsertPerson
	'Joe', 
	'Soap', '1 Jan 1980'

Results in a record with the basic information and an XML structure that neatly contains all of the other bits of information and would store the following XML:

<person>
    <firstName>JoefirstName>
    <surname>Soapsurname>
    <dateOfBirth>1980-01-01dateOfBirth>
person>

Notice the use of the XQuery insert that only adds the attribute if it is not null, resulting in neater looking XML data.

A similar sproc for organization would store XML something like this:

<organization>
    <name>SQLServerCentralname>
    <registrationNo>ABC1234registrationNo>
organization>

My individual [Person] and [Organization] tables are implemented as views like this:

CREATE VIEW Person
AS
SELECT 
	StakeholderId, ClassAttributes.value('(/person/firstName)[1]',
	'varchar(50)')
	AS FirstName, 
  ClassAttributes.value('(/person/surname)[1]',
	'varchar(100)')
	AS Surname, 
  ClassAttributes.value('(/person/dateOfBirth)[1]',
	'datetime') 
	AS DateOfBirth, 
  ClassAttributes.value('(/person/title)[1]',
	'varchar(10)')
	AS Title
FROM Stakeholder
WHERE     (isPerson 
	= 1)
 
CREATE VIEW 
	Organization
AS
SELECT     
	StakeholderId, Name, 
  ClassAttributes.value('(/organization/organizationTypeId)[1]',
	'int') AS
	OrganizationTypeId,
  ClassAttributes.value('(/organization/registrationNo)[1]',
	'varchar(20)')
	AS RegistrationNo
FROM Stakeholder
WHERE (isOrganization 
	= 1)

The views are an interesting implementation in that from a relational model point of view they are valid relations and the syntax to use them will be standard SQL. Consider the query where we want to search on the name of a stakeholder, but with people we need to query the surname and on organizations we need to query the name. The following query, even though it has XML innards is a perfectly valid and understandable query.

SELECT 
	StakeholderId, Name
FROM Organization
WHERE Name LIKE 
	'S%'
UNION
SELECT 
	StakeholderId, Surname
FROM Person
WHERE Surname
	LIKE 'S%'

There are other ways to query the XML directly using XQuery but I want to stay as close to ANSI 92 syntax as possible.

Even though we are storing non-relational data in our SQL database we don't really break that many relational rules. Technically the relational model states that the storage of data is independent of the model so, the argument that the use of views is non-relational is invalid (sort of) - if [Person] and [Organization] are implemented as views, which are valid relations, then we are not breaking any rules.

By now, any real DBA would be thinking This guy is frikkin insane, it will perform like a dog! This is both a correct and incorrect thought no, I am not frikkin insane and yes, performance can be an issue. I would not recommend this approach if you have a structure with millions of records, which would be the case with the stakeholder structure in large enterprises. But what about structures with fewer rows, even in the tens or hundreds of thousands? Maybe a product classification or retail outlet classification would perform adequately? You may also notice in this example that Name is redundant, since it is contained in the XML anyway this has been done on purpose for performance reasons since most of the queries only want the name which is a common attribute, so there is no point in mucking about with the XML.

Another key aspect with regard to performance is understanding the interfaces. In my particular implementation, if I wanted to create proper fields for the attributes there would be no far-reaching impact. The interfaces to the sprocs wouldn't change and the fact that I may have replaced the Person view with a table would make no difference to existing SQL.

Denormalization of Temporal Data Using XML

Consider a requirement stating "For ordered items, display the part name and the stock on hand". This may be turned into a query something like this:

SELECT o.OrderId, 
	o.PartId, o.Quantity, o.Cost, p.Name, p.StockAvailable
FROM OrderItem o
	INNER JOIN Part p
	ON p.PartId=o.PartId
WHERE OrderId=1

As with most requirements it is not specific enough and should read "For ordered items, display the part name and the stock on hand when the order was placed".

The problem with the above query is that part information, particularly the available stock, changes continuously and the query doesn't take this into account. Many similar problems exist with data that is date dependant (temporal data). Again, there are many ways to resolve this problem you could have [Part] movement records and join based on the order date to the movement tables, or you could denormalize your structures and create an [OrderItem] table with the [Part].[Name] and [Part].[StockAvailable] values in fields on [OrderItem]. The most common approach by far is to do neither and land up with all sorts of issues relating to the temporality of the data which puts the integrity of the entire database in question by the users.

Generally, unless I need to create a specific structure to handle temporality where it may be important I tend to take the easy route and denormalize my structures. The problem is figuring out how many of the fields from [Part] should be stored on [OrderItem] to handle all the various combinations of queries that the users may think up in future. Also, it looks ugly when [Part] fields are reproduced on the [OrderItem] table apart from breaking a few relational model rules along the way.

In a recent system there was a need to store some 'snapshot' temporal data, but since other parts of the system were not specified, never mind developed, we were unsure which fields to store the solution was to store most of them in an XML field and worry about it later.

So in the above example I would create a table something like this:

CREATE TABLE 
	OrderItem(
OrderId int,
PartId int,
Quantity int,
Cost money,
PartStore xml)

With sprocs to handle the inserts and updates,

CREATE PROCEDURE 
	InsertOrderItem
@OrderId int,
@PartId int,
@Quantity int,
@Cost money
AS
DECLARE @PartStore 
	xml
SELECT @PartStore=''+Name+''+CAST(StockAvailable
	AS varchar(10))+''
FROM Part
WHERE PartId=@PartId
 
INSERT INTO 
	OrderItem(OrderId,PartId,Quantity,Cost,PartStore)
VALUES(@OrderId,@PartId,@Quantity,@Cost,@PartStore)

This would create a store of temporal data something like this

<part>
    <name>Part 
	1name>
    <stockAvailable>10stockAvailable>
part>

When querying data I simply look in the XML for the part data that I need,

SELECT OrderId, 
	PartId, Quantity, Cost,
  PartStore.value('(/part/name)[1]',
	'varchar(50)')
	AS Name,
  PartStore.value('(/part/stockAvailable)[1]',
	'int') AS
	StockAvailable
FROM OrderItem
WHERE OrderId=1

And as per the classification examples above, I can wrap the queries into nicely named views if I prefer.

The plan is that over time, provided my interfaces remain the same, I can add Part attributes directly to the OrderItem table if needed. This can be done on a production database and I would just need to alter the table,

ALTER TABLE 
	OrderItem ADD PartName
	varchar(50)
 
UPDATE OrderItem
SET PartName=PartStore.value('(/part/name)[1]',
	'varchar(50)')

and change any sprocs or views that reference the table all very backwards compatible.

Download the code

Summary

A year ago I would have recoiled at the mere suggestion of using XML directly in the database, thinking that it was the domain of junior asp developers who can't tell a relation from a relationship. Over the last few months the XML Temptress in SQL 2005 has lured me into her parlour providing a mechanism to approach old problems in new ways.

I would still be in line violently opposing persisting objects as XML in the database as object oriented bigots would be tempted to do, after all, they say that the database is just a persistence mechanism for their objects. I can picture object orientation bigots advocating a database of one table with two attributes, ObjectId (GUID) and ObjectData (XML) such an image is concerning.

However, I hope that I have demonstrated that if carefully thought through that XML in the database can be useful and elegant provided that it is done within the context of the overall architecture. The biggest issue with XML in the database is understanding when performance becomes the overriding factor as to where and how data is stored after all it is mostly the performance of databases that your DBA has to deal with anyway.

Simon Munro

Simon Munro is currently pursuing the Microsoft Certified Architect (MCA) certification and maintains a blog at http://www.delphi.co.za/ that covers many interesting ideas on using Microsoft technologies.

The SQL Server 2005 XML Temptress

SQL 2005 XML for Classifications

Denormalization of Temporal Data Using XML

Summary