Point-in-time architecture

from the Artful SQL Server & Access Tips List


Point in Time Architecture (PITA) is a database design that supports two related yet distinct concepts: History, and Audit Trails. By History we mean all the information, current and historical, that we now believe to be true. By Audit Trail we mean all the information recorded in the database (and thus believed to be true) at some previous point in time. The distinction is that the Audit Trail shows the history of all database changes.

If you read the database literature on modeling time in databases, you'll find two sets of naming conventions. Richard Snodgrass calls the keeping of accurate history valid time (or temporal validity); he calls the capacity to reconstruct previous database states transaction time; and he calls the combination of the two bitemporal architecture. We follow Ralph Kimball here, not because we know a knockdown proof that he's right, but because his usage is popular in the world of SQL Server. (In our chapter on Time and MySQL we follow Snodgrass.)

Support for History and Audit Trails is notably absent from typical OLTP databases. By "typical", we mean databases that support the traditional Select, Insert, Delete and Update operations. These typical OLTP databases destroy data. This is most obvious with the Delete command, but a moment's thought reveals that the Update command is equally destructive. When you update a row in a table, you lose the pre-update values. The core PITA concept is: no information is ever physically lost from the database.

Some updates are more important than others. In all likelihood, the data modeler, DBA, or SQL programmer will not know which updates are important and which unimportant without consultation with the principal stakeholders. A mere spelling error in a person's surname may be deemed unimportant. Unfortunately, there is no way to distinguish a spelling error from a change in surname. A correction to a telephone number may be deemed trivial, but again there is no way to distinguish it from a changed number. What changes are worth documenting, and what other changes are deemed trivial? There is no pat answer.

The Insert statement is indirectly destructive. Suppose you insert ten rows into some table today. Unless you have a column called DateInserted, or similar, then you have no way to present the table as it existed yesterday.

What is Point-In-Time Architecture?

PITA is a database design that works around these problems. As its name implies, PITA attempts to deliver a transactional database that can be rolled back to any previous point in time. I use the term "rolled back" metaphorically: traditional data restorations from backups are unacceptable for this purpose, and traditional rollbacks apply only to points declared within a transaction.

A PITA system must be able to present an image of the database as it existed at any previous instant, without destroying the current image. Here 'instant' means the smallest datetime delta supported by the datetime datatype used to track changes.

Think of it this way: a dozen users are simultaneously interrogating the database, each interested in a different point in time. UserA wants the current database image; UserB wants the image as it existed on the last day of the previous month; UserC is interested in the image of the last day of the previous business quarter; and so on.

Requirements of PITA

Most obviously, physical DELETEs are forbidden. Also, INSERTs must be flagged in such a way that we know when they occurred. Physical UPDATEs are also forbidden; otherwise we lose the image of the rows of interest prior to the Update.

What do we need to know?
  • Who inserted a row, and when
  • Who replaced a row, and when
  • What did the replaced row look like before its replacement?
We can track which rows were changed when in our PITA system by adding standard PITA columns to all tables of PITA interest. I suggest the following:
  • DateCreated – the datetime when the row was inserted
  • DateEffective – the datetime when the row became effective
  • DateEnd – the datetime when the row became inoperative. Some writers recommend implementing 'still in effect' as the largest possible future datetime value, which avoids query slowdowns due to NULLs, but injects arbitrary data. Others recommend implementing 'still in effect' as DateEnd=NULL. I adopt that convention here.
  • DateReplaced – the datetime when the row was replaced by another row
  • OperatorCode – the unique identifier of the person (or system) that created the row
Notice that we have both a DateCreated column and a DateEffective column, and these could differ: for example, when a settlement between a company and a union guarantees specific wage increases effective on a series of dates. We might know a year or two in advance that certain wage increases will kick in on specific dates, and therefore add the row well before its DateEffective. By distinguishing DateCreated from DateEffective, we handle this situation cleanly.
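To make the distinction concrete, here is a sketch of recording a wage increase well before it takes effect. The WageRates table and its columns are illustrative, not from any real schema:

```sql
-- Hypothetical WageRates table: the row is recorded today (DateCreated
-- defaults or is set to GETDATE()), but the rate does not become
-- effective until 01-Jan-2008.
INSERT INTO dbo.WageRates
    (EmployeeClass, HourlyRate, DateCreated, DateEffective,
     DateEnd, DateReplaced, OperatorCode)
VALUES
    ('Machinist', 28.50, GETDATE(), '2008-01-01',
     NULL, NULL, 'jlarue')
```

A point-in-time query dated before 01-Jan-2008 would exclude this row via its DateEffective, even though it is already physically present.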

Dealing with inserts

The easiest command to deal with is Insert. Here, we simply make use of our DateCreated column, using either a Default value or an Insert trigger to populate it. Thus, to view the data as it stood at a given point in time, you would perform the Select using the following syntax:

SELECT * 
FROM AdventureWorks.Sales.SalesOrderHeader
WHERE DateCreated <= [some PITA date of interest]

This scenario is fine and dandy if you are creating the table in question. But you may be called upon to backfill some existing tables.

If you are retrofitting a database to support PITA, then you won't be able to use a Default value to populate the existing rows. Instead you will have to update the existing rows to supply some value for them, perhaps the date on which you execute the Update command. To that extent, all these values will be false, but at least they give you a starting point. Once the DateCreated column has been populated for all existing rows, you can alter the table, either supplying a Default value for the column or adding an Insert trigger, so that new rows acquire their DateCreated values automatically.
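As a sketch, the retrofit might look like this, assuming the table previously lacked the column entirely (the table name is illustrative):

```sql
-- Add the column as nullable first, since existing rows have no value
ALTER TABLE dbo.SalesOrderHeader ADD DateCreated datetime NULL
GO

-- Backfill: every existing row gets today's date --
-- admittedly false, but a starting point
UPDATE dbo.SalesOrderHeader SET DateCreated = GETDATE()
GO

-- Now tighten the column and supply the default for future inserts
ALTER TABLE dbo.SalesOrderHeader
    ALTER COLUMN DateCreated datetime NOT NULL
GO

ALTER TABLE dbo.SalesOrderHeader
    ADD CONSTRAINT DF_SalesOrderHeader_DateCreated
    DEFAULT (GETDATE()) FOR DateCreated
GO
```

The GO separators matter: the column must exist before the UPDATE batch that references it will compile.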

Dealing with deletes

In a PITA architecture, no rows are physically deleted. We introduce the concept of a "logical delete". We visit the existing row and flag it as "deleted on date z" by updating its DateEnd column with the date on which the row was "deleted". We do not delete the actual row, but merely identify it as having been deleted on a particular date. All Select statements interrogating the table must then observe the value in this column.
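A logical delete, then, is just an Update of the DateEnd column. A minimal sketch, using the test table defined later in this article:

```sql
-- Instead of DELETE FROM dbo.Test_PITA_Table WHERE TestTablePK = @key,
-- we flag the row as logically deleted:
UPDATE dbo.Test_PITA_Table
   SET DateEnd = GETDATE()
 WHERE TestTablePK = @key
   AND DateEnd IS NULL   -- don't re-delete an already-deleted row
```

The guard on DateEnd IS NULL preserves the original deletion date if the statement is accidentally run twice.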

SELECT * 
FROM AdventureWorks.Sales.SalesOrderHeader
WHERE DateEnd IS NULL
   OR DateEnd > [PITA_date]

Any row logically deleted after our PITA date of interest is therefore assumed to have logically existed up to our date of interest, and ought to be included in our result set.

Dealing with updates

In PITA, updates are the trickiest operation. No rows are actually updated (in the traditional sense of replacing the current data with new data). Instead, we perform three actions:
  • Flag the existing row as "irrelevant after date x"
  • Copy the values of the existing row to a temporary buffer
  • Insert a new row, copying most of its values from the old row (those that were not changed) and using the new values for those columns that were changed. We also supply a new value for the column DateEffective (typically GetDate(), though not always, as described previously).
There are several ways to implement this functionality. I chose the Instead-Of Update trigger. Before investigating the code, let's describe the requirements:
  • We must update the existing row so that its DateReplaced value reflects GetDate() (or GetUTCDate()). Its DateEnd value might be equal to GetDate(), or not; business logic will decide this question.
  • The Deleted and Inserted tables give us the values of the old and new rows, enabling us to manipulate the values.
Here is the code to create a test table and the Instead-Of trigger we need. Create a test database first, and then run this SQL:

CREATE TABLE [dbo].[Test_PITA_Table](
   [TestTablePK] [int] IDENTITY(1,1) NOT NULL,
   [TestTableText] [varchar](50)
       COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
   [DateCreated] [datetime] NOT NULL
       CONSTRAINT [DF_Test_PITA_Table_DateCreated]
                                       DEFAULT (getdate()),
   [DateEffective] [datetime] NOT NULL,
   [DateEnd] [datetime] NULL,
   [OperatorCode] [varchar](50)
       COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
   [DateReplaced] [datetime] NULL,
 CONSTRAINT [PK_Test_PITA_Table] PRIMARY KEY CLUSTERED
(
   [TestTablePK] ASC
) WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

Note that DateReplaced deliberately has no default: a freshly inserted row has not been replaced, so the column must begin as NULL.
Here is the trigger:

CREATE TRIGGER [dbo].[Test_PITA_Table_Update_trg]
   ON [dbo].[Test_PITA_Table]
   INSTEAD OF UPDATE
AS
   SET NOCOUNT ON

   -- Flag the existing row(s) as replaced; joining on Deleted,
   -- rather than selecting a single key into a variable,
   -- handles multi-row updates correctly
   UPDATE t
      SET DateEnd = GetDate(), DateReplaced = GetDate()
      FROM dbo.Test_PITA_Table t
      JOIN Deleted d ON t.TestTablePK = d.TestTablePK

   -- Insert the replacement row(s), carrying the new values
   INSERT INTO dbo.Test_PITA_Table
      (TestTableText, DateCreated, DateEffective, OperatorCode, DateReplaced)
   SELECT TestTableText, GetDate(), GetDate(), OperatorCode, NULL
      FROM Inserted

A real-world example would involve more columns, but I kept it simple so the operations would be clear. With our underpinnings in place, open the table and insert a few rows. Then go back and update one or two of those rows.
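For example, the following script inserts a row and then updates it, which fires the Instead-Of trigger. The literal key value assumes this is the first row inserted, so its identity value is 1:

```sql
INSERT INTO dbo.Test_PITA_Table (TestTableText, DateEffective, OperatorCode)
VALUES ('first version', GETDATE(), 'jlarue')

UPDATE dbo.Test_PITA_Table
   SET TestTableText = 'second version'
 WHERE TestTablePK = 1

-- The original row should now carry DateEnd/DateReplaced values,
-- while the replacement row carries the new text and a NULL DateEnd.
SELECT * FROM dbo.Test_PITA_Table
```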

Dealing with selects

Every Select statement must take into account the dates just described, so that a query which is interested in, say, the state of the database as it appeared on December 24, 2006, would:
  • Exclude all data inserted or updated since that day
  • Include only data as it appeared on that day. Deletes that occurred prior to that date would be excluded
  • In the case of updated rows, we would be interested only in the last update that occurred prior to the date of interest
This may be trickier than it at first appears. Suppose a given row in a given table has been updated three times prior to the point in time of interest. We'll need to examine all the remaining rows to determine whether any of them were updated or deleted during this time frame, and if so, exclude the logical deletes and include only the relevant logical updates.

With our standard PITA columns in place this may not be as tricky as it first sounds. Remember that at any particular point in time, the rows of interest share the following characteristics:
  • DateCreated is less than or equal to the PITA date of interest.
  • DateEffective is less than or equal to the PITA date.
  • DateEnd is either null or greater than the PITA date.
  • DateReplaced is either null or greater than the PITA date.
So for our row that has been updated three times prior to the PITA date:
  • The earlier versions will have a DateEnd and a DateReplaced that are not null, and both will be less than the PITA date.
  • The latest version will have a DateEffective less than the PITA date, and a DateReplaced that is either null or greater than the PITA date.
So we can always query out the rows of interest without having to examine columns of different names, but rather always using the same names and the same semantics.
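Putting those four tests together, a point-in-time Select against our test table can be sketched as follows, where @PITADate is the date of interest:

```sql
DECLARE @PITADate datetime
SET @PITADate = '2006-12-24'

SELECT *
  FROM dbo.Test_PITA_Table
 WHERE DateCreated   <= @PITADate
   AND DateEffective <= @PITADate
   AND (DateEnd      IS NULL OR DateEnd      > @PITADate)
   AND (DateReplaced IS NULL OR DateReplaced > @PITADate)
```

Exactly this WHERE clause, with the table name swapped in, works against any table that carries the standard PITA columns.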

PITA implementation details

The most important thing to realize is that it may not be necessary to trace the history of every column in a table. First of all, some columns, such as surrogate IDs, assigned dates (e.g. OrderDate), and columns such as BirthDate, will never be changed (other than for corrections). Another example is TelephoneNumber. In most applications, it is not significant that your telephone number changed twice in the past year: what we care about is your current number. Admittedly, some organizations may attach significance to these changes of telephone number. That is why we can only suggest a rule of thumb rather than an iron-clad rule. The stakeholders in the organization will help you decide which columns are deemed "unimportant".
What then qualifies as an important column? The rule of thumb is that important columns are those that have changeable attributes, and whose changes have significance.

Columns with changeable attributes are often called Slowly Changing Dimensions (SCDs). However, just because an attribute value changes, that doesn't imply that the change is significant to the organization. There are two types of SCD:
  • Type 1 – columns where changes are of little or no interest to the organization
  • Type 2 – columns where changes must be tracked and history recorded.
An obvious example of a Type 2 SCD is EmployeeDepartmentID. Typically, we would want to be able to trace the departments for which an employee has worked. But again, this may or may not be important to a given organization. What we can say is this: it is rarely the case that all columns within a table are considered Type 2.

Once you define Type 1 and Type 2 columns, you can devise the program objects required to handle both types. The Type 1 code won't bother with logical updates; it will perform a simple update, replacing the old value with the new one and not documenting the change in detail. The Type 2 code will follow the rules for logical updates and deletes.
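One way to sketch that branching inside an Instead-Of Update trigger: test whether a Type 2 column is being set, and fall back to a plain in-place update otherwise. The Persons table, its Surname (Type 2) and Phone (Type 1) columns are illustrative assumptions:

```sql
-- Fragment from the body of an INSTEAD OF UPDATE trigger on a
-- hypothetical dbo.Persons table. Caveat: UPDATE(col) is true when
-- the column appears in the SET list, not when its value actually
-- differs; compare Inserted to Deleted if you need that precision.
IF UPDATE(Surname)   -- a Type 2 column: perform a logical update
BEGIN
   UPDATE p
      SET DateEnd = GETDATE(), DateReplaced = GETDATE()
      FROM dbo.Persons p JOIN Deleted d ON p.PersonID = d.PersonID

   INSERT INTO dbo.Persons (Given, Surname, Phone, DateEffective, OperatorCode)
   SELECT Given, Surname, Phone, GETDATE(), OperatorCode FROM Inserted
END
ELSE                 -- only Type 1 columns changed: overwrite in place
BEGIN
   UPDATE p
      SET Phone = i.Phone
      FROM dbo.Persons p JOIN Inserted i ON p.PersonID = i.PersonID
END
```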

Using domains

Depending on the development tools you use, you may or may not be able to take advantage of domains. (I am a big fan of ERwin and PowerDesigner, and almost never develop a data model without using them, except for the most trivial problems.)
In terms of data modeling, a domain is like a column definition, except that it is not tied to a table. You create a collection of domains, specifying their default values, descriptions, check constraints, nullability and so on, without reference to any given table. Then, when you create individual tables, instead of supplying a built-in data type for a column, you specify its domain, thus "inheriting" all the attributes you defined earlier. The less obvious gain is that should you need to change the domain definition (for example from int to bigint, smalldatetime to datetime, or varchar(10) to char(10)), you make the change in exactly one place and then forward-engineer the database. All instances of the domain in all tables will be updated to correspond to the new definition. In a database comprising hundreds of tables, this approach can be a huge time-saver.

Although I love domains, I have found one problem with them. In my opinion, there ought to be two kinds of domains, or rather a double-edged domain. Consider a domain called CustomerID. Clearly, its use in the Customers table as a PK is different from its use in various related tables as an FK. In the Customers table it might be an int with Identity(1,1), whereas in the related tables it will still be an int, but not an identity key. To circumvent this problem, I typically create a pair of domains, one for the PK and another for all instances of the domain as an FK.
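SQL Server itself offers a rough analogue of modeling-tool domains in alias data types. The type carries the base type and nullability, but the Identity property still has to be declared per column, which echoes the PK/FK split described above. A sketch, with illustrative names:

```sql
-- A poor-man's domain: one alias type reused for both the PK
-- and every FK that references it
CREATE TYPE dbo.CustomerKey FROM int NOT NULL
GO

CREATE TABLE dbo.Customers (
   CustomerID dbo.CustomerKey IDENTITY(1,1) PRIMARY KEY,  -- identity declared here
   CustomerName varchar(50) NOT NULL
)

CREATE TABLE dbo.Orders (
   OrderID int IDENTITY(1,1) PRIMARY KEY,
   CustomerID dbo.CustomerKey                             -- same type, no identity
      REFERENCES dbo.Customers (CustomerID)
)
```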

Sample transactions

There is no recipe for creating a PITA database. However, using the examples provided here, plus some clear thinking and a standard approach, you can solve the problems and deliver a fully compliant PITA system.

Assume a Hollywood actress who marries frequently, and who always changes her surname to match her husband's. In a PITA, her transaction record might look like this:

Table 1: Persons table with PITA columns and comment

| PersonID | Given | Surname | DateCreated | DateEffective | DateEnd   | DateReplaced | OperatorCode | Description                      |
|----------|-------|---------|-------------|---------------|-----------|--------------|--------------|----------------------------------|
| 1234     | Mary  | O'Hara  | 01-Jan-04   | 01-Jan-04     |           |              | jlarue       | Prior to Wedding                 |
| 1234     | Mary  | O'Hara  | 01-Jan-04   | 01-Jan-04     | 01-Jun-04 | 01-Jun-04    | jlarue       | Ends Maiden Name                 |
| 2345     | Mary  | Roberts | 01-Jun-04   | 02-Jun-04     |           |              | bhoskins     | Adopts New Surname               |
| 2345     | Mary  | Roberts | 01-Jun-04   | 02-Jun-04     | 12-Dec-05 | 12-Dec-05    | cwebb        | Ends Marriage                    |
| 3456     | Mary  | Kent    | 12-Dec-05   | 12-Dec-05     |           |              | cwebb        | Remarries and adopts new Surname |
| 3456     | Mary  | Kent    | 12-Dec-05   | 12-Dec-05     | 06-Jun-06 | 06-Jun-06    | bhoskins     | Ends Marriage                    |
| 4567     | Mary  | Clark   | 06-Jun-06   | 07-Jun-06     |           |              | bhoskins     | Remarries and adopts new Surname |

Here we have the history of Mary's three marriages. Mary O'Hara entered the database on 01-Jan-04. In June of the same year she adopted, through marriage, the surname Roberts. This is reflected in our PITA database with the appropriate value inserted into the DateEnd and DateReplaced columns of Mary's row. We then insert a new row into the Persons table, with a new PersonID value, the updated surname and the correct DateCreated and DateEffective values. This process is repeated for each of Mary's subsequent marriages, so we end up with four rows in the Persons table, all referring to the same "Mary".

These four Primary Keys all point to the same woman; her surname has simply changed at various points in time. To this point, we have considered History as referring to the history of changes within the tables. This example, however, illustrates another concept of history: the history of a given object (in this case, a person) within the database. Some applications may not need to know this history, while others may consider it critical. Medical and police databases come immediately to mind. If all a criminal had to do to evade his history was change his surname, we would have problems in the administration of justice.

One might handle this problem by adding a column to the table called PreviousPK, and insert in each new row the PK of the row it replaces. This approach complicates queries unnecessarily, in my opinion. It would force us to walk the chain of PreviousPKs to obtain the history of the person of interest. A better approach, I think, would be to add a column called OriginalPK, which may be NULL. A brand-new row would contain a null in this column, while all subsequent rows relating to this person would contain the original PK. This makes it trivial to tie together all instances. We can then order them using our other PITA columns, creating a history of changes to the information on our person of interest.
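With OriginalPK in place, assembling a person's full history becomes a single query. Note that the original row itself carries a NULL OriginalPK, so we match on PersonID as well:

```sql
SELECT *
  FROM dbo.Persons
 WHERE PersonID = 1234      -- the original row (OriginalPK is NULL there)
    OR OriginalPK = 1234    -- all its successors
 ORDER BY DateEffective, DateCreated
```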

Table 2: Persons Table with PITA and Original PK tracking column

| PersonID | Given | Surname | DateCreated | DateEffective | DateEnd   | DateReplaced | OperatorCode | OriginalPK |
|----------|-------|---------|-------------|---------------|-----------|--------------|--------------|------------|
| 1234     | Mary  | O'Hara  | 01-Jan-04   | 01-Jan-04     |           |              | jlarue       |            |
| 1234     | Mary  | O'Hara  | 01-Jan-04   | 01-Jan-04     | 01-Jun-04 | 01-Jun-04    | jlarue       |            |
| 2345     | Mary  | Roberts | 01-Jun-04   | 02-Jun-04     |           |              | bhoskins     | 1234       |
| 2345     | Mary  | Roberts | 01-Jun-04   | 02-Jun-04     | 12-Dec-05 | 12-Dec-05    | cwebb        | 1234       |
| 3456     | Mary  | Kent    | 12-Dec-05   | 12-Dec-05     |           |              | cwebb        | 1234       |
| 3456     | Mary  | Kent    | 12-Dec-05   | 12-Dec-05     | 06-Jun-06 | 06-Jun-06    | bhoskins     | 1234       |
| 4567     | Mary  | Clark   | 06-Jun-06   | 07-Jun-06     |           |              | bhoskins     | 1234       |

Given the Point-In-Time 21-Dec-2005, the row of interest is the penultimate row: the row whose DateEffective value is 12-Dec-2005 and whose DateEnd is 06-Jun-2006. How do we identify this row?

SELECT * FROM Persons
WHERE OriginalPK = 1234
AND DateEffective <= '21-Dec-2005'
AND (DateEnd IS NULL OR DateEnd > '21-Dec-2005')

The DateEffective value must be less than or equal to 21-Dec-2005, and the DateEnd value must be either NULL or greater than 21-Dec-2005.

Dealing with cascading updates

Let us now suppose that during the course of our history of Mary O'Hara, she changed addresses several times. Her simple changes of address are not in themselves problematic; we just follow the principles outlined above for the PersonAddresses table. If her changes of address correspond to her marriages, however, the waters muddy slightly, because this implies that she has changed both her name and her address. But let's take it one step at a time.

Mary moves from one flat to another, with no other dramatic life changes. We stamp her current PersonAddresses row with a DateEnd and DateReplaced (which, again, might differ). We insert a new row in PersonAddresses, marking it with her current PK from the Persons table and adding the new address data. We mark it with a DateEffective corresponding to the lease date, and leave the DateEnd and DateReplaced null. Should her surname change within the scope of this update, then we mark her row in Persons with a DateEnd and a DateReplaced, then insert a new row reflecting her new surname.

Then we add a new row to PersonAddresses, identifying it with Mary's new PK from Persons, and filling in the rest of the data.

Each time Mary's Persons row is logically updated, producing a new row with a new PK, we must logically update the dependent row(s) in PersonAddresses, inserting new rows that reference the new PK in the Persons table. The same applies to every other table in our database that relates to Persons. Fortunately, we can trace the history of Mary's addresses using the Persons table.

In more general terms, the point to realize here is that every time a Type 2 update occurs in our parent table (Persons, in this case), a corresponding Type 2 update must occur in every related table. How complex these operations will be clearly depends on the particular database and its requirements. Again, there is no hard-and-fast rule to decide this.
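As a sketch, the cascade for Mary's addresses might look like this. The column names are illustrative; @OldPersonID is the Persons PK being replaced, @NewPersonID its replacement:

```sql
-- Re-insert the currently active address rows under the new Persons PK...
INSERT INTO dbo.PersonAddresses
   (PersonID, Street, City, DateEffective, OperatorCode)
SELECT @NewPersonID, Street, City, GETDATE(), OperatorCode
  FROM dbo.PersonAddresses
 WHERE PersonID = @OldPersonID AND DateEnd IS NULL

-- ...then close off the rows that reference the replaced PK
UPDATE dbo.PersonAddresses
   SET DateEnd = GETDATE(), DateReplaced = GETDATE()
 WHERE PersonID = @OldPersonID AND DateEnd IS NULL
```

The insert runs first so that its SELECT still sees the rows with a NULL DateEnd; reversing the order would leave nothing to copy.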

Dealing with cascading deletes

A logical delete is represented in PITA as a row containing a not-null DateEnd and a null DateReplaced. Suppose we have a table called Employees. As we know, employees come and go. At the same time, their IDs are probably FKs into one or more tables. For example, we might track SalesOrders by EmployeeID, so that we can pay commissions. A given employee departs the organization. That certainly does not mean that we can delete the row. So we logically delete the row in the Employees table, giving it a DateEnd that will exclude this employee from any lists or reports whose PITA date is greater than said date – and thus preserving the accuracy of lists and reports whose PITA date is prior to the employee's departure.

On the other hand, suppose that our firm sells products from several vendors, one of whom goes out of business. We logically delete the vendor as described above, and perhaps we logically delete all the products we previously purchased from said vendor.

Note: There is a tiny glitch here, beyond the scope of this article, but I mention it because you may have to consider what to do in this event. Suppose you still have several units on hand that were purchased from this vendor. You may want to postpone those related deletes until the inventory has been sold. That may require code to logically delete those rows whose QuantityOnHand is zero, and later on to revisit the Products table occasionally until all this vendor's products have been sold. Then you can safely logically delete those Products rows.
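That deferred cleanup can be sketched as a statement run periodically, assuming hypothetical VendorID and QuantityOnHand columns on a Products table:

```sql
-- Logically delete this vendor's products only once their stock
-- is exhausted; rows with remaining inventory are left alone
UPDATE dbo.Products
   SET DateEnd = GETDATE()
 WHERE VendorID = @DefunctVendorID
   AND QuantityOnHand = 0
   AND DateEnd IS NULL
```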

Summary

The first time you confront the challenge of implementing Point in Time Architecture, the experience can be quite daunting. But it is not rocket science. I hope that this article has illuminated the steps required to accomplish PITA. As pointed out above, some applications may require the extra step of tracking the history of individual objects (such as Persons), while others may not need this. PITA is a general concept. Domain-specific implementations will necessarily vary in the details. This article, I hope, will serve as a practical guideline. I emphasize that there are rarely hard-and-fast rules for implementing PITA. Different applications demand different rules, and some of those rules will only be discovered through careful interrogation of the stakeholders. You can do it!
