A common application requirement is support for searches based on some lengthy string – or, even worse, a concatenation of a string and one or two other values, e.g., customer_name + start_date + stop_date. In a small table, you might not notice the impact. But suppose the table of interest contains 50 million rows. You will notice the impact – in both storage requirements and in search performance.
You don’t have to support such searches directly. Even though you accept these criteria as your input values, you don’t have to search 50 million rows to find the matching rows. There is a very slick alternative, known as “hash buckets” or “hash keys”.
A hash is an integer produced by applying an algorithm to a given string: you feed the algorithm the string of interest, and it returns an integer.
Hash functions do not guarantee that their hash values are unique. Rather, they simply reduce the input value to an integer. For example, an index based on customer_name (varchar(40)) + start_date (datetime) + stop_date (datetime) will occupy 40 + 8 + 8 = 56 bytes per row (a datetime column occupies eight bytes). 50 million rows times 56 bytes yields 2,800,000,000 bytes.
Granted, in order to house 50 million rows of even the most basic data, you need serious hardware. But this index imposes a huge storage penalty – not to mention the performance cost of searching it.
Hash buckets to the rescue!
A hash bucket “hashes” the value of its input, returning an integer. Integer searches are very fast. Therefore, when confronted with indexing a very large number of rows on such a large index key, an excellent solution is to hash the key rather than create an index based literally on it.
Let’s explore this concept. The hash-bucket index corresponding to the example key described above would occupy 50 million values times four bytes each = 200,000,000 bytes – a small fraction of the size of the original index.
The former index is guaranteed to be unique, but look at its cost. The latter index is not guaranteed to be unique. Rather, it uses the hash-bucket value to radically reduce the number of candidate rows. The index on the hash-bucket will typically contain duplicates, but using it as the first criterion reduces the range of visited rows dramatically.
Now the question becomes: how do you generate a hash key? SQL Server provides an easy way in the function Checksum(). Checksum() accepts a collection of parameters and does its work, returning an integer.
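For instance, Checksum() can be called directly with literal values (the values here are invented purely for illustration):

```sql
-- Checksum() accepts one or more expressions of almost any type
-- and collapses them into a single four-byte integer.
SELECT Checksum('Fred Smith', '2005-01-01', '2005-12-31') AS HashBucket;
```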
Try it out using the AdventureWorks sample database that ships with SQL Server 2005. Run this query:
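The original query did not survive in this copy of the article; a plausible reconstruction, assuming the HumanResources.Department table (whose Name and GroupName columns match the parameters discussed below), might look like this:

```sql
-- Assumed reconstruction: hash the Name and GroupName columns
-- of HumanResources.Department into a single integer hash bucket.
SELECT Name,
       GroupName,
       Checksum(Name, GroupName) AS HashBucket
FROM HumanResources.Department;
```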
This results in the following rows (clipped to 10 for brevity):
In this case, given such a small table, the results happen to be unique, but there is no guarantee of uniqueness in calling the Checksum() function. Therefore, you should assume duplicates. But this is not a problem. Suppose the above query is written as a stored procedure which accepts two arguments, @Name and @GroupName. Suppose that the hash-bucket is indexed. Then imagine this:
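The stored procedure the article has in mind might be sketched as follows – the procedure, table, and column names are assumptions, but the shape of the WHERE clause is the point: seek on the indexed integer first, then let the exact predicates weed out any hash collisions.

```sql
-- Hypothetical sketch of the hash-bucket search pattern.
CREATE PROCEDURE GetDepartment
    @Name      nvarchar(50),
    @GroupName nvarchar(50)
AS
BEGIN
    SELECT Name, GroupName
    FROM HumanResources.Department
    WHERE HashBucket = Checksum(@Name, @GroupName)  -- fast integer seek
      AND Name = @Name                              -- resolve any collisions
      AND GroupName = @GroupName;
END
```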
Conceivably, the specified hash-key value will result in duplicate rows. However, this integer search will be very fast, and the additional qualifiers will then reduce the result-set to a single row. So in 50 million rows, you might initially visit 10 rows, then reduce the 10 to 1 row.
Chances are, you do not have a table handy that contains 50 million rows. But you can exploit this technique with many fewer rows than this example presents.
The final thing to add to this scenario is that you don’t even have to store the hash keys! Instead, create a computed column whose value is the hash of your columns, then index the computed column. This results in lightning-quick searches of that index as the starting point of your search, which is then refined by the additional qualifiers. The gain in performance is truly dramatic.
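Continuing with the assumed table from earlier, the computed-column approach might be sketched like this (the index name is invented; Checksum() is deterministic, so the computed column is indexable):

```sql
-- Sketch: add a computed column that hashes the search columns,
-- then index it. SQL Server computes the value on the fly.
ALTER TABLE HumanResources.Department
    ADD HashBucket AS Checksum(Name, GroupName);

CREATE INDEX IX_Department_HashBucket
    ON HumanResources.Department (HashBucket);
```

Queries that filter on `Checksum(@Name, @GroupName)` can then seek on this index without the table ever physically storing the hash values.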