Faster Slowly Changing Dimension with Hash Function - Bob Blackburn


By Bob Blackburn

The first time you had to load a Slowly Changing Dimension table with SSIS you probably thought no problem. There is a transformation for that. Let us take a look at the pros and cons and see how it compares to using the hash function.

Why use an SCD Transformation?
        External Feed
       Legacy system with no timestamp
       Change Data Capture not available

SCD Transformation
Pros:
        Wizard creates necessary code.
 Cons:
        Customization has to be done again if wizard is rerun.
       Emphasis is on Slow.

SCD with Hash Function
Pros:
       Very Fast
       Customizable
Cons:
       Initial Setup has a few more steps then the wizard

Test Data
        Source table with 64,000 records of name and address data.
       1344 updates (2%)
       32 new records (0.05%)
       Type 1 SCD. Change data, no history.
       Staging table with new source records
       Target table to be updated


SCD Transformation test

Execute the wizard to map input and output tables.








Execution time:  Almost 23 minutes.


SCD with Hash Setup
        Add Hashcode nvarchar(255) to target table
       Update hashcode with hash of data fields
       Create an empty update table in Staging DB that is a copy of the target table.

Hash Function

hashcode = hashbytes('SHA1', FirstName + LastName + Address1 + StateCode + city
+ cast(zipcode as nvarchar(10)) + comment)

Target table already contains Hash Code for each row:








When selecting from the source table, calculate the Hash Code.

SELECT CustID
      ,FirstName
      ,LastName
      ,Address1
      ,StateCode
      ,City
      ,ZipCode
      ,Comment
      ,cast(hashbytes('SHA1', FirstName + LastName + Address1 + StateCode
                   + city + cast(zipcode as nvarchar(10)) + comment) as nvarchar(255)) as hashcodealpha
  FROM W_Customer
order by custID

Control flow for hash SCD:










Data Flow for Hash SCD:

In the Lookup transformation, get ID and Hashcode from production table. If error (no ID in production table) perform and insert.

In the Condition Split transformation, if the hashcodes are different write to update table. If they are the same ignore record.

Execute the update script for all changed records:





















Update Statement

UPDATE T_Customer
   SET FirstName = u.FirstName
      ,LastName = u.LastName
      ,Address1 = u.Address1
      ,StateCode = u.StateCode
      ,City = u.City
      ,ZipCode = u.ZipCode
      ,Comment = u.Comment
      ,hashcode = hashbytes('SHA1', u.FirstName + u.LastName + u.Address1 +        u.StateCode  + u.city + cast(u.zipcode as nvarchar(10)) + u.comment)
 from t_customer t
 join u_customer u
   on t.custid = u.custid


Hash SCD Results:


































Execution time:  just over 3 seconds

Comparison

As you can see, a little extra setup with an update table and writing a hash function can increase your throughput almost 400 times.  On a small table the update went from 23 minutes down to 3 seconds. The hash function also reduces the number of fields you have to compare to one. If you have a very wide dimension table, the proportional savings will be even greater.

Give it a try. It will provide immediate benefits.