The first time you had
to load a Slowly Changing Dimension table with SSIS you probably thought no
problem. There is a transformation for that. Let us take a look at the pros and
cons and see how it compares to using the hash function.
Why
use an SCD Transformation?
•
External
Feed
•
Legacy
system with no timestamp
•
Change
Data Capture not available
SCD
Transformation
Pros:
•
Wizard
creates necessary code.
Cons:
•
Customization
has to be done again if wizard is rerun.
•
Emphasis
is on Slow.
SCD
with Hash Function
Pros:
•
Very
Fast
•
Customizable
Cons:
•
Initial
Setup has a few more steps then the wizard
Test
Data
•
Source
table with 64,000 records of name and address data.
•
1344
updates (2%)
•
32
new records (0.05%)
•
Type
1 SCD. Change data, no history.
•
Staging
table with new source records
•
Target
table to be updated
SCD
Transformation test
Execute the wizard to
map input and output tables.
Execution time: Almost 23 minutes.
SCD
with Hash Setup
•
Add
Hashcode nvarchar(255) to target table
•
Update
hashcode with hash of data fields
•
Create
an empty update table in Staging DB that is a copy of the target table.
Hash Function
hashcode =
hashbytes('SHA1', FirstName + LastName + Address1 + StateCode + city
+ cast(zipcode as
nvarchar(10)) + comment)
Target table already
contains Hash Code for each row:
When selecting from the
source table, calculate the Hash Code.
SELECT CustID
,FirstName
,LastName
,Address1
,StateCode
,City
,ZipCode
,Comment
,cast(hashbytes('SHA1', FirstName +
LastName + Address1 + StateCode
+ city + cast(zipcode as
nvarchar(10)) + comment) as nvarchar(255)) as hashcodealpha
FROM W_Customer
order by custID
Control flow for hash
SCD:
Data Flow for Hash SCD:
In the Lookup
transformation, get ID and Hashcode from production table. If error (no ID in
production table) perform and insert.
In the Condition Split
transformation, if the hashcodes are different write to update table. If they
are the same ignore record.
Execute the update
script for all changed records:
Update Statement
UPDATE T_Customer
SET FirstName = u.FirstName
,LastName = u.LastName
,Address1 = u.Address1
,StateCode = u.StateCode
,City = u.City
,ZipCode = u.ZipCode
,Comment = u.Comment
,hashcode = hashbytes('SHA1', u.FirstName
+ u.LastName + u.Address1 +
u.StateCode + u.city +
cast(u.zipcode as nvarchar(10)) + u.comment)
from t_customer t
join u_customer u
on t.custid = u.custid
Hash SCD Results:
Execution time: just over 3 seconds
Comparison
As you can see, a little
extra setup with an update table and writing a hash function can increase your
throughput almost 400 times. On a small
table the update went from 23 minutes down to 3 seconds. The hash function also
reduces the number of fields you have to compare to one. If you have a very
wide dimension table, the proportional savings will be even greater.
Give it a try. It will
provide immediate benefits.