By Julie Smith
Happy 2014! Happy to report that Audrey and I were both renewed as SQL Server MVPs today! To celebrate I’m publishing a really long blog post.
One of the most common scenarios encountered while ETL’in is an Incremental Load –that is determining if a source row already exists in the target database (often your warehouse), and if it does, has it changed since the last time it was loaded? The pseudo code goes like this:
Andy Leonard has written and spoken a lot on this topic. Borrowed with his permission is his logical diagram illustrating the concept:
A pattern that many SSIS developers and one similar to the one Andy’s post goes like this:
I read the source in with an oledb source component, then compare the source to the destination rows using a lookup component. I use the lookup option Redirect to NoMatch if the business keys don’t match and simply load those rows. The fun part comes from the path that matches. Let’s start by taking a look at the components of the dataflow:
Click the thumbnail below to enjoy your first optional, informational, random Chuckles interlude. Otherwise carry on.
And We’re Back!
For this example I created a staging table in the AdventureworksDW2008R2 database called [stg].[Prod]. It is going to be our source. Here is a shot of select * from it:
As you can see, this is an abbreviated version of the full DimProduct tables from the AdventureworksDW2008R2 database. I also created an abbreviated target table, DimProd, which is a copy of DimProduct with fewer columns: here is the same shot of our target, select * from DimProd:
Here is what the ole source contains (A)
Now B the Lookup:
From the General tab of the lookup component:
The important thing here is to change the “how to handle rows with no matching entries”: we want Redirect rows to no match output. (Redirect rows to no match output is not available in SSIS 2005)
Here is the Connections tab contents for the lookup:
What are the reasons for the columns in the lookup against the destination? We are going to need the surrogate key (ProductKey) for when we do find a changed row and have to deactivate the old one in the warehouse. We need the ProductAlternateKey for the join to the source, as this column is the business or natural key of the table. The other columns are change attributes, meaning we are going to have to compare those columns to the source if we find a matching business key. If and only if the change attribute columns are different between source and destination do we need further action. In the Lookup we are going to join the bk in the columns tab. We are going to select the Surrogate key and the Change attributes and alias them by prepending ‘lkp’ to each of them:
Columns mapping and output alias:
When the lookup yields no match, then the pattern is easy–simply load the rows: This is the mapping of (C)—ole db destination
The next step, step D is the actual component that contains the main reason for this post. Seems like a good time for a Chuckle, no?
and again we’re back. Apologies to anyone under the age of 40.
Figure D –Conditional Split.
A conditional split allows us to use the values obtained from the lookup and put the data down two paths—one path where there are one or more changes to any of the Change Attributes which we’ll call Delta and the other where there is no change, called the Ignore path. Only the rows in the Delta path require any further action.
Logically it goes like this:
Delta path occurs when, of the rows that match, one or more change attributes column has a different value between the source and the destination. So let’s look at the column FinishedGoodsFlag. From the source path it is self named, FinishedGoodsFlag. When the lookup component finds a match on the bk of ProductAlternateKey, then the value in the destination is added to the row in the buffer with the name lkpFinishedGoodsFlag. So the Delta condition would be satisifed when FinishedGoodsFlag does not equal lkpFinishedGoodsFlag OR
So here is the SSIS Expression we can use for the first change atttribute:
FinishedGoodsFlag != lkpFinishedGoodsFlag
Then we would string them together by adding an || operator for logical OR
So the whole SSIS Expression inside the Delta condition would be this, on one line, so really it would look like this (adding parenthesis for clarity)
(WeightUnitMeasureCode != lkpWeightUnitMeasureCode) || (SizeUnitMeasureCode != lkpSizeUnitMeasureCode) || (ProductName != lkpProductName) || (StandardCost != lkpStandardCost) || (FinishedGoodsFlag != lkpFinishedGoodsFlag) || (Color != lkpColor)
So this works fine, UNLESS there is a NULL in the dataflow. When the Conditional split evaluates anything with a NULL, it will return NULL and break the conditional split. So we have to also add NULL handling in the derived columns.
How we are going do that is by adding two NULL tests and SSIS conditional expressions, one on each side of the equation, to return a safe value for a NULL. So if the datatype is a number of any kind, we’re going to fill on both sides of the equation with a 0. If it’s a character column of any kind, we’ll safety fill with empty string “”. If it’s any kind of date, we’ll use 1/1/1900, and if it’s a bit, we’ll use FALSE. Providing a safe value for NULLs allows the conditional split to correctly compare the two values. Let’s return back to FinishedGoodsFlag, which is a boolean datatype. here is what we need to do for it:
The below shot shows the original expression on the top, with the changes made on the bottom highlighted. I’ve added a box around the core of the expression so you can see what’s happening:
How to read this?
<!--DVFMTSC-->(ISNULL(FinishedGoodsFlag) ? FALSE : FinishedGoodsFlag) != (ISNULL(lkpFinishedGoodsFlag) ? FALSE : lkpFinishedGoodsFlag) <!--DVFMTSC-->
Starting from the left.
Is FinishedGoodsFlag NULL? if yes, then replace NULL with FALSE, if not then use the actual value of FinishedGoodsFlag.
Is lkpFinishedGoodsFlag NULL? if yes, then replace NULL with FALSE, if not then use the actual value of lkpFinishedGoods
Now take the two values we’ve arrived at for FinishedGoodsFlag and lkpFinishedGoodsdFlag and compare them. Are they unequal? Are any of the other sets of change attribute comparisons unequal? If any are unequal, then it’s an Delta row. If not then the row is an Ignore row.
**the above expression works in all version of SSIS. For 2012, there is a new expression called REPLACENULL, which is simpler to use. It would go like this: (REPLACENULL(FinishedGoodsFlag, FALSE) != REPLACENULL(lkpFinishedGoodsFlag, FALSE))
So, believe it or not, all of that expression writing is kind of a pain in the butt to write. True story. Especially considering that your default value for NULL replacements varies depending on the data type. So I finally wrote a bit of t-sql to help me with this based on information schema of the Destination table and it made the whole process a lot easier (AND OH MY GOD THIS IS THE POINT OF THIS POST):
Here is the output:
Now, this is pretty helpful by itself, all that is left to do is copy and paste each line and place the || in between each line– but we could even take this a step further and get the entire statement ready for pasting into the expression box by using the “black box xml” technique to concatenate these together and add the ||s. For a complete rundown on this bit and how it works see my post here.
The output of the above SQL will give us the entire expression you see here in the Conditional Split:
In our package, nothing is going to happen to rows which go down the Conditional Split Default Output path.
In this example, we’re doing Type 2 changes. That means we want a history of the data, so we deactivate the old row and add the new values as a fresh row. So to finish, we’re going to add the multicast component (E) to create two copies of the data flow. On the right side (G) we’ll simply use an oledb command to deactivate the old rows using the surrogate key (lkpProductKey) as you see below:
Because they are canned cursors, I only use OleDB commands if I have profiled the data and feel confident that there won’t be a lot of changes per batch. My rule of thumb, and it’s arbitrary, is around 10,000 rows. If I see more than that many changes per run, then I would land the changes in an intermediate table, then use the table to do a sql update in another task after this dataflow. That also requires a truncate table statement before the dataflow.
While G is deactivating the old rows, (F) is landing the “Delta” rows as brand new rows, which is exactly like step C.
and here is the concluding Chuckle:
Download the solution here. You will need a copy of AdventurewordsDW2008R2 from Codeplex. I wrote the SSIS Solution in VS 2012.
To simply copy the SQL used to create the SSIS Expressions, go to this post: Click me for Code!