Cage Match I: Anti-Joins

Before we get started on the Cage Match, a quick shout-out to my blogging partner and fellow Datachix, Julie Smith (Blog | Twitter).  She’ll be speaking at SQL Rally (May 11-13).  I’ve seen her slides, and man, are you guys in for a treat.  Attend her session for some SSIS and Datachix-y Awesomeness!

 On to your regularly scheduled blog post…

One of the coolest things about writing and presenting is every time I present, I end up learning something new. Either someone in the audience points out something I hadn’t thought of, or a question spawns a whole new line of thinking. Both scenarios are very cool.

Recently, I presented on EXCEPT. I love this concept, because it is very “natural language-y”. It’s simple and effective, and I use it all the time. I had a few comments, mostly around performance and other patterns that can be used to achieve the same results. As a result of these comments, I decided that it was time for the inaugural Datachix Cage Match. So, without further ado, I present to you….

CAGE MATCH I: ANTI-JOINS!

Anti-Joins are used when we want a result set that represents this idea: Give me everything from one result set that doesn’t exist in another result set.

The Players:

NOT IN (The Old Man) – This is probably the first form of anti-join we all learn. It’s easy to understand, and it works. It’s developed a reputation as a resource-eater, and common wisdom is to avoid it.

SELECT Title, LastName 
FROM dbo.ImportBookONE 
WHERE Title +'--'+ LastName NOT IN  
  (SELECT Title +'--'+ LastName FROM dbo.ImportBookTWO );

LEFT JOIN (The Ninja) – This is the pattern that looks least like an Anti-Join. It’s an outer join coupled with a WHERE clause eliminating any rows that didn’t show results in the right query. It can be confusing if seen in the wild, because it isn’t immediately apparent that it’s an anti-join. It’s a favorite of savvy database developers.

SELECT o.Title, o.LastName 
FROM dbo.ImportBookONE o 
  LEFT JOIN dbo.ImportBookTWO t   
    ON o.Title = t.Title 
      AND o.LastName = t.LastName 
WHERE t.Title IS NULL   
  AND t.LastName IS NULL;

NOT EXISTS (The Existentialist) – NOT EXISTS has been around for a long time, and seems to get overlooked pretty often as an Anti-Join option. It comes right out and asks the age-old question, “Does what I’m looking for even exist”?

SELECT Title, LastName 
FROM dbo.ImportBookONE o 
WHERE NOT EXISTS   
  (SELECT t.Title, t.LastName
  FROM dbo.ImportBookTWO t
  WHERE t.Title = o.Title     
    AND t.LastName = o.LastName );

EXCEPT (The New Kid) – EXCEPT has been available since SQL Server 2005, and is a cousin to UNION. It’s the one I presented on, and is a personal favorite. (But I promise we’ll stick to facts for this cage match)

SELECT Title, LastName 
FROM dbo.ImportBookONE   
  EXCEPT 
SELECT Title, LastName 
FROM dbo.ImportBookTWO;

The Setup:

The cage match will consist of 4 rounds:

Round 1: iHeartBooks Match – I take the EXCEPT I wrote for the presentation I did and rewrite the Anti-Join using all 4 options

Round 2: Mini Match – 4 contestants, 10 and 8 row tables, respectively

Round 3: Big Match – 4 contestants, 332,500 rows in one table, 321,416 rows in the second

Round 4: Main Event Match – 4 contestants, 6,760,676 rows in one table, and 6,250,625 rows in the second, with a 2-part business key thrown in for good measure

The Measurements:

CPU Time – The amount of CPU time per query (in milliseconds)

Reads – The number of logical disk reads per query

Duration – The total amount of processing time for the query

Percent of Batch – The percentage of the batch taken up by the query (when all 4 anti-joins are run in a single batch) – this doesn’t really tell us much besides giving us rough comparative numbers, but it’s interesting so I’m keeping it

ROUND 1: iHeartBooks Match

Using the same database I presented with, I run the variations on Anti-Join. Below are the queries competing in Round 1:

LEFT JOIN NOT EXISTS
SELECT o.Title, o.LastName
FROM dbo.ImportBookONE o
LEFT JOIN dbo.ImportBookTWO t
 ON o.Title = t.Title
  AND o.LastName = t.LastName
WHERE t.Title IS NULL
 AND t.LastName IS NULL;
SELECT Title, LastName
FROM dbo.ImportBookONE o
WHERE  NOT EXISTS
(
SELECT t.Title, t.LastName
FROM dbo.ImportBookTWO t
WHERE t.Title = o.Title
AND t.LastName = o.LastName
);
EXCEPT NOT IN
SELECT Title, LastName
FROM dbo.ImportBookONE
EXCEPT
SELECT Title, LastName
FROM dbo.ImportBookTWO;
SELECT Title, LastName
FROM dbo.ImportBookONE
WHERE Title +'--'+ LastName NOT IN
(
SELECT Title +'--'+ LastName
FROM dbo.ImportBookTWO
);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 0 51 40 25
NOT EXISTS 15 51 6 25
EXCEPT 0 51 7 25
NOT IN 0 51 0 25

Hmmm… Not very decisive. Every query had the same number of logical reads? That’s no fun! We’re working with very few rows, and we really end up with no discernable difference between each contestant. On to Round 2!

ROUND 2: Mini Match

Now, we’re going to see if this logical reads thing holds with another set of queries. This time, we’ve got 2 very small tables, one with 10 dates and one with 8 dates. The Anti-Join will return 2 rows.

LEFT JOIN NOT EXISTS
SELECT m.CageMatchDate
FROM ExceptCageMatchTiny m
LEFT JOIN ExceptCageMatchTinySub s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;
SELECT CageMatchDate
FROM ExceptCageMatchTiny m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchTinySub s
WHERE m.CageMatchDate = s.CageMatchDate
);
EXCEPT NOT IN
SELECT CageMatchDate
FROM ExceptCageMatchTiny
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchTinySub;
SELECT CageMatchDate
FROM ExceptCageMatchTiny
WHERE CageMatchDate NOT IN
  (SELECT CageMatchDate
  FROM ExceptCageMatchTinySub);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 0 31 5 19
NOT EXISTS 16 31 6 18
EXCEPT 0 31 7 45
NOT IN 15 31 6 18

So what have our first two rounds of the cage match taught us? Well, if you’re working with very small data sets, you probably won’t see much difference in performance. Look at the differential between overall duration – almost nothing. This is a great example of the trap that we as database developers tend to fall into. We work with small sample databases during development, and it allows us to be lazy. Every option looks similar, and we go with what’s familiar instead of what’ll scale. Watch what happens in Rounds 3 & 4. Also, a great example of why only looking at Percent of Batch in SSMS can be misleading… 45% of the batch was EXCEPT, even though CPU time registered 0 and overall Duration was only 7? If you’re interested in seeing a complete picture of what’s happening on a server when you hit F5 in SSMS, take the time to kick off Profiler.

ROUND 3: Big Match

Ah, now the pros are entering the cage! We’re working with over 300,000 rows in each of our two tables, and there might be some blood spilled (finally!).

LEFT JOIN NOT EXISTS
SELECT m.CageMatchDate
FROM ExceptCageMatchMain m
LEFT JOIN ExceptCageMatchSecondary s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;
SELECT CageMatchDate
FROM ExceptCageMatchMain m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchSecondary s
WHERE m.CageMatchDate = s.CageMatchDate
);
EXCEPT NOT IN
SELECT CageMatchDate
FROM ExceptCageMatchMain
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchSecondary;
SELECT CageMatchDate
FROM ExceptCageMatchMain
WHERE CageMatchDate NOT IN
 (SELECT CageMatchDate
 FROM ExceptCageMatchSecondary)

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 389 1912 436 7
NOT EXISTS 390 1912 372 7
EXCEPT 624 1912 705 12
NOT IN 1262 1001148 880 74

Ah, the Old Man’s starting to show some wear. Logical reads for NOT IN went through the roof, and if you look at the Execution Plan, you can see why:

Poor guy had to spool out to a temporary table to hold the rows. None of the other contestants ran into this, and it absolutely killed NOT IN. I think he might be on his last leg. Very interesting that LEFT JOIN, NOT EXISTS, and EXCEPT are still showing the same number of logical reads. I was surprised by this.

ROUND 4: The Main Event

Finally, the round you’ve all been waiting for! A chance to see a no holds barred showdown between each of the contestants. So far, NOT IN has fallen behind, but LEFT JOIN, NOT EXISTS, and EXCEPT have both held up well. In this round, we’re working with two tables containing well over 6 million rows each. Just to make it more challenging, we’ve set up a two-part key to evaluate. Let’s see the final queries and results:

LEFT JOIN NOT EXISTS
SELECT m.CageMatchAlpha,  m.CageMatchNumber
FROM ExceptCageMatchGinormous m
LEFT JOIN ExceptCageMatchGinormousSub s
ON m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
WHERE s.CageMatchAlpha IS NULL
AND s.CageMatchNumber IS NULL;
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous m
WHERE NOT EXISTS
(
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub s
WHERE m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
);
EXCEPT NOT IN
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
EXCEPT
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub;
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
WHERE CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
NOT IN
 (SELECT CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
 FROM ExceptCageMatchGinormousSub);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch SSMS Run Time
LEFT JOIN 12446 30844 13029 8 0:14
NOT EXISTS 11483 30840 13917 7 0:13
EXCEPT 19456 30848 19988 10 0:20
NOT IN 60216 20363994 116674 76 1:52

Aha! Finally, we see some real difference in performance. The Old Man (NOT IN) has been knocked out cold. But, surprisingly, the other 3 contestants held up surprisingly well. EXCEPT started to show some cracks in the millions-of-rows arena and lagged behind LEFT JOIN and NOT EXISTS. I think I’m going to have to call it a tie between The Ninja and The Existentialist, with The New Kid in a close third.

This post was fun to research and fun to write. But there’s a moral to the story besides “NOT IN = Bad and LEFT JOIN = Good”. If you’re in a development environment, think about the size of your development data set versus what you expect your production data to look like. (Yeah, you heard me… Capacity Planning) Develop using best practices, and avoid lazy pitfalls (Like NOT IN). And most importantly, take the time to try different options. I mean, who hasn’t dreamed of refereeing a cage match?

Query on, my friends….

–Audrey

Debugging SSIS Variables Part Two–The Locals Window

From the book of Bloggering, Chapter 59 Verse 11: Thou shalt humbly accepteth better solutions to thine posits as they appeareth in thine Comments Section.   Yea verily, although thou may fearest change, thou shalt learneth new and better coding in this manner and be pleasing in the sight of thine employer.

I had a blog post several weeks ago about debugging the values of SSIS Variables. I shared the method I have always used, which was simply (actually maybe it’s not simple at all) adding a script task into the control flow and placing a message box in the control flow, displaying the value of the variable. Continue reading Debugging SSIS Variables Part Two–The Locals Window

The Case of the Mysterious Failing Packages

Or, that time I pimp-slapped some GUIDs…

Let me set the stage for you. We’re less than a week out from a major production push. We move the SSIS (2008) packages over to the integration testing server to make sure everything is good to go. On each data load, something fails. Never in the same package, never in the same task, never from the same source data. We think it’s the server. We move the packages again. More failures.

I’m losing sleep, breaking out in a weird rash that may or may not be related to my panic, and my boss is at my cube every hour asking how things are going in his, “I’m trying to be supportive, but fix the damned thing already”, voice. I’m rethinking my Don’t Drink During Work policy. This is not good. I just can’t find anything wrong. I can run the packages independently and they’re all happy green boxes. I can run the whole load from the beginning and it’ll get past the point of the prior failure only to crap its pants further down the line. Now, full disclosure: I sort of suck at SSIS. It’s fairly new for me, and while I’m learning, I still have a long way to go. Let’s just say that my self-confidence was not at an, “I’m Okay, You’re Okay”, sort of level.

After the latest failure and an inappropriate-for-the-workplace, profanity-laden outburst from me where I may or may not have referred to SSIS as “the stinking offspring of a poxy two-dollar…”, oh, never mind… I hear a question from the next cube over. “Have you reset the GUIDs?” What? GUIDs? Uh, no. I stand up, plant myself in my co-worker Tom’s cube, and ask for an explanation. He explains that if packages are copied, sometimes the GUIDs can be the same in two packages. Then, if they end up running in parallel, it can screw up memory management and SSIS freaks out. And as we all know, when SSIS ain’t happy, ain’t nobody happy. We use a stock template for all of our packages that is copied and then saved off as the new package. And, since our dimension loads are pretty standard, if we need to add a new dimension, sometimes it’s easier to copy one load package and modify it rather than building one from scratch. Yes, I’ve definitely done some package copying in my short tenure as an SSIS developer.

Jeff wanders over, listens to my plight and confirms that I should definitely try resetting the GUIDs. These two guys, who have since been inducted into my Personal Heroes Hall of Fame, explain that with BIDS Helper (totally free download), I can reset the GUIDs on a package and hit every object within that package. A quick rundown of the feature and why it is useful is here. Luckily, I had already installed it for a couple of other features it offered, so I got to work. Can’t hurt, right? Our ETL process is fairly complex, and we have a lot of data moving around. Dimension loads, fact loads, auditing, staging, the works. It took about an hour between me and another co-worker, Tim, to step through all the packages, check them out of TFS (Team Foundation Server), reset the GUIDs, and check them back in. It was easy. Just right-click on the package, look down to the bottom of the menu, click Reset GUIDs, confirm, and you’re done.

We move the packages with their new and improved GUIDs back over to the integration testing server, and viola!, data’s flying around like Your Mama insults at a middle school. No failures. Let me reiterate: With no other changes, the sporadic failures stopped when we reset all the GUIDs.

Now, I can’t say with 100% certainty that resetting the GUIDs definitely fixed the problem. All I know is that we did it, and the failures stopped. That’s one hell of a coincidence if it wasn’t the issue. This is right up there with the legendary, “Try rebooting the server”, and his lesser known cousin, “Restart the service”. It is something to try when all else fails.

What did I learn? A few things:
1) SSIS does not have a personal vendetta against me. There was no ghost in the machine.
2) BIDS Helper is my friend. Just install it. You’ll love what it gives you.
3) If I copy a package, I make it part of my routine to reset the GUIDs.
4) When in doubt, curse loudly until someone comes up with an idea (just kidding). Seriously, this is a great example of why we should get up from our desks and enlist the ideas and knowledge of those around us. Turns out that I work with some pretty smart guys.