Cage Match I: Anti-Joins

Before we get started on the Cage Match, a quick shout-out to my blogging partner and fellow Datachix, Julie Smith (Blog | Twitter).  She’ll be speaking at SQL Rally (May 11-13).  I’ve seen her slides, and man, are you guys in for a treat.  Attend her session for some SSIS and Datachix-y Awesomeness!

 On to your regularly scheduled blog post…

One of the coolest things about writing and presenting is every time I present, I end up learning something new. Either someone in the audience points out something I hadn’t thought of, or a question spawns a whole new line of thinking. Both scenarios are very cool.

Recently, I presented on EXCEPT. I love this concept, because it is very “natural language-y”. It’s simple and effective, and I use it all the time. I had a few comments, mostly around performance and other patterns that can be used to achieve the same results. As a result of these comments, I decided that it was time for the inaugural Datachix Cage Match. So, without further ado, I present to you….

CAGE MATCH I: ANTI-JOINS!

Anti-Joins are used when we want a result set that represents this idea: Give me everything from one result set that doesn’t exist in another result set.

The Players:

NOT IN (The Old Man) – This is probably the first form of anti-join we all learn. It’s easy to understand, and it works. It’s developed a reputation as a resource-eater, and common wisdom is to avoid it.

SELECT Title, LastName 
FROM dbo.ImportBookONE 
WHERE Title +'--'+ LastName NOT IN  
  (SELECT Title +'--'+ LastName FROM dbo.ImportBookTWO );

LEFT JOIN (The Ninja) – This is the pattern that looks least like an Anti-Join. It’s an outer join coupled with a WHERE clause eliminating any rows that didn’t show results in the right query. It can be confusing if seen in the wild, because it isn’t immediately apparent that it’s an anti-join. It’s a favorite of savvy database developers.

SELECT o.Title, o.LastName 
FROM dbo.ImportBookONE o 
  LEFT JOIN dbo.ImportBookTWO t   
    ON o.Title = t.Title 
      AND o.LastName = t.LastName 
WHERE t.Title IS NULL   
  AND t.LastName IS NULL;

NOT EXISTS (The Existentialist) – NOT EXISTS has been around for a long time, and seems to get overlooked pretty often as an Anti-Join option. It comes right out and asks the age-old question, “Does what I’m looking for even exist”?

SELECT Title, LastName 
FROM dbo.ImportBookONE o 
WHERE NOT EXISTS   
  (SELECT t.Title, t.LastName
  FROM dbo.ImportBookTWO t
  WHERE t.Title = o.Title     
    AND t.LastName = o.LastName );

EXCEPT (The New Kid) – EXCEPT has been available since SQL Server 2005, and is a cousin to UNION. It’s the one I presented on, and is a personal favorite. (But I promise we’ll stick to facts for this cage match)

SELECT Title, LastName 
FROM dbo.ImportBookONE   
  EXCEPT 
SELECT Title, LastName 
FROM dbo.ImportBookTWO;

The Setup:

The cage match will consist of 4 rounds:

Round 1: iHeartBooks Match – I take the EXCEPT I wrote for the presentation I did and rewrite the Anti-Join using all 4 options

Round 2: Mini Match – 4 contestants, 10 and 8 row tables, respectively

Round 3: Big Match – 4 contestants, 332,500 rows in one table, 321,416 rows in the second

Round 4: Main Event Match – 4 contestants, 6,760,676 rows in one table, and 6,250,625 rows in the second, with a 2-part business key thrown in for good measure

The Measurements:

CPU Time – The amount of CPU time per query (in milliseconds)

Reads – The number of logical disk reads per query

Duration – The total amount of processing time for the query

Percent of Batch – The percentage of the batch taken up by the query (when all 4 anti-joins are run in a single batch) – this doesn’t really tell us much besides giving us rough comparative numbers, but it’s interesting so I’m keeping it

ROUND 1: iHeartBooks Match

Using the same database I presented with, I run the variations on Anti-Join. Below are the queries competing in Round 1:

LEFT JOIN NOT EXISTS
SELECT o.Title, o.LastName
FROM dbo.ImportBookONE o
LEFT JOIN dbo.ImportBookTWO t
 ON o.Title = t.Title
  AND o.LastName = t.LastName
WHERE t.Title IS NULL
 AND t.LastName IS NULL;
SELECT Title, LastName
FROM dbo.ImportBookONE o
WHERE  NOT EXISTS
(
SELECT t.Title, t.LastName
FROM dbo.ImportBookTWO t
WHERE t.Title = o.Title
AND t.LastName = o.LastName
);
EXCEPT NOT IN
SELECT Title, LastName
FROM dbo.ImportBookONE
EXCEPT
SELECT Title, LastName
FROM dbo.ImportBookTWO;
SELECT Title, LastName
FROM dbo.ImportBookONE
WHERE Title +'--'+ LastName NOT IN
(
SELECT Title +'--'+ LastName
FROM dbo.ImportBookTWO
);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 0 51 40 25
NOT EXISTS 15 51 6 25
EXCEPT 0 51 7 25
NOT IN 0 51 0 25

Hmmm… Not very decisive. Every query had the same number of logical reads? That’s no fun! We’re working with very few rows, and we really end up with no discernable difference between each contestant. On to Round 2!

ROUND 2: Mini Match

Now, we’re going to see if this logical reads thing holds with another set of queries. This time, we’ve got 2 very small tables, one with 10 dates and one with 8 dates. The Anti-Join will return 2 rows.

LEFT JOIN NOT EXISTS
SELECT m.CageMatchDate
FROM ExceptCageMatchTiny m
LEFT JOIN ExceptCageMatchTinySub s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;
SELECT CageMatchDate
FROM ExceptCageMatchTiny m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchTinySub s
WHERE m.CageMatchDate = s.CageMatchDate
);
EXCEPT NOT IN
SELECT CageMatchDate
FROM ExceptCageMatchTiny
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchTinySub;
SELECT CageMatchDate
FROM ExceptCageMatchTiny
WHERE CageMatchDate NOT IN
  (SELECT CageMatchDate
  FROM ExceptCageMatchTinySub);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 0 31 5 19
NOT EXISTS 16 31 6 18
EXCEPT 0 31 7 45
NOT IN 15 31 6 18

So what have our first two rounds of the cage match taught us? Well, if you’re working with very small data sets, you probably won’t see much difference in performance. Look at the differential between overall duration – almost nothing. This is a great example of the trap that we as database developers tend to fall into. We work with small sample databases during development, and it allows us to be lazy. Every option looks similar, and we go with what’s familiar instead of what’ll scale. Watch what happens in Rounds 3 & 4. Also, a great example of why only looking at Percent of Batch in SSMS can be misleading… 45% of the batch was EXCEPT, even though CPU time registered 0 and overall Duration was only 7? If you’re interested in seeing a complete picture of what’s happening on a server when you hit F5 in SSMS, take the time to kick off Profiler.

ROUND 3: Big Match

Ah, now the pros are entering the cage! We’re working with over 300,000 rows in each of our two tables, and there might be some blood spilled (finally!).

LEFT JOIN NOT EXISTS
SELECT m.CageMatchDate
FROM ExceptCageMatchMain m
LEFT JOIN ExceptCageMatchSecondary s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;
SELECT CageMatchDate
FROM ExceptCageMatchMain m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchSecondary s
WHERE m.CageMatchDate = s.CageMatchDate
);
EXCEPT NOT IN
SELECT CageMatchDate
FROM ExceptCageMatchMain
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchSecondary;
SELECT CageMatchDate
FROM ExceptCageMatchMain
WHERE CageMatchDate NOT IN
 (SELECT CageMatchDate
 FROM ExceptCageMatchSecondary)

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch
LEFT JOIN 389 1912 436 7
NOT EXISTS 390 1912 372 7
EXCEPT 624 1912 705 12
NOT IN 1262 1001148 880 74

Ah, the Old Man’s starting to show some wear. Logical reads for NOT IN went through the roof, and if you look at the Execution Plan, you can see why:

Poor guy had to spool out to a temporary table to hold the rows. None of the other contestants ran into this, and it absolutely killed NOT IN. I think he might be on his last leg. Very interesting that LEFT JOIN, NOT EXISTS, and EXCEPT are still showing the same number of logical reads. I was surprised by this.

ROUND 4: The Main Event

Finally, the round you’ve all been waiting for! A chance to see a no holds barred showdown between each of the contestants. So far, NOT IN has fallen behind, but LEFT JOIN, NOT EXISTS, and EXCEPT have both held up well. In this round, we’re working with two tables containing well over 6 million rows each. Just to make it more challenging, we’ve set up a two-part key to evaluate. Let’s see the final queries and results:

LEFT JOIN NOT EXISTS
SELECT m.CageMatchAlpha,  m.CageMatchNumber
FROM ExceptCageMatchGinormous m
LEFT JOIN ExceptCageMatchGinormousSub s
ON m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
WHERE s.CageMatchAlpha IS NULL
AND s.CageMatchNumber IS NULL;
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous m
WHERE NOT EXISTS
(
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub s
WHERE m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
);
EXCEPT NOT IN
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
EXCEPT
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub;
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
WHERE CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
NOT IN
 (SELECT CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
 FROM ExceptCageMatchGinormousSub);

OUTCOME:

CONTESTANT CPU READS DURATION Percent of Batch SSMS Run Time
LEFT JOIN 12446 30844 13029 8 0:14
NOT EXISTS 11483 30840 13917 7 0:13
EXCEPT 19456 30848 19988 10 0:20
NOT IN 60216 20363994 116674 76 1:52

Aha! Finally, we see some real difference in performance. The Old Man (NOT IN) has been knocked out cold. But, surprisingly, the other 3 contestants held up surprisingly well. EXCEPT started to show some cracks in the millions-of-rows arena and lagged behind LEFT JOIN and NOT EXISTS. I think I’m going to have to call it a tie between The Ninja and The Existentialist, with The New Kid in a close third.

This post was fun to research and fun to write. But there’s a moral to the story besides “NOT IN = Bad and LEFT JOIN = Good”. If you’re in a development environment, think about the size of your development data set versus what you expect your production data to look like. (Yeah, you heard me… Capacity Planning) Develop using best practices, and avoid lazy pitfalls (Like NOT IN). And most importantly, take the time to try different options. I mean, who hasn’t dreamed of refereeing a cage match?

Query on, my friends….

–Audrey

3 thoughts on “Cage Match I: Anti-Joins”

  1. Nicely done. I wonder if any of the poor performance of the “NOT IN” examples is due to the use of string concatenation. Would it be possible to write the “NOT IN” query without it?

Leave a Reply