Cage Match I: Anti-Joins

Before we get started on the Cage Match, a quick shout-out to my blogging partner and fellow Datachix, Julie Smith (Blog | Twitter).  She’ll be speaking at SQL Rally (May 11-13).  I’ve seen her slides, and man, are you guys in for a treat.  Attend her session for some SSIS and Datachix-y Awesomeness!

 On to your regularly scheduled blog post…

One of the coolest things about writing and presenting is that every time I present, I end up learning something new. Either someone in the audience points out something I hadn’t thought of, or a question spawns a whole new line of thinking. Both scenarios are very cool.

Recently, I presented on EXCEPT. I love this concept, because it is very “natural language-y”. It’s simple and effective, and I use it all the time. I had a few comments, mostly around performance and other patterns that can be used to achieve the same results. As a result of these comments, I decided that it was time for the inaugural Datachix Cage Match. So, without further ado, I present to you….

CAGE MATCH I: ANTI-JOINS!

Anti-Joins are used when we want a result set that represents this idea: Give me everything from one result set that doesn’t exist in another result set.

The Players:

NOT IN (The Old Man) – This is probably the first form of anti-join we all learn. It’s easy to understand, and it works. It’s developed a reputation as a resource-eater, and common wisdom is to avoid it.

SELECT Title, LastName 
FROM dbo.ImportBookONE 
WHERE Title +'--'+ LastName NOT IN  
  (SELECT Title +'--'+ LastName FROM dbo.ImportBookTWO );

LEFT JOIN (The Ninja) – This is the pattern that looks least like an Anti-Join. It’s an outer join coupled with a WHERE clause that keeps only the rows with no match in the right-hand table. It can be confusing if seen in the wild, because it isn’t immediately apparent that it’s an anti-join. It’s a favorite of savvy database developers.

SELECT o.Title, o.LastName 
FROM dbo.ImportBookONE o 
  LEFT JOIN dbo.ImportBookTWO t   
    ON o.Title = t.Title 
      AND o.LastName = t.LastName 
WHERE t.Title IS NULL   
  AND t.LastName IS NULL;

NOT EXISTS (The Existentialist) – NOT EXISTS has been around for a long time, and seems to get overlooked pretty often as an Anti-Join option. It comes right out and asks the age-old question, “Does what I’m looking for even exist?”

SELECT Title, LastName 
FROM dbo.ImportBookONE o 
WHERE NOT EXISTS   
  (SELECT t.Title, t.LastName
  FROM dbo.ImportBookTWO t
  WHERE t.Title = o.Title     
    AND t.LastName = o.LastName );

EXCEPT (The New Kid) – EXCEPT has been available since SQL Server 2005, and is a cousin to UNION. It’s the one I presented on, and is a personal favorite. (But I promise we’ll stick to facts for this cage match.)

SELECT Title, LastName 
FROM dbo.ImportBookONE   
  EXCEPT 
SELECT Title, LastName 
FROM dbo.ImportBookTWO;

The Setup:

The cage match will consist of 4 rounds:

Round 1: iHeartBooks Match – I take the EXCEPT query from my presentation and rewrite the same Anti-Join using all 4 options

Round 2: Mini Match – 4 contestants, with 10- and 8-row tables, respectively

Round 3: Big Match – 4 contestants, 332,500 rows in one table, 321,416 rows in the second

Round 4: Main Event Match – 4 contestants, 6,760,676 rows in one table, and 6,250,625 rows in the second, with a 2-part business key thrown in for good measure

The Measurements:

CPU Time – The amount of CPU time per query (in milliseconds)

Reads – The number of logical disk reads per query

Duration – The total amount of processing time for the query

Percent of Batch – The percentage of the batch taken up by the query (when all 4 anti-joins are run in a single batch) – this doesn’t really tell us much besides giving us rough comparative numbers, but it’s interesting so I’m keeping it
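A quick aside: if you want to capture these measurements yourself without setting up a Profiler trace, SET STATISTICS TIME and SET STATISTICS IO will give you comparable per-query numbers in the Messages tab. A minimal sketch:

-- Rough-and-ready measurement without Profiler:
-- STATISTICS TIME reports CPU time and elapsed time (in ms),
-- STATISTICS IO reports logical reads per table.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Run a contestant here, e.g. the EXCEPT version:
SELECT Title, LastName
FROM dbo.ImportBookONE
EXCEPT
SELECT Title, LastName
FROM dbo.ImportBookTWO;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;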

ROUND 1: iHeartBooks Match

Using the same database I presented with, I run the variations on Anti-Join. Below are the queries competing in Round 1:

LEFT JOIN:

SELECT o.Title, o.LastName
FROM dbo.ImportBookONE o
LEFT JOIN dbo.ImportBookTWO t
 ON o.Title = t.Title
  AND o.LastName = t.LastName
WHERE t.Title IS NULL
 AND t.LastName IS NULL;

NOT EXISTS:

SELECT Title, LastName
FROM dbo.ImportBookONE o
WHERE NOT EXISTS
(
SELECT t.Title, t.LastName
FROM dbo.ImportBookTWO t
WHERE t.Title = o.Title
AND t.LastName = o.LastName
);

EXCEPT:

SELECT Title, LastName
FROM dbo.ImportBookONE
EXCEPT
SELECT Title, LastName
FROM dbo.ImportBookTWO;

NOT IN:

SELECT Title, LastName
FROM dbo.ImportBookONE
WHERE Title +'--'+ LastName NOT IN
(
SELECT Title +'--'+ LastName
FROM dbo.ImportBookTWO
);

OUTCOME:

CONTESTANT | CPU | READS | DURATION | PERCENT OF BATCH
LEFT JOIN  |   0 |    51 |       40 | 25
NOT EXISTS |  15 |    51 |        6 | 25
EXCEPT     |   0 |    51 |        7 | 25
NOT IN     |   0 |    51 |        0 | 25

Hmmm… Not very decisive. Every query had the same number of logical reads? That’s no fun! We’re working with very few rows, and we end up with no discernible difference between the contestants. On to Round 2!

ROUND 2: Mini Match

Now, we’re going to see if this logical reads thing holds with another set of queries. This time, we’ve got 2 very small tables, one with 10 dates and one with 8 dates. The Anti-Join will return 2 rows.
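If you want to play along at home, here’s a minimal setup sketch. The actual dates aren’t in the post, so these are made up; all that matters is that the first table has 10 dates and the second shares 8 of them.

-- Hypothetical setup: 10 dates in the Tiny table, 8 of them repeated in TinySub.
CREATE TABLE ExceptCageMatchTiny (CageMatchDate datetime);
CREATE TABLE ExceptCageMatchTinySub (CageMatchDate datetime);

INSERT INTO ExceptCageMatchTiny (CageMatchDate)
VALUES ('2011-04-01'), ('2011-04-02'), ('2011-04-03'), ('2011-04-04'), ('2011-04-05'),
       ('2011-04-06'), ('2011-04-07'), ('2011-04-08'), ('2011-04-09'), ('2011-04-10');

-- Copy all but the last 2 dates, so the anti-join has exactly 2 rows to find.
INSERT INTO ExceptCageMatchTinySub (CageMatchDate)
SELECT CageMatchDate
FROM ExceptCageMatchTiny
WHERE CageMatchDate < '2011-04-09';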

LEFT JOIN:

SELECT m.CageMatchDate
FROM ExceptCageMatchTiny m
LEFT JOIN ExceptCageMatchTinySub s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;

NOT EXISTS:

SELECT CageMatchDate
FROM ExceptCageMatchTiny m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchTinySub s
WHERE m.CageMatchDate = s.CageMatchDate
);

EXCEPT:

SELECT CageMatchDate
FROM ExceptCageMatchTiny
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchTinySub;

NOT IN:

SELECT CageMatchDate
FROM ExceptCageMatchTiny
WHERE CageMatchDate NOT IN
  (SELECT CageMatchDate
  FROM ExceptCageMatchTinySub);

OUTCOME:

CONTESTANT | CPU | READS | DURATION | PERCENT OF BATCH
LEFT JOIN  |   0 |    31 |        5 | 19
NOT EXISTS |  16 |    31 |        6 | 18
EXCEPT     |   0 |    31 |        7 | 45
NOT IN     |  15 |    31 |        6 | 18

So what have our first two rounds of the cage match taught us? Well, if you’re working with very small data sets, you probably won’t see much difference in performance. Look at the differential between overall durations – almost nothing. This is a great example of the trap that we as database developers tend to fall into. We work with small sample databases during development, and it allows us to be lazy. Every option looks similar, and we go with what’s familiar instead of what’ll scale. Watch what happens in Rounds 3 & 4.

This is also a great example of why looking only at Percent of Batch in SSMS can be misleading: 45% of the batch was EXCEPT, even though its CPU time registered 0 and its overall duration was only 7? If you’re interested in seeing a complete picture of what’s happening on a server when you hit F5 in SSMS, take the time to kick off Profiler.

ROUND 3: Big Match

Ah, now the pros are entering the cage! We’re working with over 300,000 rows in each of our two tables, and there might be some blood spilled (finally!).

LEFT JOIN:

SELECT m.CageMatchDate
FROM ExceptCageMatchMain m
LEFT JOIN ExceptCageMatchSecondary s
ON m.CageMatchDate = s.CageMatchDate
WHERE s.CageMatchDate IS NULL;

NOT EXISTS:

SELECT CageMatchDate
FROM ExceptCageMatchMain m
WHERE NOT EXISTS
(
SELECT CageMatchDate
FROM ExceptCageMatchSecondary s
WHERE m.CageMatchDate = s.CageMatchDate
);

EXCEPT:

SELECT CageMatchDate
FROM ExceptCageMatchMain
EXCEPT
SELECT CageMatchDate
FROM ExceptCageMatchSecondary;

NOT IN:

SELECT CageMatchDate
FROM ExceptCageMatchMain
WHERE CageMatchDate NOT IN
 (SELECT CageMatchDate
 FROM ExceptCageMatchSecondary);

OUTCOME:

CONTESTANT | CPU  | READS     | DURATION | PERCENT OF BATCH
LEFT JOIN  |  389 |     1,912 |      436 | 7
NOT EXISTS |  390 |     1,912 |      372 | 7
EXCEPT     |  624 |     1,912 |      705 | 12
NOT IN     | 1262 | 1,001,148 |      880 | 74

Ah, the Old Man’s starting to show some wear. Logical reads for NOT IN went through the roof, and if you look at the Execution Plan, you can see why:

Poor guy had to spool out to a temporary table to hold the rows. None of the other contestants ran into this, and it absolutely killed NOT IN. I think he might be on his last legs. Very interesting that LEFT JOIN, NOT EXISTS, and EXCEPT are still showing the same number of logical reads. I was surprised by this.

ROUND 4: The Main Event

Finally, the round you’ve all been waiting for! A chance to see a no-holds-barred showdown between the contestants. So far, NOT IN has fallen behind, but LEFT JOIN, NOT EXISTS, and EXCEPT have all held up well. In this round, we’re working with two tables containing well over 6 million rows each. Just to make it more challenging, we’ve set up a two-part key to evaluate. Let’s see the final queries and results:

LEFT JOIN:

SELECT m.CageMatchAlpha, m.CageMatchNumber
FROM ExceptCageMatchGinormous m
LEFT JOIN ExceptCageMatchGinormousSub s
ON m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
WHERE s.CageMatchAlpha IS NULL
AND s.CageMatchNumber IS NULL;

NOT EXISTS:

SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous m
WHERE NOT EXISTS
(
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub s
WHERE m.CageMatchAlpha = s.CageMatchAlpha
AND m.CageMatchNumber = s.CageMatchNumber
);

EXCEPT:

SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
EXCEPT
SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormousSub;

NOT IN:

SELECT CageMatchAlpha, CageMatchNumber
FROM ExceptCageMatchGinormous
WHERE CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
NOT IN
 (SELECT CageMatchAlpha+'-'
  +CAST(CageMatchNumber as varchar(10))
 FROM ExceptCageMatchGinormousSub);

OUTCOME:

CONTESTANT | CPU    | READS      | DURATION | PERCENT OF BATCH | SSMS RUN TIME
LEFT JOIN  | 12,446 |     30,844 |   13,029 |  8 | 0:14
NOT EXISTS | 11,483 |     30,840 |   13,917 |  7 | 0:13
EXCEPT     | 19,456 |     30,848 |   19,988 | 10 | 0:20
NOT IN     | 60,216 | 20,363,994 |  116,674 | 76 | 1:52

Aha! Finally, we see some real difference in performance. The Old Man (NOT IN) has been knocked out cold. The other three contestants, though, held up surprisingly well. EXCEPT started to show some cracks in the millions-of-rows arena and lagged behind LEFT JOIN and NOT EXISTS. I think I’m going to have to call it a tie between The Ninja and The Existentialist, with The New Kid in a close third.
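By the way, the build scripts for the Ginormous tables didn’t make it into the post, but the row counts factor suspiciously neatly: 6,760,676 = 676 x 10,001 and 6,250,625 = 625 x 10,001. So one plausible way to build a similar sandbox is two-letter codes cross-joined with the numbers 0 through 10,000. A hypothetical sketch (only the table and column names come from the queries above; everything else is assumption):

-- Hypothetical build: 26 x 26 two-letter codes x 10,001 numbers = 6,760,676 rows.
-- (For the Sub table, use TOP (25) in Letters: 625 x 10,001 = 6,250,625 rows.)
WITH Letters AS
(
    SELECT TOP (26) CHAR(64 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) AS Letter
    FROM sys.all_objects
),
Numbers AS
(
    SELECT TOP (10001) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT l1.Letter + l2.Letter AS CageMatchAlpha,
       n.n AS CageMatchNumber
INTO ExceptCageMatchGinormous
FROM Letters l1
CROSS JOIN Letters l2
CROSS JOIN Numbers n;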

This post was fun to research and fun to write. But there’s a moral to the story besides “NOT IN = Bad and LEFT JOIN = Good”. If you’re in a development environment, think about the size of your development data set versus what you expect your production data to look like. (Yeah, you heard me… Capacity Planning) Develop using best practices, and avoid lazy pitfalls (like NOT IN). And most importantly, take the time to try different options. I mean, who hasn’t dreamed of refereeing a cage match?

Query on, my friends….

–Audrey

Noir SQL… Or, a Hardboiled Approach to Getting the Job Done

You can tell a lot about my state of mind by the books I’m reading. Lately, it’s Urban Fantasy with a Noir feel to it. Specifically, I’m reading Mike Carey’s Felix Castor series, and I just finished a book by Richard Kadrey called Sandman Slim: A Novel. I love the anti-hero. The protagonist who is gritty and dirty and has a few great scars is my kind of guy. He unapologetically breaks the rules and isn’t all, “it’s more about the journey than the destination.” For him, destination is what matters, no matter how you got there.

Lately, I feel a bit like the scarred anti-hero. I’m doing some things in a production environment that I’m not totally thrilled about, and I wish I could stop the line and do things the “right” way. I want to use SSIS to transform data. I want to encapsulate processes into neat, repeatable, parameterized modules. But, you know what? When there’s a same-day turnaround on a request, you make do. You go a little Noir on your T-SQL, know what I mean?

I want to show you two things that I’ve actually done in the past few weeks. No, given a nice, neat environment, this SQL might never have been written. Am I proud of it? Well, yes. Yes I am. At the end of the day, I got the customer what he needed. Was it pretty? No. I’m cool with that. Being the anti-hero is kind of fun every once in a while.

Fixed-Width Output

I needed to give a guy a text file in fixed-width format. I had a process from my predecessor that just wasn’t working. The file was already late. So here’s what I did. I’m using the AdventureWorks database to show an example.

SELECT
	LEFT((ISNULL(Title,'')+SPACE(50)), 8)+
	LEFT((ISNULL(FirstName,'')+SPACE(100)), 20)+
	LEFT((ISNULL(LastName,'')+SPACE(100)), 30)+
	LEFT((ISNULL(MiddleName,'')+SPACE(100)), 5)+
	LEFT((ISNULL(EmailAddress,'')+SPACE(100)), 35)+
	LEFT((ISNULL(Phone,'')+SPACE(100)), 25)
FROM AdventureWorks.Person.Contact;

The result:

Paste it into Notepad and see how it looks:

I save the text file and send it on. Pour myself a whiskey, neat, and light up an unfiltered Lucky Strike.  Okay, not really, but you know what I mean. 

A quick run-down:

ISNULL: If any of the values I’m concatenating are NULL, then the entire string will come back as NULL. I wrap all of my columns in ISNULL like so:

ISNULL(Title, '')

This sets the value to an empty string if the value is NULL.

SPACE: This handy little string function will pad the given number of spaces onto the result you return. I want to make sure I end up with enough padded spaces to fill out the fixed-width portion of that column. So, I pad the output:

ISNULL(Title, '')+SPACE(50)

This will give me the output from the Title column, plus 50 spaces.

LEFT: Now, not every value coming out of the database is going to have the exact same number of characters. So, I use the LEFT function to trim it down to the exact length I want. LEFT will take the left-most number of characters you tell it to. If I say,

LEFT((ISNULL(Title,'')+SPACE(50)), 8)

I’m telling it to give me characters 1-8 of whatever comes back. Since I’ve padded my output with spaces, it’ll be the result from the column, plus as many spaces as needed to pad the output to 8 characters.
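A baby version makes the pad-then-trim idea click (the brackets are just there so you can see the trailing spaces):

-- Baby example: pad 'Mr.' with spaces, then trim to exactly 8 characters.
SELECT '[' + LEFT('Mr.' + SPACE(50), 8) + ']' AS FixedWidthTitle;
-- Returns: [Mr.     ]   ('Mr.' plus 5 spaces = 8 characters)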

Pretty? No. Functional? Yes. Noir SQL? Absolutely.

Remove Unwanted Characters

Next up, I have a source file I use from another department. It comes in Excel format, and includes a phone number. I’m supposed to get something that looks like this: 1112223333. Nice, neat, simple. What do I get? A hodge-podge of phone number formats. I’m looking at something like this:

CREATE TABLE PhoneNumber
(
	PhoneNumber varchar(50)
); 

INSERT INTO PhoneNumber(PhoneNumber)
VALUES
	('1112223333'), ('(111) 222-3333'), ('111-222-3333'), ('111 222 3333'); 	

SELECT PhoneNumber
FROM PhoneNumber

Okay. So I need to clean these numbers up quickly. Destination, not journey, my friends. I’m the anti-hero. I import the data into SQL Server using the Import/Export utility so I can manipulate the data. Then, I run this nifty little REPLACE statement:

SELECT PhoneNumber,
	CASE
	WHEN ISNUMERIC(PhoneNumber) = 0
		THEN REPLACE(
			REPLACE(
				REPLACE(
					REPLACE(PhoneNumber, '-', ''),			--Strip out dashes
				' ', ''),							--Strip out spaces
			')', ''),								--Strip out close parenthesis
		'(', '')									--Strip out open parenthesis
		ELSE PhoneNumber
	END as FormattedPhoneNumber
FROM dbo.PhoneNumber

Check out the results:

Sweet. It’s quick, it’s dirty, and it saved me having to wait on the source data provider to clean things up on his end. I turn the query into an UPDATE statement, and I’ve got clean data to import.  Again, a run-down of the functions:

ISNUMERIC: Tells me whether the value I’m passing qualifies as a number or not. NOTE: It recognizes hexadecimal as a number, so use it carefully. I set up a CASE statement that asks if the value is numeric. If it is, that means I don’t have any characters like “(“, “)”, or “-“ in there. If not, I apply a nested REPLACE to the value.

REPLACE: REPLACE is awesome. I can say something like this: REPLACE(PhoneNumber, '-', ''). This is saying that if I find a dash, I want to replace it with an empty string. What’s really cool is that I can nest them. So, I can tell it to remove the dashes, then the spaces, then the close parenthesis, and finally the open parenthesis in one statement.
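And here’s roughly what the UPDATE version looks like: the same nested REPLACE, just pointed at the table instead of a SELECT (a sketch of the statement I described above):

-- Sketch of the UPDATE version: same nested REPLACE, applied in place.
UPDATE dbo.PhoneNumber
SET PhoneNumber =
	REPLACE(REPLACE(REPLACE(REPLACE(PhoneNumber, '-', ''), ' ', ''), ')', ''), '(', '')
WHERE ISNUMERIC(PhoneNumber) = 0;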

Bottom line: Sometimes things just have to get done. The difference between an anti-hero and a true antagonist is that we anti-heroes know to go back and do things the right way as soon as we get a moment to breathe. In the meantime, don’t apologize for leaving behind a few unmarked graves when you need to get the job done. We’re anti-heroes. We have the scars to prove it.

FOR XML a la Mode – a study on Modes, especially using FOR XML PATH Mode to Concatenate Data

Hello folks, Julie here with the as promised weekly Datachix blog—

Hup Hup—did you just say “Julie” here?

Yes I did.

You specifically said last week that you and Audrey would blog weekly and ALTERNATELY.  You posted last week. Isn’t it Audrey’s turn?

Yes, but Audrey is fiercely battling Reporting Services at a new client site – she can’t post this week.   Also, I think there were Ninjas…. and maybe some Bear-Sharks……… she’s really really….. busy.

Fine.

 Hello folks, Julie here with the as promised weekly Datachix blog.  😉   Recently a buddy of mine left a comment for us which went like this:

 Can you post an example of putting data from multiple rows into a single column?

 For example you have this:

Cat | meow

Cat | eat

Dog | bark

Dog | fetch

 

And you want to display it as this:

Cat | meow, eat

Dog | bark, fetch

It’s hard to find a good example of this online.

Thanks, Ben
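The full walkthrough deserves its own post, but here’s a minimal sketch of the FOR XML PATH('') trick the title promises. The table and column names are made up to match Ben’s example:

-- Hypothetical setup matching Ben's example:
CREATE TABLE AnimalActions (Animal varchar(10), AnimalAction varchar(10));
INSERT INTO AnimalActions
VALUES ('Cat', 'meow'), ('Cat', 'eat'), ('Dog', 'bark'), ('Dog', 'fetch');

-- Concatenate each animal's actions into one comma-separated column.
SELECT a.Animal,
       STUFF((SELECT ', ' + AnimalAction
              FROM AnimalActions
              WHERE Animal = a.Animal
              FOR XML PATH('')), 1, 2, '') AS Actions
FROM (SELECT DISTINCT Animal FROM AnimalActions) a;

-- Returns:
-- Cat | meow, eat
-- Dog | bark, fetch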


Getting Schooled on Dynamic Pivot… Or, PIVOT Part 2

A note: I’m reposting this because I accidentally deleted it from WordPress. Because I’m an idiot.

I wrote a post about Overcoming my Fear of Pivot. With my newfound confidence, I decided to tackle dynamic pivots. This is a common scenario where you need to PIVOT, but you don’t know exactly what you’re going to end up with. Basically, you want to allow all of the possible column headers to come back with the aggregated data you need.

If you’re not familiar with PIVOT, go back and read the original post. If I’ve done my job properly, it should make sense. So, here’s what I did… I resisted the urge to hit Google to find a solution to the dynamic pivot problem. I opened SSMS and said, “Self, you’re under a deadline. Write it and see if you can get it to work all by your lonesome”. 45 minutes later, I had a working script that produced some cool real-world output, if I do say so myself.

Then, I hit Google. Then I saw Itzik Ben-Gan’s solution. My first response was, “Crap!” Actually, it was a much less ladylike expletive than that. The solution was… Beautiful. Elegant. Blew my method out of the water. You know how athletes have muscle memory? Well, developers have it too. We fall back to what’s comfortable and familiar. Sort of like our own version of T-SQL sweatpants and chocolate ice cream. Before I start in on the comparison of my solution and Itzik’s, let me say this: His is so much better than mine. Did I mention that it was elegant? And beautiful? But you know what? In a real development environment, with deadlines and giant to-do lists, I would have fallen back to my own comfort zone. I know this. I also know that next time I need to write a dynamic PIVOT, I’m going to know how to use his method.

Authors, when asked to give advice to aspiring writers, always say the same thing. “Write what you know.” For us IT Folk, there’s a corollary. “Write what you know. Hit the deadline. Then, go learn a better way.” Am I proud that I figured a solution out on my own? Yup. Am I a bit deflated that I didn’t come up with the same solution as Itzik Ben-Gan? Nope. Come on, it’s Itzik.

Personal note: I hate when I run across someone else’s T-SQL and ask them, “How does this work?”, and their response is, “I don’t know, I found it on a blog post/Google/forum.” Peeps, this is unacceptable. Don’t copy and paste until you understand what you’re seeing. Because someday you’re going to have to maintain that pilfered bit of code. If you don’t know what it does, then don’t use it. Comprehend your own code. We all borrow from the experts, but make sure you can explain it in 50 words or less. If you can’t, then back away from the Ctrl+V. Stretch your skills, learn new things, just don’t jeopardize a project by jumping the gun.

Okay, enough commentary. On to the solutions. The trick in a dynamic PIVOT is to create a string that has all of the column headers you need. This is where he and I diverged wildly. I fell back on a WHILE loop over a set of rows contained in a table variable; he used the STUFF function with a FOR XML PATH('') query output. I wrote my solution to address the same example from BOL that I ranted about in my first post. I modified his solution to produce the same output, and to clean out some unused variables that were in the sample I found. I’ve also resisted the urge to make little tweaks to my script after doing some extra research. Truly, I want to make the point that there’s what works… and what works beautifully.

My solution:

SET NOCOUNT ON;

DECLARE @vEmployeeIDTable as TABLE
(
EmployeeID varchar(20) NOT NULL
,ProcessedFlag bit NOT NULL DEFAULT(0)
)

DECLARE @vEmployeeID varchar(20)
DECLARE @vSQLString varchar(max) = ''
DECLARE @vEmployeeIDSELECT varchar(max) = ''
DECLARE @vEmployeeIDFOR varchar(max) = ''
DECLARE @vLoopCounter varchar(50) = 1

INSERT INTO @vEmployeeIDTable(EmployeeID)
SELECT DISTINCT EmployeeID
FROM Purchasing.PurchaseOrderHeader;

WHILE (SELECT count(ProcessedFlag) FROM @vEmployeeIDTable WHERE ProcessedFlag = 0) > 0
BEGIN

SELECT @vEmployeeID = '['+cast(MIN(EmployeeID) as varchar(20))+']'
FROM @vEmployeeIDTable
WHERE ProcessedFlag = 0

SET @vEmployeeIDSELECT = @vEmployeeIDSELECT + @vEmployeeID + ' as Emp'+@vLoopCounter+','
SET @vEmployeeIDFOR = @vEmployeeIDFOR + @vEmployeeID +','

UPDATE @vEmployeeIDTable
SET ProcessedFlag = 1
WHERE EmployeeID = cast(substring(@vEmployeeID,2, LEN(@vEmployeeID)-2) as int)

SET @vLoopCounter = @vLoopCounter + 1

END

SET @vEmployeeIDSELECT = SUBSTRING(@vEmployeeIDSELECT,1, len(@vEmployeeIDSELECT)-1)
SET @vEmployeeIDFOR = SUBSTRING(@vEmployeeIDFOR,1, len(@vEmployeeIDFOR)-1)

SET @vSQLString = '
SELECT VendorID, '+@vEmployeeIDSELECT+'
FROM
(SELECT PurchaseOrderID, EmployeeID, VendorID
FROM Purchasing.PurchaseOrderHeader) p
PIVOT
(
COUNT (PurchaseOrderID)
FOR EmployeeID IN
('+@vEmployeeIDFOR+')
) AS pvt
ORDER BY pvt.VendorID; '

PRINT @vSQLString

EXECUTE (@vSQLString)

So, a quick rundown of what I did:

1) Create a table variable (@vEmployeeIDTable). Populate it with DISTINCT EmployeeIDs from Purchasing.PurchaseOrderHeader.
2) Declare the following variables:
a) @vEmployeeID – holds the EmployeeID I’m concatenating into the string during the WHILE loop
b) @vEmployeeIDSELECT – holds the EmployeeID string that I’ll use in the SELECT clause of my PIVOT. I separate this one out because I want to concatenate the column aliases just as they were in the BOL example.
c) @vEmployeeIDFOR – holds the EmployeeID string that I use in the FOR clause of my PIVOT. I don’t need column aliases here.
d) @vLoopCounter – holds a counter as I loop through the string concatenation. I use it to help name my column aliases (Emp1, Emp2…). The 1 and 2 are coming from this variable
3) While I have unprocessed rows in my table variable, I loop through with a WHILE
a) Set @vEmployeeID to the minimum EmployeeID that hasn’t been processed. I also concatenate on the brackets I need since these will become column names. (Those brackets were a pain. I kept having to work around them. Another place where Ben-Gan’s method was more elegant)
b) Set @vEmployeeIDSELECT to itself plus the EmployeeID being processed (@vEmployeeID), and then set up the alias (as 'Emp'+@vLoopCounter). Important note: I initialized the variable as an empty string (''). This is so that I’m not trying to concatenate a NULL value to a string on the first go-round.
c) Set @vEmployeeIDFor to itself plus the EmployeeID being processed
d) Update @vEmployeeIDTable to indicate that the EmployeeID has been added to the string variables
e) Update @vLoopCounter so that the next table alias will be the next number
4) Clean up the extra commas at the end of the string variables
5) Put the whole thing together in @vSQLString
a) Place the @vEmployeeIDSELECT variable where it needs to go
b) Place the @vEmployeeIDFOR variable where it needs to go
6) Execute the variable @vSQLString

This is the output:


Okay, not bad. Now, the elegant Itzik Ben-Gan solution:

DECLARE
@cols AS NVARCHAR(MAX),
@sql AS NVARCHAR(MAX);

SET @cols = STUFF(
(SELECT N',' + QUOTENAME(EmployeeID) AS [text()]
FROM (SELECT DISTINCT EmployeeID FROM Purchasing.PurchaseOrderHeader) AS Y
ORDER BY EmployeeID
FOR XML PATH('')),
1, 1, N'');

SET @sql = N'SELECT '+@cols +'
FROM (SELECT VendorID, EmployeeID, PurchaseOrderID
FROM Purchasing.PurchaseOrderHeader) AS D
PIVOT(COUNT(PurchaseOrderID)
FOR EmployeeID IN(' + @cols + N')) AS P
ORDER BY P.VendorID;';

PRINT @sql

EXEC sp_executesql @sql;
GO

I know, right? Elegant. So what did he do?

1) Declared a couple of variables
a) @cols – holds the string of column values for the PIVOT
b) @sql – holds the SQL statement that gets executed
2) Used a FOR XML PATH('') command to concatenate the string. This is cool. The query pulls EmployeeIDs out of a derived table in the FROM clause. He orders by EmployeeID (which is not required), and outputs the result of this query using FOR XML PATH(''). The FOR XML PATH('') clause creates a single row that looks like this:

,[250],[251],[252],[253],[254],[255],[256],[257],[258],[259],[260],[261]

Wow, exactly what we need for the PIVOT. Well, almost. That’s what the STUFF function is for. Getting rid of “almost”.

3) Also, see how he used QUOTENAME to add the brackets he needed?

QUOTENAME(EmployeeID) AS [text()]

4) Then, since that leading comma (,[250]) is not needed, he uses the STUFF command to strip it off. STUFF looks like this:

STUFF ( character_expression , start , length ,character_expression )

a) character_expression – the results of the query containing the FOR XML PATH('') output
b) start – first character
c) length – how many characters to replace with what we’re “stuffing” in. In this case, a length of 1.
d) character_expression – an empty string, which is what’s “stuffed” into the first character expression, eliminating the comma.

Try this to illustrate it much more simply:

SELECT STUFF('abcdef', 1, 1, '');

Your result is: ‘bcdef’. The empty string he specified basically replaces the first character, which in his query is the comma we don’t want. Seriously, I had to run the baby STUFF to understand it properly. The beauty of STUFF over SUBSTRING is that SUBSTRING requires you to tell the function the length of the resulting string, which would require a LEN function over the entire subquery to get it right. STUFF saves you having to execute that bad boy more than once.

5) Finally, he just puts the PIVOT query into @sql, concatenating in @cols where he needs to, and then executes it.

This is his output:

So he didn’t do pretty column aliases, but the important data is the same. And just take a look at the execution plans. That’s where I do feel just a bit deflated. Mine is monstrous. His? TWO queries. TWO! But that’s not the point. The point is, I had a blast figuring out how to write my own dynamic PIVOT. I had even more fun dissecting Itzik Ben-Gan’s method. (Yeah, I know. I’m a dork.) And, you can bet your sweet bippy that I’ll be working to make sure that FOR XML PATH, STUFF, and QUOTENAME all become part of my T-SQL muscle memory.

Recipe for Random Rows

Audrey and I are planning on presenting again at SQL Saturday 41 in Atlanta. I’ve decided to present on the different ways to handle slowly changing dimensions. The term Slowly Changing Dimension comes from data warehousing, but the general idea can be applied to any dataset, not just a dimension in a data warehouse. In a nutshell, you compare a data source to a data target based on a key, determine which rows already exist in the target, update the target for those rows, and insert as new the rows from the source that do not exist in the target.
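In T-SQL terms, the skeleton looks something like this. It’s a sketch with hypothetical source and target tables (note the anti-join handling the “doesn’t exist in the target” half):

-- Hypothetical source/target pair, keyed on SalesRepKey.
-- Step 1: update target rows that already exist in the source.
UPDATE t
SET t.SalesRepName = s.SalesRepName,
    t.Region = s.Region
FROM dbo.TargetSalesRep t
JOIN dbo.SourceSalesRep s
    ON s.SalesRepKey = t.SalesRepKey;

-- Step 2: insert source rows that don't exist in the target (an anti-join!).
INSERT INTO dbo.TargetSalesRep (SalesRepKey, SalesRepName, Region)
SELECT s.SalesRepKey, s.SalesRepName, s.Region
FROM dbo.SourceSalesRep s
WHERE NOT EXISTS
    (SELECT 1
     FROM dbo.TargetSalesRep t
     WHERE t.SalesRepKey = s.SalesRepKey);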

Rather lengthy intro to a post about an only semi-related topic, so I’ll cut to the chase. I needed a large dataset for my presentation. I was shooting for at least a million rows. I decided to go with Sales Reps and Regions, so I needed many names for my fictional employees. I raided all the baby name sites on the internet for lists of first names. Then I googled for lists of common surnames. I crammed those names into some quick and dirty (Audrey would call them “rogue”) tables and mixed them all up using a cross join into a new table. But my end result displeased me. It did not look “random,” like data loaded from a transactional database would. The rows were remaining stubbornly alphabetized.

Out to GOOGLE I go. Apparently many folks have encountered this exact same quandary. Here’s the solution I found right away: sort by NEWID(). NEWID() creates a uniqueidentifier value. You can use it explicitly, or, as I’m about to do, you can use it in an ORDER BY clause to sort randomly.

So as Audrey stated in her very first blog, this is one of those things that has been out there for a while, but darn it all I didn’t know how to do it until a need for it came up in my life. I used this recipe to create a table with over a million rows.

Again, for the sake of the children, I include my SQL here:

Use tempdb;
go

create table FirstNames( FirstName varchar(50));
go
create table LastNames( LastName varchar(50));
go

create table FullNames( FirstName varchar(50), LastName varchar(50), FullName varchar(101));
go

/*
Raid the internet for baby names and common surnames. There are plenty of choices. For my large set,
I used about 1000 surnames and 1000 first names. Here I'll do ten of each.
I literally copied the names off of pages and pasted them into text pad docs.
I then loaded them into staging tables with SSIS, but you
could also use OPENROWSET, or any other favorite method for cramming data into tables from text. */

insert into FirstNames(FirstName)
Values('Alan'),
('Betty'),
('Charles'),
('Deborah'),
('Edward'),
('Francine'),
('George'),
('Heather'),
('Ian'),
('John')

insert into LastNames(LastName)
values('Addams'),
('Boddington'),
('Clarke'),
('Christopherson'),
('Daniels'),
('Ellington'),
('Jones'),
('Johnson'),
('Smith'),
('Tanner')

--use a cross join (aka cartesian join) to produce one row for every first and last name combination
--possible between the two tables:

select fn.FirstName, ln.Lastname, fn.FirstName +' '+ ln.Lastname as Fullname
from FirstNames fn
cross join LastNames ln

-- boo very alphabetical! This does not look Random. They're all gonna laugh at you Julie!

select fn.FirstName, ln.Lastname, fn.FirstName +' '+ ln.Lastname, NEWID() as RandomnessMaker
from FirstNames fn
cross join LastNames ln
order by NEWID()
--showing you the value of the newid() for demo purposes, you don't need to actually view or store it.

--The actual beautiful insert.
insert into FullNames (FirstName, LastName, FullName)
select fn.FirstName, ln.Lastname, fn.FirstName +' '+ ln.Lastname
from FirstNames fn
cross join LastNames ln
order by NEWID()

--voila! Beautiful random names!
select * from FullNames

Use a Common Table Expression and the ROW_NUMBER() Function to Eliminate Duplicate Rows

Or, removing duplicates with panache…

I think of them as rogue tables.  They’re quick and dirty and cause you a world of hurt before it’s all over.  We’ve all got them.  Like those photos from college that guarantee you’ll never run for public office, rogue tables are best left hidden.  But, you’re always wondering when they’re going to show up in public.  I admit, I have one.  There, I said it.  It’s a config table for our ETL processes that we threw out there at the last minute to handle a data-driven filter on an import process.  Didn’t stop and think about a primary key or constraints, just threw it into the database to get something done before the production push.  Yes, you heard right.  Production.  Oy vey.  

The other day, I was making use of my roguish table to add a few rows (in my development environment, thank the database gods).  Trying to multi-task, I ran a scripted insert on it.  Then I answered the phone, responded to an IM, read an e-mail, and turned around and executed the same blasted statement.  Without constraints of any kind to save my distracted soul, I now had two of each row.  I don’t care what those guys at Wrigley’s say, sometimes two of something doesn’t double the fun.  It did double the headache I was already nursing from a morning status meeting.  

I needed to get those extra rows out with as little pain as possible.  I needed to make it interesting.  Look, I find my thrills wherever I can.  I took a CTE/ROW_NUMBER() approach to finding and removing my duplicate rows.  First, let’s talk about these two constructs.  

Common Table Expressions (CTE)

I’ve heard CTEs described a few different ways:  in-line temp table, in-line view, work area, etc.  What a CTE does is allow you to create a temporary, named result set.  It persists (is scoped) for a single SELECT, INSERT, UPDATE, or DELETE statement.  It is a lot like creating a temporary table or using a table variable, but with about half the hassle.  The syntax is crazy-simple:  

WITH <any name you want> AS 
(
SELECT col1, col2
FROM tblx
)
<Your SELECT, INSERT, UPDATE, or DELETE goes here>; 

Basically, you can prep data to be used in the statement that immediately follows your WITH.  It’s great for any pesky operation that just won’t work well in a single statement.  Personally, I think it is easier to read, too.  One note:  If you’re running multiple statements in a batch, make sure you end the statement just prior to the WITH with a semi-colon.  In fact, just end everything with a semi-colon.  It makes you look detail-oriented.  

ROW_NUMBER()

ROW_NUMBER() falls into the “ranking functions” category.  With this quite functional function, you can number rows in your result set.  Even better, you can PARTITION BY to split your result set up into groups.  I might not want to see 1-10 as my row numbers, I might want to see 1-5 and 1-5 based on some column that I decide to partition the data by.  Note, this is a horizontal partition of rows.  If you’re trying to partition your columns vertically, we might need to talk over a beer or two.  You’ve got bigger issues than duplicate rows.  The syntax takes a little getting used to, but once you break it down, it makes pretty decent sense:  

ROW_NUMBER() OVER (PARTITION BY colx, coly… ORDER BY colz) as aliasname

Let’s take a closer look:  

  • ROW_NUMBER() – you’re instructing the query engine to give you back a column with row numbers.  These come back as a bigint.  
  • OVER – you’re telling it that you’re about to give it some more information.  Specifically, an ORDER BY and an optional PARTITION BY. 
  • PARTITION BY – you’re providing instructions about how to group the rows.  You can partition by multiple columns.  This works a little like a GROUP BY clause.
  • ORDER BY – what order do you want your rows numbered in?  If you have a PARTITION BY, it’ll order within each partition.  If you’ve left the PARTITION BY out, it’ll order the entire result set
  • alias – you’re going to have to alias this new column so that you can reference it later on  

Now that we’re all CTE and ROW_NUMBER() experts, let’s talk about how we put these guys to work to undo my bone-headed duplicate row insert.  I’m scripting an example here, with bonus semi-witty comments.

--Create the rogue table
IF EXISTS (SELECT * FROM sys.tables WHERE name = N'TableOfShame')
BEGIN
    DROP TABLE TableOfShame
END

CREATE TABLE TableOfShame
(
    ShameCode varchar(4) NULL,
    ShameType varchar(15) NULL,
    ShamePriority varchar(10) NULL
);

--Insert the rows you really wanted in your table
INSERT INTO TableOfShame
VALUES
    ('01', 'Chagrin', 'Low'),
    ('02', 'Disgust', 'High'),
    ('03', 'Abashment', 'Medium'),
    ('04', 'Embarrassment', 'Low'),
    ('05', 'Humiliation', 'Medium');

/* Answer the phone, check your e-mail, listen to your co-worker tell hilarious story, get generally distracted */

--Oops, insert them again (Note the sleek and modern Table Value Constructor)
INSERT INTO TableOfShame
VALUES
    ('01', 'Chagrin', 'Low'),
    ('02', 'Disgust', 'High'),
    ('03', 'Abashment', 'Medium'),
    ('04', 'Embarrassment', 'Low'),
    ('05', 'Humiliation', 'Medium');

--Look what you've done!  Damn that funny anecdote that completely derailed your train of thought.
SELECT * FROM TableOfShame;

--Find the duplicates, give them something differentiating (a row number!)
/* Based on my made-up business rules, I've partitioned by something resembling a business key.  It'll give me unique groups, which is sort of an oxymoron, but you know what I mean.  */
WITH cte_FindDuplicateShame as
(
SELECT ShameCode, ShameType, ShamePriority,
ROW_NUMBER() over(PARTITION BY ShameCode, ShameType ORDER BY ShameCode DESC) as RowNum
FROM dbo.TableOfShame
)
SELECT ShameCode, ShameType, ShamePriority, RowNum
FROM cte_FindDuplicateShame
ORDER BY ShameCode, ShameType, RowNum;

--Now, we know what we have, let's delete the duplicates
/* Note that I'm actually issuing the DELETE against the CTE.  Keep in mind that the CTE is only a temporary, named result set off of a physical table (sort of like an in-line view).  Running the DELETE against the CTE will affect the physical table that was used to create the result set. */

WITH cte_FindDuplicateShame as
(
SELECT ShameCode, ShameType, ShamePriority,
ROW_NUMBER() over(PARTITION BY ShameCode, ShameType ORDER BY ShameCode DESC) RowNum
FROM dbo.TableOfShame
)
DELETE cte_FindDuplicateShame
WHERE RowNum <> 1;

--Aha!  Distraction-created rows are gone.
SELECT *
FROM TableOfShame
ORDER BY ShameCode;

So there you have it.  A mildly interesting way to get myself out of the hole I dug by getting F5 happy.  CTE on, my friends.