On Overcoming My Fear of PIVOT

I’m intimidated by PIVOT.  I’ve had a heck of a time wrapping my head around it, which is shameful, because Junior Accountants have been making pivot charts in Excel for years.  They get it, so why can’t I?  Well, I’ve got a few theories, mostly related to my occasional fear of unfamiliar things, and of feeling dumb.  Anyway, I finally got into a situation where I couldn’t avoid it, and I had to dig in there and learn it.  Nothing like a deadline to make you act like a proper student. 

I went to BOL, and looked it up.  Now, I’m a fan of Books Online.  It saves my tush daily.  But in this case… I’m sorry, but the explanation is nonsensical.  I mean, I read it, and what I comprehend is, “blah, blah, PIVOT, blah, you’re an idiot, Audrey, just give up now”. 

So, being forced to use a PIVOT, I had to break it down into chunks that my tiny brain could consume.  So, first, let’s look at the BOL syntax: 

SELECT <non-pivoted column>,

    [first pivoted column] AS <column name>,

    [second pivoted column] AS <column name>,

    …

    [last pivoted column] AS <column name>

FROM

    (<SELECT query that produces the data>)

    AS <alias for the source query>

PIVOT

(

    <aggregation function>(<column being aggregated>)

FOR

[<column that contains the values that will become column headers>]

    IN ( [first pivoted column], [second pivoted column],

    … [last pivoted column])

) AS <alias for the pivot table>

<optional ORDER BY clause>;
Hoo-kay.  I’m going to step you through my process of understanding this so I could construct my own PIVOT.  I’m even going to use the complex pivot example from BOL, the AdventureWorks2008 database.  We’re going in this order:  FROM, PIVOT, FOR, SELECT. 

But first, some rules.  There are always rules: 

RULES: 
1) You have to know how many columns you’re going to end up with after the PIVOT.  This means that this operation is great for things like months in a year, not so great for a varying number of pivoted columns.  You can tell it which columns to return, but the bottom line is you need to know what your output should look like.  If you want to break this rule, you’re writing dynamic SQL. 
2) You’re going to have to aggregate.  Even if you don’t really want to.  It’s required, but as always, there are ways to work the syntax.

THE BOL QUERY EXAMPLE: 

SELECT VendorID, [250] AS Emp1, [251] AS Emp2, [256] AS Emp3, [257] AS Emp4, [260] AS Emp5
FROM
(SELECT PurchaseOrderID, EmployeeID, VendorID
FROM Purchasing.PurchaseOrderHeader) p
PIVOT
(
COUNT (PurchaseOrderID)
FOR EmployeeID IN
( [250], [251], [256], [257], [260] )
) AS pvt
ORDER BY pvt.VendorID;

THE BOL QUERY OUTPUT: 

 

THE BREAKDOWN: 

1) FROM (Source Query):  This is the derived table that lives in the FROM clause.  It produces the data that is going to be aggregated and pivoted.  Write this first.  Get familiar with what data you’re working with.  Don’t forget to give it an alias.  I like the ever-creative “as SourceQuery” to help me remember what that derived table’s doing there in the first place. 

FROM

    (<SELECT query that produces the data>)

    AS <alias for the source query>
   
In the BOL example, this is the Source Query: 

FROM (
SELECT PurchaseOrderID, EmployeeID, VendorID
FROM Purchasing.PurchaseOrderHeader) as p

It returns this: 

This is our raw data.  By the time we get to the bottom of this blog post, we’re going to COUNT PurchaseOrderID’s by EmployeeID, set some EmployeeID’s as column headers, and return what looks like a cross-tab report with VendorID’s as row headers, EmployeeID’s as column headers, and PurchaseOrder COUNT as detail data.  Really.  I promise. 

2) PIVOT (Aggregation/Summarization):  This is where you’re saying how to aggregate, or summarize what will end up in the cells.  Think of it this way:  If this were a spreadsheet, with column headers and row headers, the data produced by the PIVOT clause is the detail data living in the cells.  Now, you don’t always want to aggregate.  Sometimes you don’t have anything to aggregate, you just want to flip your data from rows to columns.  Too bad.  You’re aggregating something.  The solution I’ve seen is to do a MIN or MAX, but to make sure that the MIN or MAX is of a unique thing.  You’ll have to examine your data to see what works for you.  But back to PIVOT…

PIVOT
(
<aggregation function>(<column being aggregated>)

In the BOL example, it looks like this: 

PIVOT
(
COUNT (PurchaseOrderID)

So, what it’s saying is that the “detail” data (think like you’re in Excel for a moment) should be the count of PurchaseOrderID’s.  Simple enough.  But where’s my GROUP BY?  It feels like heresy, aggregating something without a GROUP BY.  Hang in there…

3) FOR (Sort-of GROUP BY):  FOR establishes what will be column headers for the PIVOT-ed (aggregated) data.  One cool thing about it not being a true GROUP BY is that I don’t have to include everything from my Source Query (FROM).  If you look at the BOL example, VendorID from my Source Query (FROM) isn’t included in the PIVOT or FOR clauses.  It’s a pass-through column.  It’s going to be there in the SELECT, and therefore in the output, but it isn’t part of the PIVOT process.  In fact, you don’t have to include VendorID at all.  The data probably wouldn’t make sense, but to each his own, right? 

FOR

[<column that contains the values that will become column headers>]

    IN ( [first pivoted column], [second pivoted column],

    … [last pivoted column])

) AS <alias for the pivot table>

In the BOL example, the query developer chooses to return the number of purchase orders for a specific set of Employees.  Yes, in the example it’s arbitrary, because they return 5 and there are actually 12 distinct EmployeeID’s in the Purchasing.PurchaseOrderHeader table, but I’m not here to judge.  How do they do this?  Like this: 

FOR EmployeeID IN
( [250], [251], [256], [257], [260] )
) AS pvt

This is telling the PIVOT to produce 5 columns, [250], [251], [256], [257], and [260].  (You don’t have to have the brackets, except that “250” wouldn’t be a valid column name without them.)  Those numbers are the actual EmployeeID’s returned from the Source Query.  You’re saying “FOR” an EmployeeID “IN” a specific set of values that were returned in the Source Query (FROM).  You’re essentially establishing a GROUP BY on EmployeeID.  What’s being “grouped” by the FOR clause?  The data that you’re aggregating in the PIVOT clause.  Cool, huh?  The COUNT of PurchaseOrderID’s will be placed underneath the column corresponding to the EmployeeID it belongs to.  Don’t forget to alias the FOR clause.  Something like “IRockBecauseIFiguredThisOut” works well.  🙂 Also, this is where you’re going to close the parenthesis that you opened up in the FROM clause. 

Personal Note:  This clause is one of the reasons I hate this BOL example.  It doesn’t make sense that I would hard-code EmployeeID’s.  A PIVOT example with months or years or something would be a more likely real-world scenario.  Making it an example implies that it’s a good idea, and that every person reading BOL knows not to assume that Employee 257 will be a lifer at Adventure Works.  But like I said, I don’t judge. 

4) SELECT (Presentation):  Why is it that SELECT is always the simplest part of a query?  It seems so important, but it really doesn’t do much.  It’s like the presentation layer of the query.  Here, you’re telling the query what to output.  As long as it was part of the Source Query (FROM), or defined as a column header in the FOR clause, you can include it in the SELECT clause.  In fact, if you’re feeling frisky, you can leave off columns.  The query doesn’t care, because the SELECT is just there to make things pretty. 

SELECT <non-pivoted column>,

    [first pivoted column] AS <column name>,

    [second pivoted column] AS <column name>,

    …

    [last pivoted column] AS <column name>

In the BOL example, it looks like this: 

SELECT VendorID, [250] AS Emp1, [251] AS Emp2, [256] AS Emp3, [257] AS Emp4, [260] AS Emp5

VendorID is a pass-through (non-pivoted) column.  It’s there to supplement the PIVOTed data.  The other columns are the ones we established in the FOR clause.  Just remember that everything you want to work with needs to be included in that Source Query (FROM clause). 

Putting it all together, it looks like this: 

SELECT VendorID, [250] AS Emp1, [251] AS Emp2, [256] AS Emp3, [257] AS Emp4, [260] AS Emp5
FROM
(SELECT PurchaseOrderID, EmployeeID, VendorID
FROM Purchasing.PurchaseOrderHeader) p
PIVOT
(
COUNT (PurchaseOrderID)
FOR EmployeeID IN
( [250], [251], [256], [257], [260] )
) AS pvt
ORDER BY pvt.VendorID;

The output looks like this: 

 

So there you have it.  A peek into my thought process as I worked to overcome my fear of PIVOT.  I’m good now.  I’ll still have to look up the syntax whenever I write it, but at least I won’t break out in to a cold sweat next time.  And next up for me… PIVOT with an unknown/dynamic number of output columns.  Woo-hoo!  Dynamic SQL! 

Query on, my friends.



2 thoughts on “On Overcoming My Fear of PIVOT”

  1. Very good write-up, I just wish more people would utilize pivot rather than relying on case statements so much.

    As for a dynamic pivot…

    Two weeks ago, the mere thought of dynamically pivoting would’ve scared me senseless, however, a project recently came about where the only ‘simple’ solution would be to dynamically pivot. A few variables, a loop or two, and a couple hours later and I had it working, not too bad for something which seemed so daunting. My one suggestion for tackling something like this is to do as you’ve done here; break-out the code into incremental steps and when in doubt reference BOL.

Leave a Reply