Use a Common Table Expression and the ROW_NUMBER() Function to Eliminate Duplicate Rows

Or, removing duplicates with panache…

I think of them as rogue tables.  They’re quick and dirty and cause you a world of hurt before it’s all over.  We’ve all got them.  Like those photos from college that guarantee you’ll never run for public office, rogue tables are best left hidden.  But, you’re always wondering when they’re going to show up in public.  I admit, I have one.  There, I said it.  It’s a config table for our ETL processes that we threw out there at the last minute to handle a data-driven filter on an import process.  Didn’t stop and think about a primary key or constraints, just threw it into the database to get something done before the production push.  Yes, you heard right.  Production.  Oy vey.  

The other day, I was making use of my roguish table to add a few rows (in my development environment, thank the database gods).  Trying to multi-task, I ran a scripted insert on it.  Then I answered the phone, responded to an IM, read an e-mail, and turned around and executed the same blasted statement.  Without constraints of any kind to save my distracted soul, I now had two of each row.  I don’t care what those guys at Wrigley’s say, sometimes two of something doesn’t double the fun.  It did double the headache I was already nursing from a morning status meeting.  

I needed to get those extra rows out with as little pain as possible.  I needed to make it interesting.  Look, I find my thrills wherever I can.  I took a CTE/ROW_NUMBER() approach to finding and removing my duplicate rows.  First, let’s talk about these two constructs.  

Common Table Expressions (CTE)

I’ve heard CTE’s described a few different ways:  in-line temp table, in-line view, work area, etc.  What it does is allow you to create a temporary, named result set.  It persists (is scoped) for a single SELECT, INSERT, UPDATE, or DELETE statement.  It is a lot like creating a temporary table or using a table variable, but with about half the hassle.  The syntax is crazy-simple:  

WITH <any name you want> AS 
(
SELECT col1, col2
FROM tblx
)
<Your SELECT, INSERT, UPDATE, or DELETE goes here>; 

Basically, you can prep data to be used in the statement that immediately follows your WITH.  It’s great for any pesky operation that just won’t work well in a single statement.  Personally, I think it is easier to read, too.  One note:  If you’re running multiple statements in a batch, make sure you end the statement just prior to the WITH with a semi-colon.  In fact, just end everything with a semi-colon.  It makes you look detail-oriented.  

ROW_NUMBER()

ROW_NUMBER() falls into the “ranking functions” category.  With this quite functional function, you can number rows in your result set.  Even better, you can PARTITION BY to split your result set up into groups.  I might not want to see 1-10 as my row numbers, I might want to see 1-5 and 1-5 based on some column that I decide to partition the data by.  Note, this is a horizontal partition of rows.  If you’re trying to partition your columns vertically, we might need to talk over a beer or two.  You’ve got bigger issues than duplicate rows.  The syntax takes a little getting used to, but once you break it down, it makes pretty decent sense:  

ROW_NUMBER() OVER (PARTITION BY colx, coly… ORDER BY colz) as aliasname

Let’s take a closer look:  

  • ROW_NUMBER() – you’re instructing the query engine to give you back a column with row numbers.  These come back as a bigint.  
  • OVER – you’re telling it that you’re about to give it some more information.  Specifically, an ORDER BY and an optional PARTITION BY. 
  • PARTITION BY – you’re providing instructions about how to group the rows.  You can partition by multiple columns.  This works a little like a GROUP BY clause.
  • ORDER BY – what order do you want your rows numbered in?  If you have a PARTITION BY, it’ll order within each partition.  If you’ve left the PARTITION BY out, it’ll order the entire result set
  • alias – you’re going to have to alias this new column so that you can reference it later on  

Now that we’re all CTE and ROW_NUMBER() experts, let’s talk about how we put these guys to work to undo my bone-headed duplicate row insert.  I’m scripting an example here, with bonus semi-witty comments.

–Create the rogue table
IF EXISTS (SELECT * FROM sys.tables WHERE name = N’TableOfShame’)
BEGIN
    DROP TABLE TableOfShame
END

CREATE TABLE TableOfShame
(
    ShameCode varchar(4) NULL,
    ShameType varchar(15) NULL,
    ShamePriority varchar(10) NULL
);

–Insert the rows you really wanted in your table
INSERT INTO TableOfShame
VALUES
    (’01’, ‘Chagrin’, ‘Low’),
    (’02’, ‘Disgust’, ‘High’),
    (’03’, ‘Abashment’, ‘Medium’),
    (’04’, ‘Embarassment’, ‘Low’),
    (’05’, ‘Humiliation’, ‘Medium’);

/* Answer the phone, check your e-mail, listen to your co-worker tell hilarious story, get generally distracted */

–Oops, insert them again (Note the sleek and modern Table Value Constructor)
INSERT INTO TableOfShame
VALUES
    (’01’, ‘Chagrin’, ‘Low’),
    (’02’, ‘Disgust’, ‘High’),
    (’03’, ‘Abashment’, ‘Medium’),
    (’04’, ‘Embarassment’, ‘Low’),
    (’05’, ‘Humiliation’, ‘Medium’);

–Look what you’ve done!  Damn that funny anecdote that completely derailed your train of thought.
SELECT * FROM TableOfShame;

–Find the duplicates, give them something differentiating (a row number!)
/* Based on my made-up business rules, I’ve partitioned by something resembling a business key.  It’ll give me unique groups, which is sort of an oxymoron, but you know what I mean.  */
WITH cte_FindDuplicateShame as
(
SELECT ShameCode, ShameType, ShamePriority,
ROW_NUMBER() over(PARTITION BY ShameCode, ShameType ORDER BY ShameCode DESC) as RowNum
FROM dbo.TableOfShame
)
SELECT ShameCode, ShameType, ShamePriority, RowNum
FROM cte_FindDuplicateShame
ORDER BY ShameCode, ShameType, RowNum;

–Now, we know what we have, let’s delete the duplicates
/*Note that I’m actually issuing the DELETE against the CTE.  Keep in mind that the CTE is only a temporary, named result set off of a physical table (sort of like an in-line view).  Running the DELETE against the CTE will affect the physical table that was used to create the result set. */

WITH cte_FindDuplicateShame as
(
SELECT ShameCode, ShameType, ShamePriority,
ROW_NUMBER() over(PARTITION BY ShameCode, ShameType ORDER BY ShameCode DESC) RowNum
FROM dbo.TableOfShame
)
DELETE cte_FindDuplicateShame
WHERE RowNum <> 1;

–Aha!  Distraction-created rows are gone.  
SELECT *
FROM TableOfShame
ORDER BY ShameCode;

So there you have it.  A mildly interesting way to get myself out of the hole I dug by getting F5 happy.  CTE on, my friends.

7 thoughts on “Use a Common Table Expression and the ROW_NUMBER() Function to Eliminate Duplicate Rows”

  1. Good stuff datachick;

    We use rownumber quite a bit for pagination on our webpages;

    CTE’s are good for lots of stuff; odd self-joining aggregations, hierarchies, and of course the aforementioned shame removal;

    Julie;

  2. Thank you. I found your site too late. I was trying to partition by more than 1 column but didn’t know how to do it until I read your post. I did extra work and views for nothing, but now I know for the future and will clean up my database!

  3. Great article! I’ve been introduced to the consept earlier, but without the CTE and with less interesting examples. Thanks a lot!

Leave a Reply