Friday, January 31, 2014

Removing duplicate rows in SQL Server

If you are using SQL Server 2005 or greater (we need the CTE (Common Table Expression) functionality) you can in one statement delete duplicate rows from a table. You can look at all columns in the table or you can look at a subset of the columns. The example below uses just one column (EmailAddress), but you can replace that one column with a comma separated list of the columns you want to consider.

Before we go off and start running delete scripts on our data, it is always a good idea to make sure you have a backup of the table so you can check you work after or restore if something goes wrong.

Let's first look at the data

-- see what rows have duplicates and which ones don't
SELECT  EmailAddress
      , row_number() OVER ( PARTITION BY EmailAddress ORDER BY ID) AS NumInstances
FROM    Person

-- show only the rows that are duplicates
with myTable as (
SELECT  EmailAddress
      , row_number() OVER ( PARTITION BY EmailAddress ORDER BY ID) AS NumInstances
FROM    Person
)
select * FROM myTable
    WHERE   NumInstances > 1

The actual delete statement

-- do the actual deleting of the duplicate rows
with myTable as (
SELECT  EmailAddress
      , row_number() OVER ( PARTITION BY EmailAddress ORDER BY ID) AS NumInstances
FROM    Person
)
DELETE  FROM myTable
    WHERE   NumInstances > 1

To modify the code to fit your situation you will typically just need to replace the items in red with the comma separated list of columns you want to consider. Next change the table name in blue to the table you are trying to remove the duplicates from. Finally, change the column in orange to be one or more columns (separated by comma). In this case, I have a primary key so, I used it. You could use all the columns you are considering as well.

For more info on how it works and pre-SQL 2005 solutions see here.

No comments: