Removing duplicates from a large table before adding a clustered primary key

The challenge with this one, was to remove over 2,000 duplicate rows from a 600,000,000 row heap, before a multi-column clustered primary key could be added.

(I am so happy to finally be able to document this, from a while ago. The phrase “uniquely challenging” was never so apt.)

My first idea was to create a empty copy of the source table, add a “ignore duplicates” index, then start an INSERT [dbo].[targettable] SELECT * FROM [dbo].[sourcetable]. However, I stopped after an hour as it was going to take too long.

My second idea was to use SSIS to populate the target-table. Again this was just too slow.

My third idea was to just remove the duplicates (2,000 being a much smaller number than 600,000,000). First though, a disclaimer, I only considered this ** high risk ** solution because the data was in a QA environment (not Live), and people were waiting for me.

Step one then, was to populate a temp-table with the duplicates …

SELECT column1, column2, column3, column4
INTO #tmp1
FROM [dbo].[sometablename]
GROUP BY column1, column2, column3, column4
HAVING COUNT(*) > 1;

(“column1” etc were not the real column names by-the-way 🙂 ). Step two was to loop through this list removing duplicates from the permanent table

WHILE exists (SELECT 1 FROM #tmp1)
BEGIN

	DECLARE @1 BIGINT, @2 INT, @3 BIGINT
	DECLARE @4 VARCHAR(10)

	SELECT TOP(1) 
		@1 = column1, 
		@2 = column2, 
		@3 = column3, 
		@4 = column4 
	FROM #tmp1

	DELETE TOP(1) 
	FROM [dbo].[sometablename] 
	WHERE column1 = @1
	AND column2 = @2
	AND column3 = @3
	AND column4 = @4

	DELETE 
	FROM #tmp1
	WHERE column1 = @1
	AND column2 = @2
	AND column3 = @3
	AND column4 = @4

END

Step three was to implement the clustered primary key. (by-the-way I was half expecting this to fail with a duplicates error, but happily the above loop had cleared a single duplicate from each pair, and thats all that was needed).

ALTER TABLE [dbo].[sometablename] 
ADD CONSTRAINT pk_sometablename
 PRIMARY KEY CLUSTERED
 ([column1], [column2], [column3], [column4]);

** just to point out – I consider the loop “high risk” because, an error message, a typo, accidentally running it twice, or faulty logic, could have resulted in disaster 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s