The challenge with this one was to remove over 2,000 duplicate rows from a 600,000,000 row heap before a multi-column clustered primary key could be added.
(I am so happy to finally be able to document this one from a while back. The phrase “uniquely challenging” was never so apt.)
My first idea was to create an empty copy of the source table, add an “ignore duplicates” index (IGNORE_DUP_KEY), then start an INSERT INTO [dbo].[targettable] SELECT * FROM [dbo].[sourcetable]. However, I stopped it after an hour as it was going to take far too long.
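For the record, that first idea looked roughly like this (a sketch with placeholder names, using SQL Server’s IGNORE_DUP_KEY option):

-- Create an empty copy of the source table's structure.
SELECT *
INTO [dbo].[targettable]
FROM [dbo].[sourcetable]
WHERE 1 = 0;

-- A unique index with IGNORE_DUP_KEY = ON silently discards duplicate rows on insert.
CREATE UNIQUE CLUSTERED INDEX ucx_targettable
ON [dbo].[targettable] ([column1], [column2], [column3], [column4])
WITH (IGNORE_DUP_KEY = ON);

-- Copy everything across; duplicates raise a warning and are skipped.
INSERT INTO [dbo].[targettable]
SELECT * FROM [dbo].[sourcetable];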
My second idea was to use SSIS to populate the target-table. Again, this was just too slow.
My third idea was to just remove the duplicates (2,000 being a much smaller number than 600,000,000). First though, a disclaimer: I only considered this ** high risk ** solution because the data was in a QA environment (not Live), and people were waiting for me.
Step one, then, was to populate a temp-table with the duplicate key values …
SELECT column1, column2, column3, column4
INTO #tmp1
FROM [dbo].[sometablename]
GROUP BY column1, column2, column3, column4
HAVING COUNT(*) > 1;
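(As a sanity check, and because the one-delete-per-key loop below relies on every duplicate being a simple pair, something like this would confirm the counts:)

-- Roughly 2,000 duplicate keys expected.
SELECT COUNT(*) AS duplicate_keys FROM #tmp1;

-- Should return no rows if every duplicate is a pair (i.e. no triples or worse).
SELECT column1, column2, column3, column4
FROM [dbo].[sometablename]
GROUP BY column1, column2, column3, column4
HAVING COUNT(*) > 2;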
(“column1” etc. were not the real column names, by-the-way 🙂.) Step two was to loop through this list, removing duplicates from the permanent table …
DECLARE @1 BIGINT, @2 INT, @3 BIGINT;
DECLARE @4 VARCHAR(10);

WHILE EXISTS (SELECT 1 FROM #tmp1)
BEGIN
    -- Grab any one of the remaining duplicate keys.
    SELECT TOP(1)
        @1 = column1,
        @2 = column2,
        @3 = column3,
        @4 = column4
    FROM #tmp1;

    -- Delete just ONE of the rows sharing this key (each duplicate here was a pair).
    DELETE TOP(1)
    FROM [dbo].[sometablename]
    WHERE column1 = @1
      AND column2 = @2
      AND column3 = @3
      AND column4 = @4;

    -- This key is done; remove it from the work list.
    DELETE FROM #tmp1
    WHERE column1 = @1
      AND column2 = @2
      AND column3 = @3
      AND column4 = @4;
END;
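(Re-running the step-one query at this point would confirm the duplicates were really gone; no rows returned means it is safe to proceed:)

-- Should now return no rows.
SELECT column1, column2, column3, column4
FROM [dbo].[sometablename]
GROUP BY column1, column2, column3, column4
HAVING COUNT(*) > 1;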
Step three was to implement the clustered primary key. (By-the-way, I was half expecting this to fail with a duplicates error, but happily the above loop had cleared a single duplicate from each pair, and that's all that was needed.)
ALTER TABLE [dbo].[sometablename]
ADD CONSTRAINT pk_sometablename
PRIMARY KEY CLUSTERED
([column1], [column2], [column3], [column4]);
** Just to point out: I consider the loop “high risk” because an error message, a typo, accidentally running it twice, or faulty logic could have resulted in disaster 🙂
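(For completeness: given more time to test, a set-based alternative using ROW_NUMBER() would have removed all the duplicates in one statement, with no loop to mistype or accidentally run twice. A sketch, not what I actually ran:)

WITH numbered AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY column1, column2, column3, column4
               ORDER BY (SELECT NULL)) AS rn
    FROM [dbo].[sometablename]
)
DELETE FROM numbered
WHERE rn > 1;   -- keeps one row per key, deletes the rest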