0
0

Removing duplicate rows in small batches based on date column

Haidong Ji 发表于 2014年04月08日 12:19 | Hits: 5378
Tag: SQLServer | Technology

Due to double scheduling, some duplicate rows were inserted into the Windows.TableStats tablein JiMetrics. To confirm that the table has duplicates, here is the T-SQL script I used. Note that this same script should work in all other major RDBMS platforms like MySQL, Oracle, etc. Adjust table and column names to fit your needs.

SELECT HostID,
       InstanceID,
       DbName,
       SchemaName,
       TableName,
       CAST (CollectionDate AS DATE),
       COUNT (*)
  FROM Windows.TableStats
GROUP BY HostID,
	 InstanceID,
         DbName,
         SchemaName,
         TableName,
         CAST (CollectionDate AS DATE)
HAVING COUNT (*) > 1

So the duplicates need to be removed. In Microsoft SQL Server, that can be accomplished with a Common Table Expression (CTE) using the ROW_NUMBER ranking function, which is a pretty elegant solution.

However, when using DELETE to remove potentially large number of rows, say millions, it is always advisable to do that in small batches. Otherwise, you are risking running out of temp, log space, or even disk space, not to mention that you are potentially creating bigger and coarser locks than necessary.

So I decided to remove the duplicates on a daily basis to lessen the impact on the database instance. A runningDate variable is used and incremented by one until the desired end date. Within each iteration of the loop, that day’s duplicates are removed. Tweak it to suit your needs.

I think things like this make a good job interview question for a database administrator.

DECLARE @runningDate DATE
SET @runningDate = '20140101'

WHILE (@runningDate < '20140406')
BEGIN
   WITH dupRows
        AS (SELECT RecordID,
                   ROW_NUMBER ()
                      OVER (PARTITION BY HostID,
                                         InstanceID,
					 DbName,
                                         SchemaName,
                                         TableName,
                                         cast (CollectionDate AS DATE)
                            ORDER BY
                               HostID,
			       InstanceID,
                               DbName,
                               SchemaName,
                               TableName)
                      AS RankID
              FROM Windows.TableStats
             WHERE CAST (CollectionDate AS DATE) = @runningDate)
   DELETE dupRows
    WHERE RankID > 1

   SET @runningDate = DATEADD (DAY, 1, @runningdate)
END

原文链接: http://www.haidongji.com/2014/04/07/removing-duplicate-rows-in-small-batches-based-on-date-column/

0     0

我要给这篇文章打分:

可以不填写评论, 而只是打分. 如果发表评论, 你可以给的分值是-5到+5, 否则, 你只能评-1, +1两种分数. 你的评论可能需要审核.

评价列表(0)