Info

You are currently browsing the Blog weblog archives for the day 14. April 2011.

Calendar
April 2011
S M T W T F S
« Feb   May »
 12
3456789
10111213141516
17181920212223
24252627282930
Categories

Archive for 14. April 2011

Deduplicating files with LogParser and SQL Server

Perhaps this should be categorized as a stupid people trick.  On the other hand, it’s a good, cheap way to deduplicate files assuming you’re running SQL Server and LogParser.  This post assumes at least a basic knowledge of T-SQL and LogParser command line syntax.

Basically, you use LogParser to create a two-column table containing file paths and MD5 hashes like this:

LogParser “SELECT path, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM D:\some_path\*.*” -i:FS  -o:SQL  -database:files -driver:”SQL Server”  -createTable:ON

The -o parameter directs LogParser to write the output to a SQL Server database table.

The T-SQL  ranking function makes it very easy to find and pick out duplicate rows.  Instead of finding duplicate rows, I use it to generate statements to delete duplicate files.  The entire script looks something like this:


USE [TempDB]
GO

– Create LogParser command
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = ‘LogParser “SELECT PATH, HASHMD5_FILE(Path) AS MD5 INTO #T_FILE_INFO FROM ”E:\some_path\*.*”” -i:FS -recurse:1 -o:SQL  -database:files -driver:”SQL Server”  -createTable:ON’

– Create table to store file paths and MD5 hashes
IF  EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N’[dbo].[#T_FILE_INFO]’) AND TYPE IN (N’U’))
DROP TABLE [dbo].[#T_FILE_INFO]
GO
CREATE TABLE [dbo].[#T_FILE_INFO](
[PATH] [varchar](512) NULL,
[MD5] [varchar](255) NULL
)
ON [PRIMARY]
GO
– run LogParser to populate table.  Either run from CMD prompt or use XP_CMDSHELL

– note: this step can be time consuming, since LogParser must generate an MD5 has for every file specified

EXEC xp_cmdshell @LogParserCMD;

–Get duplicate hashes; generate deletion commands
WITH tblCTE(MD5, PATH, ranking)AS
(SELECT MD5, PATH, Ranking =
DENSE_RANK() OVER(PARTITION BY MD5
ORDER BY NEWID() ASC)FROM #T_FILE_INFO)

SELECT ‘DEL “‘ + PATH + ‘”‘
FROM tblCTE
WHERE Ranking > 1
AND MD5 IS NOT NULL;
– Review the result set and run it from a command prompt

– clean up after yourself
DROP TABLE #T_FILE_INFO

|