Perhaps this should be categorized as a stupid people trick. On the other hand, it’s a good, cheap way to deduplicate files, assuming you’re running SQL Server and LogParser. This post assumes at least a basic knowledge of T-SQL and LogParser command-line syntax.
Basically, you use LogParser to create a two-column table containing file paths and MD5 hashes like this:
LogParser "SELECT Path, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM D:\some_path\*.*" -i:FS -o:SQL -database:files -driver:"SQL Server" -createTable:ON
Note: the -o parameter directs LogParser to write the output to a SQL Server database table.
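Once that command finishes, a quick query confirms the table was populated. The T_FILE_INFO table and its PATH/MD5 columns come from the command above; the check itself is just an illustrative extra, not part of the original workflow:

-- Sanity check: how many files were hashed, and what do a few rows look like?
SELECT COUNT(*) AS FileCount FROM T_FILE_INFO;
SELECT TOP 10 PATH, MD5 FROM T_FILE_INFO;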
The T-SQL ranking functions make it very easy to pick out duplicate rows. Rather than just listing the duplicates, though, I use DENSE_RANK() to generate statements that delete the duplicate files. The entire script looks something like this:
USE [files]
GO

-- Create table to store file paths and MD5 hashes
IF EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[T_FILE_INFO]') AND TYPE IN (N'U'))
    DROP TABLE [dbo].[T_FILE_INFO]
GO

CREATE TABLE [dbo].[T_FILE_INFO](
    [PATH] [varchar](512) NULL,
    [MD5]  [varchar](255) NULL
) ON [PRIMARY]
GO

-- Create LogParser command
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = 'LogParser "SELECT PATH, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM ''E:\some_path\*.*''" -i:FS -recurse:1 -o:SQL -database:files -driver:"SQL Server" -createTable:ON';

-- Next, run LogParser to populate the table. Either run from a CMD prompt or use XP_CMDSHELL.
-- Note: this step can be time consuming, since LogParser must generate an MD5 hash for every file specified.
EXEC xp_cmdshell @LogParserCMD;

-- Get duplicate hashes; generate deletion commands
WITH tblCTE (MD5, PATH, Ranking) AS
(
    SELECT MD5, PATH, Ranking = DENSE_RANK() OVER (PARTITION BY MD5 ORDER BY NEWID() ASC)
    FROM T_FILE_INFO
)
SELECT 'DEL "' + PATH + '"'
FROM tblCTE
WHERE Ranking > 1 AND MD5 IS NOT NULL;

-- Review the result set and run it from a command prompt

-- Clean up after yourself
DROP TABLE T_FILE_INFO
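Before running the generated DEL commands, it can also be worth reviewing the duplicate groups themselves. This is an optional query against the same table, not part of the script above:

-- List each hash that appears more than once, with a file count,
-- so the duplicate groups can be reviewed before anything is deleted
SELECT MD5, COUNT(*) AS FileCount
FROM T_FILE_INFO
WHERE MD5 IS NOT NULL
GROUP BY MD5
HAVING COUNT(*) > 1
ORDER BY FileCount DESC;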
Finally, note that the MD5 hashing algorithm is susceptible to hash collisions: two different files can produce the same hash. In this context, that means a risk of unintentionally deleting a file that isn’t actually a duplicate. You could tweak these scripts to also capture file size and match on it, which would reduce the risk of unintentional deletions.
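Here is a rough sketch of what that size-matching tweak might look like. It assumes the FS input format’s Size field and relies on -createTable:ON to recreate T_FILE_INFO with the extra SIZE column; treat it as an illustration rather than a tested script:

-- Sketch: capture file size alongside the hash (assumes T_FILE_INFO is
-- recreated so it includes the extra SIZE column)
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = 'LogParser "SELECT Path, HASHMD5_FILE(Path) AS MD5, Size INTO T_FILE_INFO FROM ''E:\some_path\*.*''" -i:FS -recurse:1 -o:SQL -database:files -driver:"SQL Server" -createTable:ON';
EXEC xp_cmdshell @LogParserCMD;

-- Only rows that agree on BOTH hash and size rank as duplicates
WITH tblCTE (MD5, SIZE, PATH, Ranking) AS
(
    SELECT MD5, SIZE, PATH, Ranking = DENSE_RANK() OVER (PARTITION BY MD5, SIZE ORDER BY NEWID() ASC)
    FROM T_FILE_INFO
)
SELECT 'DEL "' + PATH + '"'
FROM tblCTE
WHERE Ranking > 1 AND MD5 IS NOT NULL;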