Perhaps this should be categorized as a stupid people trick. On the other hand, it’s a good, cheap way to deduplicate files, assuming you’re running SQL Server and LogParser. This post assumes at least a basic knowledge of T-SQL and LogParser command-line syntax.
Basically, you use LogParser to create a two-column table containing file paths and MD5 hashes like this:
LogParser "SELECT Path, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM D:\some_path\*.*" -i:FS -o:SQL -database:files -driver:"SQL Server" -createTable:ON
Note: the -o parameter directs LogParser to write the output to a SQL Server database table.
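Once that command finishes, a quick query confirms the table was populated. The T_FILE_INFO table and its PATH/MD5 columns come from the command above; the check itself is just an illustrative extra, not part of the original workflow:

-- Sanity check: how many files were hashed, and what do a few rows look like?
SELECT COUNT(*) AS FileCount FROM T_FILE_INFO;
SELECT TOP 10 PATH, MD5 FROM T_FILE_INFO;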
The T-SQL ranking functions make it very easy to pick out duplicate rows. Rather than just listing the duplicates, though, I use DENSE_RANK() to generate statements that delete the duplicate files. The entire script looks something like this:
USE [files]
GO

-- Create table to store file paths and MD5 hashes
IF EXISTS (SELECT * FROM sys.objects WHERE OBJECT_ID = OBJECT_ID(N'[dbo].[T_FILE_INFO]') AND TYPE IN (N'U'))
    DROP TABLE [dbo].[T_FILE_INFO]
GO

CREATE TABLE [dbo].[T_FILE_INFO](
    [PATH] [varchar](512) NULL,
    [MD5]  [varchar](255) NULL
) ON [PRIMARY]
GO

-- Create LogParser command
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = 'LogParser "SELECT PATH, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM ''E:\some_path\*.*''" -i:FS -recurse:1 -o:SQL -database:files -driver:"SQL Server" -createTable:ON';

-- Next, run LogParser to populate the table. Either run from a CMD prompt or use XP_CMDSHELL.
-- Note: this step can be time consuming, since LogParser must generate an MD5 hash for every file specified.
EXEC xp_cmdshell @LogParserCMD;

-- Get duplicate hashes; generate deletion commands
WITH tblCTE (MD5, PATH, Ranking) AS
(
    SELECT MD5, PATH, Ranking = DENSE_RANK() OVER (PARTITION BY MD5 ORDER BY NEWID() ASC)
    FROM T_FILE_INFO
)
SELECT 'DEL "' + PATH + '"'
FROM tblCTE
WHERE Ranking > 1 AND MD5 IS NOT NULL;

-- Review the result set and run it from a command prompt

-- Clean up after yourself
DROP TABLE T_FILE_INFO
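Before running the generated DEL commands, it can also be worth reviewing the duplicate groups themselves. This is an optional query against the same table, not part of the script above:

-- List each hash that appears more than once, with a file count,
-- so the duplicate groups can be reviewed before anything is deleted
SELECT MD5, COUNT(*) AS FileCount
FROM T_FILE_INFO
WHERE MD5 IS NOT NULL
GROUP BY MD5
HAVING COUNT(*) > 1
ORDER BY FileCount DESC;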
Finally, note that the MD5 hashing algorithm is susceptible to hash collisions: two different files can produce the same hash. In this context, that means a risk of unintentionally deleting a file that isn’t actually a duplicate. You could tweak these scripts to also capture file size and match on it, which would reduce the risk of unintentional deletions.
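Here is a rough sketch of what that size-matching tweak might look like. It assumes the FS input format’s Size field and relies on -createTable:ON to recreate T_FILE_INFO with the extra SIZE column; treat it as an illustration rather than a tested script:

-- Sketch: capture file size alongside the hash (assumes T_FILE_INFO is
-- recreated so it includes the extra SIZE column)
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = 'LogParser "SELECT Path, HASHMD5_FILE(Path) AS MD5, Size INTO T_FILE_INFO FROM ''E:\some_path\*.*''" -i:FS -recurse:1 -o:SQL -database:files -driver:"SQL Server" -createTable:ON';
EXEC xp_cmdshell @LogParserCMD;

-- Only rows that agree on BOTH hash and size rank as duplicates
WITH tblCTE (MD5, SIZE, PATH, Ranking) AS
(
    SELECT MD5, SIZE, PATH, Ranking = DENSE_RANK() OVER (PARTITION BY MD5, SIZE ORDER BY NEWID() ASC)
    FROM T_FILE_INFO
)
SELECT 'DEL "' + PATH + '"'
FROM tblCTE
WHERE Ranking > 1 AND MD5 IS NOT NULL;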