Deduplicating files with LogParser and SQL Server
14. April 2011 by admin.
Perhaps this should be categorized as a stupid people trick. On the other hand, it’s a good, cheap way to deduplicate files assuming you’re running SQL Server and LogParser. This post assumes at least a basic knowledge of T-SQL and LogParser command line syntax.
Basically, you use LogParser to create a two-column table containing file paths and MD5 hashes like this:
LogParser "SELECT path, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM D:\some_path\*.*" -i:FS -o:SQL -database:files -driver:"SQL Server" -createTable:ON
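The -i:FS parameter tells LogParser to read from the file system, and the -o:SQL parameter directs it to write the output to a SQL Server database table. With -createTable:ON, LogParser creates the table if it doesn't already exist.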
The T-SQL ranking functions make it very easy to find and pick out duplicate rows. Instead of just finding duplicate rows, I use DENSE_RANK() to number the files that share a hash and generate statements to delete every copy but one. The entire script looks something like this:
USE [files]
GO
-- Create table to store file paths and MD5 hashes
-- note: this is a permanent table in the database named by -database; LogParser connects
-- in its own session, so it cannot see a local #temp table created by this script
IF OBJECT_ID(N'[dbo].[T_FILE_INFO]', N'U') IS NOT NULL
DROP TABLE [dbo].[T_FILE_INFO]
GO
CREATE TABLE [dbo].[T_FILE_INFO](
[PATH] [varchar](512) NULL,
[MD5] [varchar](255) NULL
) ON [PRIMARY]
GO
-- Create LogParser command
-- note: declared after the last GO, because a variable does not survive the end of its batch
DECLARE @LogParserCMD VARCHAR(8000);
SET @LogParserCMD = 'LogParser "SELECT PATH, HASHMD5_FILE(Path) AS MD5 INTO T_FILE_INFO FROM ''E:\some_path\*.*''" -i:FS -recurse:1 -o:SQL -database:files -driver:"SQL Server" -createTable:ON'
-- Run LogParser to populate table. Either run from CMD prompt or use XP_CMDSHELL
-- note: this step can be time consuming, since LogParser must generate an MD5 hash for every file specified
EXEC xp_cmdshell @LogParserCMD;
-- Get duplicate hashes; generate deletion commands
WITH tblCTE(MD5, PATH, Ranking) AS
(SELECT MD5, PATH, Ranking =
DENSE_RANK() OVER(PARTITION BY MD5
ORDER BY NEWID() ASC) FROM T_FILE_INFO)
SELECT 'DEL "' + PATH + '"'
FROM tblCTE
WHERE Ranking > 1
AND MD5 IS NOT NULL;
-- Review the result set and run it from a command prompt
-- Clean up after yourself
DROP TABLE [dbo].[T_FILE_INFO]
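If you want to see what the ranking step does before pointing it at real files, here is a minimal sketch that runs the same kind of query against a handful of made-up rows. The table variable @files, its paths, and its hash values are all hypothetical. I've also swapped in ROW_NUMBER() ordered by PATH in place of DENSE_RANK() ordered by NEWID(), purely so that the copy kept from each group is predictable rather than random. The multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical sample data: the paths and hash values are made up for illustration
DECLARE @files TABLE ([PATH] varchar(512) NULL, [MD5] varchar(255) NULL);

INSERT INTO @files ([PATH], [MD5]) VALUES
    ('E:\some_path\a.txt',         'AAAA'),
    ('E:\some_path\copy_of_a.txt', 'AAAA'),
    ('E:\some_path\b.txt',         'BBBB'),
    ('E:\some_path\sub\a.txt',     'AAAA'),
    ('E:\some_path\no_hash.dat',   NULL);  -- stands in for an entry with no hash

-- Number the rows within each group of identical hashes.
-- Row 1 of each group is the copy that is kept; everything after it is a duplicate.
WITH ranked AS
(
    SELECT [PATH], [MD5],
           ROW_NUMBER() OVER (PARTITION BY [MD5] ORDER BY [PATH] ASC) AS Ranking
    FROM @files
    WHERE [MD5] IS NOT NULL
)
SELECT 'DEL "' + [PATH] + '"' AS DeleteCommand
FROM ranked
WHERE Ranking > 1;
```

With this sample data the query should return two DEL commands, one for copy_of_a.txt and one for sub\a.txt. The file b.txt has a unique hash and the row with a NULL hash is filtered out, so neither is touched.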
Posted in Utilities, T-SQL, Windows OS