February 11, 2008 at 3:11PM From my ~/bin: find-duplicates
This is is starting to become a bit of a series! I needed a small, sharp tool for finding what duplicate files I had sitting on my external harddrive. I’ve lot of stuff which, over time, has got inadvertently duplicated, or downloaded twice, such as photos, software archives, sound files, papers and articles, &c. It needed to be fast and small, capable of having several directories searched at once (e.g., my home directory and some directory on the external HDD), and grepable. Here’s the result:
#!/usr/bin/env python
#
# find-duplicates
# by Keith Gaughan <http://talideon.com/>
#
# Finds an lists any duplicate files in the given directories.
#
# Copyright (c) Keith Gaughan, 2008.
# All Rights Reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# This license is subject to the laws and courts of the Republic of Ireland.
#
from __future__ import with_statement
import sys
import os
import hashlib
import getopt
import filecmp
USAGE = "Usage: %s [-h] [-m<crc|md5>] <dir>*"
class crc:
"""
Wraps up zlib.crc32 to make it suitable for use as a faster but less
accurate alternative to the hashlib.* classes.
"""
def __init__(self, initial=None):
self.crc = 0
if initial is not None:
self.update(initial)
def update(self, block):
import zlib
self.crc = zlib.crc32(block, self.crc)
def hexdigest(self):
return "%X" % self.crc
def digest(self):
# Er...
return self.crc
def all_files(*tops):
"""Lists all files in the given directories."""
for top in tops:
for dirname, _, filenames in os.walk(top):
for f in filenames:
path = os.path.join(dirname, f)
if os.path.isfile(path):
yield path
def digest(file, method=hashlib.md5):
with open(file) as f:
h = method(f.read()).digest()
return h
def true_duplicates(files):
"""
Compare the given files, breaking them down into groups with identical
content.
"""
while len(files) > 1:
next_set = []
this_set = []
master = files[0]
this_set.append(master)
for other in files[1:]:
if filecmp.cmp(master, other, False):
this_set.append(other)
else:
next_set.append(other)
if len(this_set) > 1:
yield this_set
files = next_set
def group_by(groups, grouper, min_size=1):
"""Breaks each of the groups into smaller subgroups."""
for group in groups:
subgroups = {}
for item in group:
g = grouper(item)
if not subgroups.has_key(g):
subgroups[g] = []
subgroups[g].append(item)
for g in subgroups.itervalues():
if len(g) >= min_size:
yield g
def usage(message=None):
global USAGE
fh = sys.stdout
exit_code = 0
if message:
fh = sys.stderr
exit_code = 2
print >>fh, str(message)
name = os.path.basename(sys.argv[0])
print >>fh, USAGE % (name,)
sys.exit(exit_code)
def main():
try:
opts, paths = getopt.getopt(sys.argv[1:], "hm:")
except getopt.GetoptError, err:
usage(err)
method = crc
for o, a in opts:
if o == "-m":
if a == "crc":
method = crc
elif a == "md5":
method = hashlib.md5
else:
usage("Unknown grouping method: %s" % (a,))
elif o == "-h":
usage()
else:
usage("Unknown option: %s%s" % (o, a))
if len(paths) == 0:
paths = ["."]
first = True
groups = [all_files(*paths)]
for grouper in [os.path.getsize, lambda file: digest(file, method)]:
groups = group_by(groups, grouper, 2)
for group in groups:
for files in sorted(true_duplicates(group)):
if first:
first = False
else:
print
for file in files:
print file
if __name__ == "__main__":
main()
It’s pretty simple, really. It walks the directories listed on the command line, sorting the files it finds into groups of files likely to be similar. The default criterion is to group them based on file size, which is quick and usually a good indicator that the files are the same. If it’s not, such as if you’re dealing with a bunch of RAW images or TARGA files that are all the same size, other criteria can be used, such as the result of a CRC (Adler-32, specifically) ran on the files, or their MD5 hash. CRCs might be useless for hashing, but they’re much quicker than a proper hash, and give a reasonable indication of whether the files are same. After that, each file in the group is explicitly compared for equality to ensure that they really are duplicates, and broken into subgroups. This ensures that you won’t accidentally delete anything that might appear to be a duplicate, but really isn’t.
When it’s finds identical files, they’re listed together, one per line, in groups separated by empty lines.
Update (Feb 12th): Got rid of some stupid inefficiencies. I’d forgotten that the read() method on filehandles will read everything if no argument’s provided. This speeds it up somewhat when using the CRC or MD5 methods. For the size-based method, I think I’m going to introduce some extra grouping based on one of the other two methods to avoid needless file comparisons. Oh, and whittle() returns a generator now rather than a list.
Update (Feb 14th): Threat carried out! whittle() has been replaced by a more general method called group_by() which is similar but accepts an iterable of iterables instead of an iterable, and rather than returning a generator, it is a generator. Also, after the initial grouping by size, it groups by CRC, though this can be changed to use MD5 with a switch. Now the initial cheap grouping by size is always done, and the more expensive grouping methods (by hash/checksum, then direct comparison) are done after. This is much faster.
1 On February 12, 2008 at 5:18, Revence wrote:
I like this one. First good thing: size is O(1), so you can use it madly, and it is a good guide for ruling out possibility of similarity. Next, CRCs are just fine for such a thing. Most of the times people have used hashes they needed checksums (and wasted lots of cycles). The difference is even often ignored. That was good. Really nice.
I would have replaced one I wrote with this one, but the other was for weeding out duplicated MP3s in a mountain I wasn’t going to tackle by hand. It had to watch that ID3 tags (both v1 and v2) didn’t mislead it. In any case, it was on a music player, which didn’t care about ID3v2, so the script also cut off the ID3v2 (need every meg on a player). Argh; unnecessary tangent. :o)
Anyway, it is good to see such beautiful Python from the guy who told me to do Python (back, back then, I guess you’ve already forgotten that mail that was, for me, a very pivotal point). Even though I moved on to Ruby, still. :o)