From my ~/bin: find-duplicates

February 11, 2008 at 3:11PM

This is starting to become a bit of a series! I needed a small, sharp tool for finding what duplicate files I had sitting on my external hard drive. I’ve a lot of stuff which, over time, has got inadvertently duplicated, or downloaded twice, such as photos, software archives, sound files, papers and articles, &c. It needed to be fast and small, capable of having several directories searched at once (e.g., my home directory and some directory on the external HDD), and grepable. Here’s the result:

#!/usr/bin/env python
#
# find-duplicates
# by Keith Gaughan <http://talideon.com/>
#
# Finds and lists any duplicate files in the given directories.
#
# Copyright (c) Keith Gaughan, 2008.
# All Rights Reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
#  1. Redistributions of source code must retain the above copyright
#     notice, this list of conditions and the following disclaimer.
#
#  2. Redistributions in binary form must reproduce the above copyright
#     notice, this list of conditions and the following disclaimer in the
#     documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
# THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# This license is subject to the laws and courts of the Republic of Ireland.
#

from __future__ import with_statement
import sys
import os
import hashlib
import getopt
import filecmp


USAGE = "Usage: %s [-h] [-m<crc|md5>] <dir>*"


class crc:
    """
    Wraps up zlib.crc32 to make it suitable for use as a faster but less
    accurate alternative to the hashlib.* classes.
    """
    def __init__(self, initial=None):
        self.crc = 0
        if initial is not None:
            self.update(initial)
    def update(self, block):
        import zlib
        self.crc = zlib.crc32(block, self.crc)
    def hexdigest(self):
        return "%X" % self.crc
    def digest(self):
        # Er...
        return self.crc


def all_files(*tops):
    """Lists all files in the given directories."""
    for top in tops:
        for dirname, _, filenames in os.walk(top):
            for f in filenames:
                path = os.path.join(dirname, f)
                if os.path.isfile(path):
                    yield path


def digest(file, method=hashlib.md5):
    """Digests the contents of the given file using the given method."""
    with open(file, "rb") as f:
        h = method(f.read()).digest()
    return h


def true_duplicates(files):
    """
    Compare the given files, breaking them down into groups with identical
    content.
    """
    while len(files) > 1:
        next_set = []
        this_set = []
        master = files[0]
        this_set.append(master)
        for other in files[1:]:
            if filecmp.cmp(master, other, False):
                this_set.append(other)
            else:
                next_set.append(other)
        if len(this_set) > 1:
            yield this_set
        files = next_set


def group_by(groups, grouper, min_size=1):
    """Breaks each of the groups into smaller subgroups."""
    for group in groups:
        subgroups = {}
        for item in group:
            g = grouper(item)
            if not subgroups.has_key(g):
                subgroups[g] = []
            subgroups[g].append(item)
        for g in subgroups.itervalues():
            if len(g) >= min_size:
                yield g


def usage(message=None):
    global USAGE
    fh = sys.stdout
    exit_code = 0
    if message:
        fh = sys.stderr
        exit_code = 2
        print >>fh, str(message)
    name = os.path.basename(sys.argv[0])
    print >>fh, USAGE % (name,)
    sys.exit(exit_code)


def main():
    try:
        opts, paths = getopt.getopt(sys.argv[1:], "hm:")
    except getopt.GetoptError, err:
        usage(err)
    method = crc
    for o, a in opts:
        if o == "-m":
            if a == "crc":
                method = crc
            elif a == "md5":
                method = hashlib.md5
            else:
                usage("Unknown grouping method: %s" % (a,))
        elif o == "-h":
            usage()
        else:
            usage("Unknown option: %s%s" % (o, a))

    if len(paths) == 0:
        paths = ["."]

    first = True
    groups = [all_files(*paths)]
    for grouper in [os.path.getsize, lambda file: digest(file, method)]:
        groups = group_by(groups, grouper, 2)
    for group in groups:
        for files in sorted(true_duplicates(group)):
            if first:
                first = False
            else:
                print
            for file in files:
                print file


if __name__ == "__main__":
    main()

It’s pretty simple, really. It walks the directories listed on the command line, sorting the files it finds into groups of files likely to be similar. The default criterion is to group them based on file size, which is quick and usually a good indicator that the files are the same. If it’s not, such as if you’re dealing with a bunch of RAW images or TARGA files that are all the same size, other criteria can be used, such as the result of a CRC (CRC-32, via zlib, specifically) run on the files, or their MD5 hash. CRCs might be useless for cryptographic hashing, but they’re much quicker than a proper hash, and give a reasonable indication of whether the files are the same. After that, the files in each group are explicitly compared for equality to ensure that they really are duplicates, and the group is broken into subgroups. This ensures that you won’t accidentally delete anything that might appear to be a duplicate, but really isn’t.
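
As an aside, the crc class above exists purely so that zlib.crc32 can be dropped in wherever digest() expects a hashlib-style constructor. Here’s a quick sketch of that shared interface, assuming the crc class from the script is in scope; the sample string is just made-up data:

import hashlib

# Both expose the constructor-plus-hexdigest() shape that digest() relies
# on; only the underlying checksum/hash algorithm differs.
print crc("some sample data").hexdigest()          # CRC-32: fast, collision-prone
print hashlib.md5("some sample data").hexdigest()  # MD5: slower, collision-resistant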

When it finds identical files, they’re listed together, one per line, in groups separated by empty lines.
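
Because the output is plain, line-oriented text, it’s easy to consume from other scripts. Here’s a rough sketch of reading it back into groups; "dupes.txt" is a hypothetical capture of the tool’s output (e.g. from find-duplicates ~ /some/external/dir > dupes.txt):

# One filename per line; a blank line ends a group.
groups = []
current = []
for line in open("dupes.txt"):
    line = line.rstrip("\n")
    if line:
        current.append(line)
    elif current:
        groups.append(current)
        current = []
if current:
    groups.append(current)

for group in groups:
    print group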

Update (Feb 12th): Got rid of some stupid inefficiencies. I’d forgotten that the read() method on filehandles will read everything if no argument’s provided. This speeds it up somewhat when using the CRC or MD5 methods. For the size-based method, I think I’m going to introduce some extra grouping based on one of the other two methods to avoid needless file comparisons. Oh, and whittle() returns a generator now rather than a list.

Update (Feb 14th): Threat carried out! whittle() has been replaced by a more general method called group_by() which is similar but accepts an iterable of iterables instead of an iterable, and rather than returning a generator, it is a generator. Also, after the initial grouping by size, it groups by CRC, though this can be changed to use MD5 with a switch. Now the initial cheap grouping by size is always done, and the more expensive grouping methods (by hash/checksum, then direct comparison) are done after. This is much faster.
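
To make the shape of group_by() a bit more concrete, here’s a toy run of the same idea on words rather than files, grouping first by length and then by first letter. It assumes the group_by() definition from the script above is in scope, and the words are arbitrary:

words = ["apple", "angle", "melon", "fig", "fog", "pear"]
groups = [words]
# Each pass takes an iterable of groups and yields smaller subgroups,
# dropping any subgroup with fewer than two members, just as the
# script does for files.
for grouper in [len, lambda w: w[0]]:
    groups = group_by(groups, grouper, 2)
for g in groups:
    print g   # ['apple', 'angle'] and ['fig', 'fog'], in whichever order the dicts yield them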

Comments

1 On February 12, 2008 at 5:18, Revence wrote:

I like this one. First good thing: size is O(1), so you can use it madly, and it is a good guide for ruling out the possibility of similarity. Next, CRCs are just fine for such a thing. Most of the time people have used hashes, they needed checksums (and wasted lots of cycles). The difference is even often ignored. That was good. Really nice.

I would have replaced one I wrote with this one, but the other was for weeding out duplicated MP3s in a mountain I wasn’t going to tackle by hand. It had to watch that ID3 tags (both v1 and v2) didn’t mislead it. In any case, it was on a music player, which didn’t care about ID3v2, so the script also cut off the ID3v2 (need every meg on a player). Argh; unnecessary tangent. :o)

Anyway, it is good to see such beautiful Python from the guy who told me to do Python (back, back then, I guess you’ve already forgotten that mail that was, for me, a very pivotal point). Even though I moved on to Ruby, still. :o)