Friday, April 29, 2011

Finding duplicate files and removing them.

hello there to all, well, I am not a Programming guru in python but manage to write some of them :) The one I am writing is finding and removing the duplicate files from the folder. I have multiple copies of mp3 files, and same is the case with other files. So if anyone can help solving this problem then I will be thankful. I am using sh1 algorithm.

Thanking you all in advance :)

From stackoverflow
  • I wrote one in Python some time ago -- you're welcome to use it.

    check_path = (lambda filepath, hashes, p = sys.stdout.write:
            (lambda hash = hashlib.sha1 (file (filepath).read ()).hexdigest ():
                    ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                              '   %s\n'
                                              'of %s\n' % (filepath, hashes[hash])))
                     or hashes.setdefault (hash, filepath)))())
    
    scan = (lambda dirpath, hashes = {}: 
                    map (lambda (root, dirs, files):
                            map (lambda filename: check_path (os.path.join (root, filename), hashes), files), os.walk (dirpath)))
    
    ((len (sys.argv) > 1) and scan (sys.argv[1]))
    
    altCognito : +1 Ok, that's pretty neat.
    tgray : I can't follow what's happening there. If you get a chance, could you maybe explain a little of what's going on?
    nosklo : -1: unreadable code
    S.Lott : -1: Lambdas (ToT)
    cobbal : +0: looks like lisp
    Anonymous : I wonder if the author is furious for being downvoted :)
    Joe : +1: looks like lisp and answers the question.
    John Millikin : Anon -- I posted this just to be silly. It's a rather difficult-to-read solution to an easy problem. The original question could probably be solved in a few lines of idiomatic Python, if the original poster bothered to spend a few minutes of quality time with Google.
  • def remove_duplicates(dir):
       unique = []
       for filename in os.listdir(dir)
          if os.path.isfile(filename):
            filehash = md5.md5(file(filename).read()).hexdigest()
            if filehash not in unique: unique.append(filehash)
            else: os.remove(filename)
    

    //edit:

    for mp3 you may be also interested in this topic http://stackoverflow.com/questions/476227/detect-duplicate-mp3-files-with-different-bitrates-and-or-different-id3-tags

    zalew : of course you can replace the md5 hashing with sha1
    Brian : For performance, you should probably change unique to be a set (though it probably won't be a big factor unless there are lots of small files). Also, your code will fail if there is a directory in the dir. Check os.path.isfile() before you process them.
    zalew : yep, this code it's more like a basis. I added isfile as you suggested.
  • Recursive folders version:

    This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths, it will scan all paths recursively and report all duplicates found.

    import sys
    import os
    import hashlib
    
    def chunk_reader(fobj, chunk_size=1024):
        """Generator that reads a file in chunks of bytes"""
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk
    
    def check_for_duplicates(paths, hash=hashlib.sha1):
        hashes = {}
        for path in paths:
            for dirpath, dirnames, filenames in os.walk(path):
                for filename in filenames:
                    full_path = os.path.join(dirpath, filename)
                    hashobj = hash()
                    for chunk in chunk_reader(open(full_path, 'rb')):
                        hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    if duplicate:
                        print "Duplicate found: %s and %s" % (full_path, duplicate)
                    else:
                        hashes[file_id] = full_path
    
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print "Please pass the paths to check as parameters to the script"
    

0 comments:

Post a Comment