python - 重複ファイルを見つけて削除する

Question

フォルダーから重複ファイルを見つけて削除する Python プログラムを作成しています。

mp3 ファイルとその他のファイルの複数のコピーがあります。sh1 アルゴリズムを使用しています。

これらの重複ファイルを見つけて削除するにはどうすればよいですか?

score 111 · Accepted Answer

最速のアルゴリズム - 受け入れられた回答と比較して 100 倍のパフォーマンス向上 (本当に :))

他のソリューションのアプローチは非常に優れていますが、重複ファイルの重要な特性を忘れています。ファイルサイズは同じです。同じサイズのファイルに対してのみ高価なハッシュを計算すると、CPU が大幅に節約されます。最後に性能比較、ここに説明があります。

@nosklo によって与えられた確実な回答を反復し、@Raffi のアイデアを借りて各ファイルの先頭だけの高速ハッシュを作成し、高速ハッシュの衝突でのみ完全なハッシュを計算します。手順は次のとおりです。

ファイルサイズがキーとなるファイルのハッシュテーブルを作成します。
同じサイズのファイルの場合、最初の 1024 バイトのハッシュでハッシュテーブルを作成します。衝突しない要素はユニークです
最初の 1k バイトのハッシュが同じファイルの場合、コンテンツ全体のハッシュを計算します。一致するファイルは一意ではありません。

コード：

#!/usr/bin/env python
# if running in py3, change the shebang, drop the next import for readability (it does no harm in py3)
from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    file_object = open(filename, 'rb')

    if first_chunk_only:
        hashobj.update(file_object.read(1024))
    else:
        for chunk in chunk_reader(file_object):
            hashobj.update(chunk)
    hashed = hashobj.digest()

    file_object.close()
    return hashed


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue


    # For all files with the same file size, get their hash on the 1st 1024 bytes only
    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue    # this file size is unique, no need to spend CPU cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
                # the key is the hash on the first 1024 bytes plus the size - to
                # avoid collisions on equal hashes in the first part of the file
                # credits to @Futal for the optimization
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except (OSError,):
                # the file access might've changed till the exec point got here 
                continue

    # For all files with the hash on the 1st 1024 bytes, get their hash on the full file - collisions will be duplicates
    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue    # this hash of fist 1k file bytes is unique, no need to spend cpy cycles on it

        for filename in files_list:
            try: 
                full_hash = get_hash(filename, first_chunk_only=False)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except (OSError,):
                # the file access might've changed till the exec point got here 
                continue


if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")

そして、ここが楽しい部分です - パフォーマンスの比較です。

ベースライン -

1047 ファイルのディレクトリ、32 mp4、1015 - jpg、合計サイズ - 5445.998 MiB - つまり、私の携帯電話のカメラ自動アップロードディレクトリ:)
小さな (しかし完全に機能する) プロセッサ - 1600 BogoMIPS、1.2 GHz 32L1 + 256L2 Kbs キャッシュ、/proc/cpuinfo:

プロセッサ: Feroceon 88FR131 rev 1 (v5l) BogoMIPS: 1599.07

(つまり、私のローエンド NAS :)、Python 2.7.11 を実行しています。

したがって、@nosklo の非常に便利なソリューションの出力は次のとおりです。

root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py 
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

real    5m44.198s
user    4m44.550s
sys     0m33.530s

そして、これはサイズチェックのフィルター、次に小さなハッシュ、最後に衝突が見つかった場合の完全なハッシュを含むバージョンです:

root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

real    0m1.398s
user    0m1.200s
sys     0m0.080s

必要な時間の平均を取得するために、両方のバージョンをそれぞれ 3 回実行しました。

したがって、v1 は (user+sys) 284sで、その他は - 2sです。かなりの違いですね :) この増加により、SHA512、またはさらに洗練されたものになる可能性があります-パフォーマンスのペナルティは、必要な計算が少なくなるため緩和されます.

ネガ:

他のバージョンよりも多くのディスクアクセス - すべてのファイルはサイズ統計のために 1 回アクセスされ (これは安価ですが、それでもディスク IO です)、すべての複製は 2 回開かれます (最初の小さな 1k バイトハッシュと完全なコンテンツハッシュのため)。
ハッシュテーブルのランタイムを保存するため、より多くのメモリを消費します

score 47 · Accepted Answer

再帰フォルダーのバージョン:

このバージョンでは、ファイルサイズとコンテンツのハッシュを使用して重複を検出します。複数のパスを渡すことができます。すべてのパスを再帰的にスキャンし、見つかったすべての重複を報告します。

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                for chunk in chunk_reader(open(full_path, 'rb')):
                    hashobj.update(chunk)
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print "Duplicate found: %s and %s" % (full_path, duplicate)
                else:
                    hashes[file_id] = full_path

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print "Please pass the paths to check as parameters to the script"

score 21 · Accepted Answer

def remove_duplicates(dir):
    unique = []
    for filename in os.listdir(dir):
        if os.path.isfile(filename):
            filehash = md5.md5(file(filename).read()).hexdigest()
            if filehash not in unique: 
                unique.append(filehash)
            else: 
                os.remove(filename)

//編集：

MP3 については、このトピックにも興味があるかもしれませんビットレートや ID3 タグが異なる重複 MP3 ファイルを検出しますか?

score 6 · Accepted Answer

しばらく前に Python で書いたものです。どうぞお使いください。

import sys
import os
import hashlib

check_path = (lambda filepath, hashes, p = sys.stdout.write:
        (lambda hash = hashlib.sha1 (file (filepath).read ()).hexdigest ():
                ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                          '   %s\n'
                                          'of %s\n' % (filepath, hashes[hash])))
                 or hashes.setdefault (hash, filepath)))())

scan = (lambda dirpath, hashes = {}: 
                map (lambda (root, dirs, files):
                        map (lambda filename: check_path (os.path.join (root, filename), hashes), files), os.walk (dirpath)))

((len (sys.argv) > 1) and scan (sys.argv[1]))

score 6 · Accepted Answer

より高速なアルゴリズム

「大きなサイズ」の多くのファイル (画像、mp3、pdf ドキュメント) を分析する必要がある場合は、次の比較アルゴリズムを使用すると興味深い/高速になります。

最初の高速ハッシュは、ファイルの最初の N バイト (たとえば 1KB) に対して実行されます。このハッシュは、ファイルが間違いなく異なるかどうかを示しますが、2 つのファイルがまったく同じかどうかは示しません (ハッシュの精度、ディスクから読み取られるデータの制限)。
最初の段階で衝突が発生した場合、より正確でファイルのコンテンツ全体に対して実行される、より遅い 2 番目のハッシュ

このアルゴリズムの実装は次のとおりです。

import hashlib
def Checksum(current_file_name, check_type = 'sha512', first_block = False):
  """Computes the hash for the given file. If first_block is True,
  only the first block of size size_block is hashed."""
  size_block = 1024 * 1024 # The first N bytes (1KB)

  d = {'sha1' : hashlib.sha1, 'md5': hashlib.md5, 'sha512': hashlib.sha512}

  if(not d.has_key(check_type)):
    raise Exception("Unknown checksum method")

  file_size = os.stat(current_file_name)[stat.ST_SIZE]
  with file(current_file_name, 'rb') as f:
    key = d[check_type].__call__()
    while True:
      s = f.read(size_block)
      key.update(s)
      file_size -= size_block
      if(len(s) < size_block or first_block):
        break
  return key.hexdigest().upper()

def find_duplicates(files):
  """Find duplicates among a set of files.
  The implementation uses two types of hashes:
  - A small and fast one one the first block of the file (first 1KB), 
  - and in case of collision a complete hash on the file. The complete hash 
  is not computed twice.
  It flushes the files that seems to have the same content 
  (according to the hash method) at the end.
  """

  print 'Analyzing', len(files), 'files'

  # this dictionary will receive small hashes
  d = {}
  # this dictionary will receive full hashes. It is filled
  # only in case of collision on the small hash (contains at least two 
  # elements)
  duplicates = {}

  for f in files:

    # small hash to be fast
    check = Checksum(f, first_block = True, check_type = 'sha1')

    if(not d.has_key(check)):
      # d[check] is a list of files that have the same small hash
      d[check] = [(f, None)]
    else:
      l = d[check]
      l.append((f, None))

      for index, (ff, checkfull) in enumerate(l):

        if(checkfull is None):
          # computes the full hash in case of collision
          checkfull = Checksum(ff, first_block = False)
          l[index] = (ff, checkfull)

          # for each new full hash computed, check if their is 
          # a collision in the duplicate dictionary. 
          if(not duplicates.has_key(checkfull)):
            duplicates[checkfull] = [ff]
          else:
            duplicates[checkfull].append(ff)

  # prints the detected duplicates
  if(len(duplicates) != 0):
    print
    print "The following files have the same sha512 hash"

    for h, lf in duplicates.items():
      if(len(lf)==1):
        continue
      print 'Hash value', h
      for f in lf:
        print '\t', f.encode('unicode_escape') if \
          type(f) is types.UnicodeType else f
  return duplicates

このfind_duplicates関数はファイルのリストを受け取ります。このようにして、2 つのディレクトリを比較することもできます (たとえば、内容をより適切に同期するため)。指定された拡張子を持つファイルのリストを作成し、一部のディレクトリへの入力を回避する関数の例を以下に示します。

def getFiles(_path, extensions = ['.png'], 
             subdirs = False, avoid_directories = None):
  """Returns the list of files in the path :'_path', 
     of extension in 'extensions'. 'subdir' indicates if 
     the search should also be performed in the subdirectories. 
     If extensions = [] or None, all files are returned.
     avoid_directories: if set, do not parse subdirectories that 
     match any element of avoid_directories."""

  l = []
  extensions = [p.lower() for p in extensions] if not extensions is None \
    else None
  for root, dirs, files in os.walk(_path, topdown=True):

    for name in files:
      if(extensions is None or len(extensions) == 0 or \
         os.path.splitext(name)[1].lower() in extensions):
        l.append(os.path.join(root, name))

    if(not subdirs):
      while(len(dirs) > 0):
        dirs.pop()
    elif(not avoid_directories is None):
      for d in avoid_directories:
        if(d in dirs): dirs.remove(d)

  return l

このメソッドは、たとえばパスを解析しない.svn場合に便利です。これにより、ファイルの衝突が確実にトリガーされfind_duplicatesます。

フィードバックは大歓迎です。

score 3 · Accepted Answer

    import hashlib
    import os
    import sys
    from sets import Set

    def read_chunk(fobj, chunk_size = 2048):
        """ Files can be huge so read them in chunks of bytes. """
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def remove_duplicates(dir, hashfun = hashlib.sha512):
        unique = Set()
        for filename in os.listdir(dir):
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                hashobj = hashfun()
                for chunk in read_chunk(open(filepath,'rb')):
                    hashobj.update(chunk)
                    # the size of the hashobj is constant
                    # print "hashfun: ", hashfun.__sizeof__()
                hashfile = hashobj.hexdigest()
                if hashfile not in unique:
                    unique.add(hashfile)
                else: 
                    os.remove(filepath)

    try:
        hashfun = hashlib.sha256
        remove_duplicates(sys.argv[1], hashfun)
    except IndexError:
        print """Please pass a path to a directory with 
        duplicate files as a parameter to the script."""

score 0 · Accepted Answer

安全にするために（何か問題が発生した場合、それらを自動的に削除すると危険になる可能性があります！）、@zalewの回答に基づいて、私が使用するものを次に示します。

また、md5 sum コードは @zalew のものとはわずかに異なることに注意してください。これは、彼のコードが間違った重複ファイルを生成しすぎたためです(そのため、それらを自動的に削除するのは危険だと言いました!)。

import hashlib, os
unique = dict()
for filename in os.listdir('.'):
    if os.path.isfile(filename):
        filehash = hashlib.md5(open(filename, 'rb').read()).hexdigest()

        if filehash not in unique: 
            unique[filehash] = filename
        else:
            print filename + ' is a duplicate of ' + unique[filehash]

python - 重複ファイルを見つけて削除する

10 に答える 10

最速のアルゴリズム - 受け入れられた回答と比較して 100 倍のパフォーマンス向上 (本当に :))

再帰フォルダーのバージョン:

より高速なアルゴリズム

Related

Reference