python - Find duplicate file names and keep only the newest file using Python

Question

I have more than 20,000 files, all in the same directory:

8003825.pdf
8003825.tif
8006826.tif

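How can I find all duplicate file names, ignoring the file extension?

Clarification: by duplicates I mean files that have the same file name once the extension is ignored. I don't care whether the files are 100% identical (e.g. same hash, same size, etc.).

For example: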

"8003825" appears twice

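Then I want to check the metadata of each duplicate file and keep only the newest one.

Similar to this post:

Keep the newest file and delete all the others

I assume I'll have to build a list of all the files and check whether a file name already exists. If it does, should I use os.stat to determine the modification date?

I'm a little concerned about loading all of those file names into memory, and I'm wondering whether there is a more Pythonic way of doing this...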

Python 2.6 Windows 7

score 7 · Accepted Answer

You can do this in O(n) time. The solutions that rely on sort have O(n*log(n)) complexity.

import os
from collections import namedtuple

directory = '.'  # placeholder: set to '.' or an absolute path of the file directory
os.chdir(directory)  # so the bare file names below can be used as paths

newest_files = {}  # name without extension -> Entry of the newest file seen so far
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cashed_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)

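newest_files is a dictionary whose keys are file names without the extension and whose values are named tuples holding the full file name and the modification date. If a newly encountered file has a name that is already in the dictionary, its date is compared with the one stored in the dictionary and the entry is replaced if necessary.

In the end you have a dictionary containing the newest file for each name.

You can then use this dictionary for a second pass. Note that a dictionary lookup has O(1) complexity on average, so looking up all n files in the dictionary has an overall complexity of O(n).

For example, if you want to keep only the newest file for each name and delete the others, you can do it like this: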

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    newest_file_name = newest_files[name].file_name
    if file_name != newest_file_name:  # it's not the newest with this name
        os.remove(file_name)

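As Blckknght suggests in the comments, you can also avoid the second pass and delete the older file as soon as a newer one is encountered, with a small addition to the first loop. Note that a file that turns out to be older than the cached one has to be deleted as well, otherwise it would survive the single pass:

    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
            os.remove(cashed_file.file_name) #this line added
        else: #the current file is the older one
            os.remove(file_name)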
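
The same single-pass idea can be wrapped in a function that does not depend on os.chdir and that only reports what it would delete. This is a minimal sketch, not the answerer's code: the name find_older_duplicates is hypothetical, and skipping subdirectories and treating equal modification times as "older" are assumptions made for the illustration.

```python
import os


def find_older_duplicates(directory):
    """Return paths of files that have a newer sibling with the same base name.

    Hypothetical helper: 'same base name' means equal after os.path.splitext().
    """
    newest = {}  # base name -> (mtime, path) of the newest file seen so far
    older = []   # paths that lost the comparison
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are ignored
        base = os.path.splitext(file_name)[0]
        mtime = os.path.getmtime(path)
        cached = newest.get(base)
        if cached is None:
            newest[base] = (mtime, path)
        elif mtime > cached[0]:
            older.append(cached[1])  # the cached file is now outdated
            newest[base] = (mtime, path)
        else:
            older.append(path)  # the current file is older (or tied)
    return older
```

Reviewing the returned list first and then calling os.remove(path) on each entry keeps the destructive step separate from the comparison.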

score 2 · Accepted Answer

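First, get a list of the file names and sort it. This puts any duplicates next to each other.

Then strip off the file extension with os.path.splitext() and compare each name with its neighbours. itertools.groupby() may be useful here.

Once you have grouped the duplicates, use os.stat() to pick the one you want to keep.

In the end, your code might look something like this: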

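import os, itertools

base_directory = '.'  # placeholder: the directory that holds the files

def base_name(f):
    return os.path.splitext(f)[0]  # file name without its extension

files = os.listdir(base_directory)
files.sort(key=base_name)  # sort by the grouping key so equal names are adjacent
for k, g in itertools.groupby(files, base_name):
    dups = list(g)
    if len(dups) > 1:
        # figure out which file(s) to remove
        pass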
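
One way to fill in the "figure out which file(s) to remove" step is to keep the file with the largest modification time in each group and treat the rest as removable. The following is only a sketch under assumptions: the function name older_files_by_group is made up here, and os.path.getmtime() is used as a shortcut for reading st_mtime from os.stat().

```python
import itertools
import os


def base_name(file_name):
    """File name without its extension, e.g. '8003825.tif' -> '8003825'."""
    return os.path.splitext(file_name)[0]


def older_files_by_group(base_directory):
    """Return the paths of all files except the newest one in each group."""
    # groupby only merges adjacent items, so sort by the same key first.
    files = sorted(os.listdir(base_directory), key=base_name)
    to_remove = []
    for key, group in itertools.groupby(files, key=base_name):
        paths = [os.path.join(base_directory, f) for f in group]
        if len(paths) > 1:
            newest = max(paths, key=os.path.getmtime)
            to_remove.extend(p for p in paths if p != newest)
    return to_remove
```

The function only returns the candidates; deleting them is left to the caller, for example with os.remove().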

You don't need to worry about memory here. You're looking at something on the order of a couple of megabytes.
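
A rough way to check that estimate, as a sketch only: build 20,000 synthetic names shaped like the ones in the question and add up their sizes with sys.getsizeof(). The names are generated for the illustration, not real data, and the exact figure depends on the Python version and platform.

```python
import sys

# 20,000 synthetic file names shaped like '8003825.tif' (not real data).
names = ['%07d.tif' % n for n in range(8000000, 8020000)]

# Size of the list object itself plus the size of every string in it.
total_bytes = sys.getsizeof(names) + sum(sys.getsizeof(s) for s in names)
total_mb = total_bytes / (1024.0 * 1024.0)
print('%d names use about %.2f MB' % (len(names), total_mb))
```

On a typical CPython build the total should stay within a few megabytes, which is in line with the estimate above.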

score 0 · Accepted Answer

You can use a defaultdict as a file-name counter that stores how many times each file name appears.

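import os
from collections import defaultdict

file_names = os.listdir('.')  # placeholder: list the directory that holds the files

counter = defaultdict(int)  # name without extension -> number of occurrences
for file_name in file_names:
    base_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[base_name] += 1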
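
The counter alone only says how often each name occurs. A small follow-up, sketched here with the sample names from the question, is to filter it for names seen more than once; those are the groups whose modification dates then need comparing. The variable name duplicates is an assumption made for the example.

```python
import os
from collections import defaultdict

# Sample names from the question; in practice this would come from os.listdir().
file_names = ['8003825.pdf', '8003825.tif', '8006826.tif']

counter = defaultdict(int)
for file_name in file_names:
    base_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[base_name] += 1

# Names that occur more than once once the extension is ignored.
duplicates = sorted(name for name, count in counter.items() if count > 1)
print(duplicates)  # ['8003825']
```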
