python - Random lines from subfolders in Python

Question

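I have many tasks stored in .txt files across several subfolders. I am trying to pick a total of 10 tasks at random from these folders, the files they contain, and finally the lines of text within those files. A selected line should be deleted or marked so that it is not picked again on the next run. This may be too broad a question, but I would appreciate any input or direction.

Here is my code so far: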

#!/usr/bin/python  
import random   
with open('C:\\Tasks\\file.txt') as f:  
    lines = random.sample(f.readlines(),10)    
print(lines)

score 15 · Accepted Answer

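Here is a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

First up is the sample function. It uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there selected only a single line rather than several). It selects values from an iterable of lines and only needs to keep the currently selected lines in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.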

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

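Edit: In another question, user @DzinX noticed that using insert in this code makes performance poor (O(N^2)) when you are sampling a very large number of values. He posted an improved version that avoids that problem. /Edit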
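For illustration, here is a minimal sketch of one way to avoid insert. It is not necessarily the version @DzinX posted; the function name random_sample_no_insert is made up for this example. It fills the reservoir by appending, replaces slots at a decreasing rate, and shuffles once at the end so that slices of the result are still random samples.

```python
import random

def random_sample_no_insert(n, items):
    """Reservoir-sample n values from an iterable without list.insert."""
    results = []
    for i, v in enumerate(items):
        if i < n:
            results.append(v)         # fill the reservoir with the first n items
        else:
            r = random.randint(0, i)  # inclusive on both ends
            if r < n:
                results[r] = v        # replace a slot with probability n / (i + 1)

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    random.shuffle(results)           # randomize order once, in O(n)
    return results
```

Each step is O(1) apart from the final shuffle, so the whole pass is linear in the number of items.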

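Next, we need to build an appropriate iterable of items for the function to sample from. Here is how I would do it with a generator. This code keeps only one file open at a time and never needs more than one line in memory at once. The optional exclude parameter, if present, should be a set containing the lines selected on previous runs (which should therefore not be yielded again).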

import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                    for line in f:
                        cleanLine = line.strip()
                        if exclude is None or cleanLine not in exclude:
                            yield cleanLine

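Now all we need is a wrapper function to tie those two pieces together (and to manage the set of seen lines). It can return either a single sample of size n or a list of count samples, taking advantage of the fact that a slice of a random sample is itself a random sample.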

_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]

Here is how it can be used:

def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!
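Note that _seen lives only in memory, so it is forgotten when the script exits. Since the question asks that lines not be picked again on the next run, the set would have to be stored somewhere. A minimal sketch, assuming a JSON file is acceptable; the file name seen_tasks.json and the helpers load_seen and save_seen are hypothetical and not part of the answer above:

```python
import json
import os

SEEN_FILE = "seen_tasks.json"  # hypothetical location for the saved set

def load_seen(path=SEEN_FILE):
    """Return the previously selected lines, or an empty set on the first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))

def save_seen(seen, path=SEEN_FILE):
    """Write the selected lines to disk as a JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen), f)
```

One way to wire it in would be to call _seen.update(load_seen()) at startup and save_seen(_seen) after each get_sample call. Because exclusion is by line text, identical task lines in different files are treated as the same task.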

score 4 · Accepted Answer

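To get a proper random distribution across all these files, you need to view them as one big set of lines and pick 10 at random. In other words, you have to read all of these files at least once, if only to find out how many lines you have.

You do not need to hold all the lines in memory, however. Do this in two stages: index your files to count the number of lines in each, then pick 10 random lines to read from those files.

First, the indexing: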

import os

root_path = r'C:\Tasks'  # raw string: backslashes must not be doubled
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(dirpath, filename)
        file_indices[total_lines] = path
        total_lines += linecount(path)

offsets = list(file_indices.keys())
offsets.sort()

This gives you a mapping from line offsets to filenames, plus a total line count. Now pick ten random indices and read those lines from your files:

import random
import bisect

tasks = list(range(total_lines))
task_indices = random.sample(tasks, 10)

for index in task_indices:
    # find the closest file index
    file_index = offsets[bisect.bisect(offsets, index) - 1]
    path = file_indices[file_index]
    curr_line = file_index
    with open(path) as f:
        while curr_line <= index:
            task = f.readline()
            curr_line += 1
    print(task)
    tasks.remove(index)
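To see how the bisect lookup turns a global line index into a file and a position within it, here is a small worked example. The three file names and their line counts (4, 3 and 5) are made up for illustration, as is the helper locate.

```python
import bisect

# Hypothetical index: three files holding 4, 3 and 5 lines respectively,
# so they start at global line offsets 0, 4 and 7.
file_indices = {0: "a.txt", 4: "b.txt", 7: "c.txt"}
offsets = sorted(file_indices)

def locate(index):
    """Map a global line index to (path, zero-based line number in that file)."""
    file_index = offsets[bisect.bisect(offsets, index) - 1]
    return file_indices[file_index], index - file_index

print(locate(0))   # ('a.txt', 0)
print(locate(4))   # ('b.txt', 0)
print(locate(6))   # ('b.txt', 2)
print(locate(11))  # ('c.txt', 4)
```

bisect.bisect returns the insertion point to the right of any equal entry, so subtracting 1 always lands on the largest offset that is less than or equal to the index.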

Note that you only need to do the indexing once; you can store the result somewhere and update it only when your files change.
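As a rough sketch of storing the index, assuming a JSON file is good enough: the file name task_index.json and the helpers save_index and load_index are hypothetical. JSON object keys are always strings, so the offsets are saved as a list of [offset, path] pairs to keep them as integers.

```python
import json

def save_index(file_indices, total_lines, path="task_index.json"):
    """Store the offset-to-filename mapping and the total line count."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"total_lines": total_lines,
                   "files": sorted(file_indices.items())}, f)

def load_index(path="task_index.json"):
    """Return (file_indices, total_lines) as saved by save_index."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    file_indices = {offset: name for offset, name in data["files"]}
    return file_indices, data["total_lines"]
```

Deciding when the saved index is stale (for example by comparing file modification times) is left out of this sketch.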

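Also note that your tasks are now "stored" in the tasks list. These are indices of lines in your files, and I remove an index from that variable when printing the selected task. The next time you run the random.sample() selection, the tasks picked previously are no longer available to be picked. This structure needs updating if your files ever change, because the indices have to be recalculated. file_indices will help you with that task, but that is outside the scope of this answer. :-)

If you need only one 10-item sample, use Blckknght's solution instead, as it makes only one pass through the files. If you need multiple samples, this solution only requires 10 extra file openings each time you need a sample, and it does not scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)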

score 0 · Accepted Answer

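Edit: On closer scrutiny, this answer does not fit the bill. Reworking it led me to the reservoir sampling algorithm that @Blckknght used in his answer. So please ignore this answer.

There are a few ways to do it. Here is one...

Get a list of all task files
Select one at random
Select a single line at random from that file
Repeat until we have the desired number of lines

The code...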

import os
import random

def file_iterator(top_dir):
    """Gather all task files"""
    files = []
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            path = os.path.join(dirpath, filename)
            files.append(path)
    return files


def random_lines(files, number=10):
    """Select a random file, select a random line until we have enough
    """
    selected_tasks = []

    while len(selected_tasks) < number:
        f = random.choice(files)
        with open(f) as tasks:
            lines = tasks.readlines()
            l = random.choice(lines)
            selected_tasks.append(l)
    return selected_tasks


## Usage
files = file_iterator(r'C:\Tasks')
random_tasks = random_lines(files)
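The answer does not spell out why it falls short. One likely reason (an interpretation, not the author's stated one) is that choosing a file first and then a line gives lines in short files a much higher chance of selection than lines in long files; the loop can also return the same line twice. The sketch below computes the per-line probability of a single draw for a made-up folder with two files; the helper per_line_probability and the file names are hypothetical.

```python
from fractions import Fraction

def per_line_probability(line_counts):
    """Chance that one draw picks a given line, under file-then-line selection."""
    n_files = len(line_counts)
    return {name: Fraction(1, n_files) * Fraction(1, count)
            for name, count in line_counts.items()}

# Hypothetical folder: one short file and one long file, 100 lines in total.
probs = per_line_probability({"short.txt": 2, "long.txt": 98})
print(probs["short.txt"])  # 1/4
print(probs["long.txt"])   # 1/196
```

A uniform draw over all 100 lines would give every line a probability of 1/100, which is what reservoir sampling over the combined lines achieves.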
