python - Fastest way to search for a word in an unindexed text file - Python

Question

Consider a text file with 1.5 million lines and roughly 50 to 100 words per line.

To find the lines containing a word, using os.popen('grep -w word infile') seems to be faster than

for line in infile: 
  if word in line:
    print line

How else could one search for a word in a text file in Python? What is the fastest way to search through such a large unindexed text file?

score 1 · Accepted Answer

I would suggest installing and using the_silver_searcher.

In my test it searched a text file of about 1 GB with roughly 29 million lines and found hundreds of entries for the search term in only 00h 00m 00.73s, i.e. in less than a second.

Here is Python 3 code that uses it to search for a word and count how many times it was found:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

command = ["/usr/local/bin/ag", "-wc", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE).stdout.read()
print("Found entries:", output.rstrip().decode('ascii'))
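
The path /usr/local/bin/ag above is specific to one machine. As a rough sketch, assuming ag is installed somewhere on PATH, the executable can be looked up with shutil.which instead of being hard-coded. The names find_ag and count_matches are hypothetical helpers made up for this illustration; they are not part of ag or the standard library.

```python
import shutil
import subprocess


def find_ag(name="ag"):
    """Return the full path of the ag executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} was not found on PATH")
    return path


def count_matches(word, file):
    """Run `ag -wc word file` and return ag's raw output as stripped text."""
    command = [find_ag(), "-wc", word, file]
    # No check=True here: like grep, ag is expected to exit non-zero
    # when nothing matches, which is not an error for this use case.
    result = subprocess.run(command, stdout=subprocess.PIPE)
    return result.stdout.decode("utf-8", errors="replace").strip()
```

Calling count_matches("some", "/path/to/some/file.txt") should then give the same text that the snippet above prints after "Found entries:".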

This version searches for the word and, if it is found, prints the line number and the actual text of each matching line:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

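# -w restricts matches to whole words
command = ["/usr/local/bin/ag", "-w", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE)

# Decode as UTF-8 and replace undecodable bytes, because the matched
# lines (unlike a plain count) are not guaranteed to be pure ASCII
for line in output.stdout:
    print(line.rstrip().decode('utf-8', errors='replace'))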
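
If installing an external tool is not an option, a pure-Python fallback is possible. Note that `word in line` from the question matches substrings (so "some" also hits "awesome"), whereas grep -w and ag -w match whole words only. The sketch below approximates whole-word matching with a compiled regular expression on bytes, which avoids decoding every line. find_word_lines is a hypothetical helper name, \b in a bytes pattern only treats ASCII letters, digits and underscore as word characters, and this has not been benchmarked against ag or grep, so expect it to be slower.

```python
import re


def find_word_lines(path, word):
    """Yield (line_number, text) for each line containing `word` as a whole word."""
    # Escape the word so regex metacharacters in it are matched literally.
    pattern = re.compile(rb"\b" + re.escape(word.encode("utf-8")) + rb"\b")
    with open(path, "rb") as infile:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory.
        for number, line in enumerate(infile, start=1):
            if pattern.search(line):
                # Only matching lines are decoded, which keeps the hot loop cheap.
                yield number, line.rstrip(b"\r\n").decode("utf-8", errors="replace")
```

Usage would look like `for number, text in find_word_lines("/path/to/some/file.txt", "some"): print(number, text)`.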
