python - Efficient way to search for strings in a file

Question

I have two files, 'example' and 'inp', as shown below.

Contents of file 'example':

hi      wert    123

jui     fgrt    345

blabla  dfr     233

Contents of file 'inp':

jui
hi

I need to fetch the first column of each line in 'example'. If that string exists in the file 'inp', the whole line from 'example' should be written to another file, out.txt. This is the code I have written:

f=file('example')
f1=file('inp')

for l in f.readlines():
    s=l.split()
    for p in f1.readlines():
            if s[0] in p:
                    print l >> 'out.txt'

I am not getting the expected result. Also, the file 'example' has literally 200000 entries, and I think this kind of program will take too long. Is there a way to do this task correctly and quickly? Thanks.

score 2 · Accepted Answer

How about this? It first loads the inp file, then iterates over the example file and outputs only the lines whose first word is in the list of words read from inp.

with open('inp') as inpf:
    lines = [l.strip() for l in inpf]

with open('example') as exf, open('out.txt', 'w') as outf:
    for l in exf:
        if l.split(' ', 1)[0] in lines:
            print >>outf, l

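You can also use a set to speed up the lookup. A membership test on a set costs O(1) on average, whereas on a list it costs O(n). You only need to change the first with statement to this: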

with open('inp') as inpf:
    lines = set([l.strip() for l in inpf])
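To illustrate the difference, here is a rough sketch that times membership tests against a list and a set built from the same words. The data is made up and the names `words_list`, `words_set`, `queries`, and `count_hits` are hypothetical; absolute timings will vary by machine.

```python
import timeit

# Hypothetical data: 20,000 distinct words standing in for the contents of 'inp'.
words_list = ["w%d" % i for i in range(20000)]
words_set = set(words_list)

# First-column values to look up; half of them are not in the word list.
queries = ["w%d" % i for i in range(0, 40000, 200)]


def count_hits(container):
    """Count how many queries are found in the given container."""
    return sum(1 for q in queries if q in container)


if __name__ == "__main__":
    # Same work, different container: the list is scanned linearly,
    # the set is probed by hash.
    list_time = timeit.timeit(lambda: count_hits(words_list), number=3)
    set_time = timeit.timeit(lambda: count_hits(words_set), number=3)
    print("list: %.4fs, set: %.4fs" % (list_time, set_time))
```

Both containers give the same answer; only the cost per lookup differs, and that gap grows with the number of words in inp.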

Also, if you are using Python 3, use the print function instead of the "old" print statement:

print(l, file=outf)

score 1 · Accepted Answer

A bit of optimization:

Use set for faster search
Split the lines from example just until the first space character
No additional new lines in the output file unlike when using print >> or print()


with open("inp") as f:
    a = set(l.rstrip() for l in f)

with open("out.txt", "w") as o, open("example") as f:
    for l in f:
        if l.split(" ", 1)[0] in a:
            o.write(l)
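The point about extra newlines can be checked with a small sketch. It uses `io.StringIO` as a stand-in for the output file (an assumption, so that the example needs no files on disk) and Python 3's print function.

```python
import io

line = "jui     fgrt    345\n"  # a line as read from a file, newline included

# print() appends its own newline, so a blank line follows the text.
printed = io.StringIO()
print(line, file=printed)

# write() emits the line exactly as it was read.
written = io.StringIO()
written.write(line)

# print(..., end="") suppresses the extra newline and matches write().
printed_no_end = io.StringIO()
print(line, end="", file=printed_no_end)

print(repr(printed.getvalue()))
print(repr(written.getvalue()))
```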

score 1 · Accepted Answer

If 'inp' is a reasonable size, read all of its strings into a set, then iterate over the lines of 'example'.

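(untested sketch)

import sys

# Read every search word from 'inp' into a set, dropping the trailing newline.
words = set()
with open('inp') as inp:
    for line in inp:
        words.add(line.strip())

with open('example') as example:
    for line in example:
        # Compare only the first whitespace-separated column.
        fields = line.split()
        if fields and fields[0] in words:
            sys.stdout.write(line)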

In-memory set lookups are very fast, and each file only has to be read once.
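Reading each file only once also avoids a likely cause of the unexpected result in the question: a file object is exhausted after the first full read, so a second `readlines()` call on the same object returns an empty list. A minimal sketch, using `io.StringIO` in place of a real file (an assumption for self-containment; real file objects behave the same way in this respect):

```python
import io

# Stand-in for the open 'inp' file.
f1 = io.StringIO("jui\nhi\n")

first_pass = f1.readlines()   # consumes the whole stream
second_pass = f1.readlines()  # nothing left to read

print(first_pass)
print(second_pass)

# Rewinding makes the contents readable again, but loading them once
# into a set is simpler and avoids re-reading for every line.
f1.seek(0)
third_pass = f1.readlines()
```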

score 0 · Accepted Answer

with open('inp') as inp:
    inp_words = set(line.strip() for line in inp)

with open('example') as example, open('result', 'w') as result:
    for line in example:
        if line.split()[0] in inp_words:
            result.write(line)

score 0 · Accepted Answer

You are looping over each line of the file. Try:

s = l.split()
for line in f1.readlines():
    # Compare against each whitespace-separated word on the line, not each character.
    for p in line.split():
        if s[0] == p:
            print p, 'matches', s[0]

If you want this to run really fast, compile a regular expression from the search strings and search for it across the entire file read in as a single string.
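A rough sketch of that idea in Python 3, assuming the search words should match the first column exactly. The helper name `matching_lines` and the inline sample data are hypothetical, and whether this is actually faster than a set lookup is not established here; it would need to be measured.

```python
import re


def matching_lines(example_text, words):
    """Return the lines of example_text whose first column is one of words."""
    if not words:
        return []
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(w) for w in sorted(words))
    # ^ anchors at each line start (MULTILINE); (?!\S) requires the word to
    # be the whole first column; [^\n]* takes the rest of the line.
    pattern = re.compile(r"^(?:%s)(?!\S)[^\n]*" % alternation, re.MULTILINE)
    return pattern.findall(example_text)


example_text = "hi      wert    123\njui     fgrt    345\nblabla  dfr     233\n"
print(matching_lines(example_text, {"jui", "hi"}))
```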

HTH.

score 0 · Accepted Answer

How about this?

with open('inp') as inf:
    words = inf.read()

with open('example') as inf, open('out.txt', 'w') as outf:
     for line in inf:
         word = line.split()[0]
         if word in words:
             outf.write(line)

Output:

hi wert 123
jui fgrt 345
jui hi
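One caveat with this version, shown as a small sketch: because `words` is a single string rather than a collection of words, `word in words` is a substring test, so a first column such as `ju` would also match. The sample strings below are made up for illustration.

```python
# Contents of 'inp' read in one go, as inf.read() would return them.
words = "jui\nhi\n"

# Substring test: 'ju' is not a line in 'inp', yet it is found inside 'jui'.
substring_hit = "ju" in words

# Splitting into a set of whole words gives exact matching instead.
word_set = set(words.split())
exact_hit = "ju" in word_set

print(substring_hit, exact_hit)
```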

score -1 · Accepted Answer


You could sort the inp file and then try a binary search.
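A minimal sketch of that suggestion using the standard `bisect` module, with inline sample data in place of the files (the helper name `contains` is hypothetical). Each lookup costs O(log n), so a set is usually the simpler choice when the words fit in memory.

```python
import bisect


def contains(sorted_words, word):
    """Binary-search a sorted list for an exact match."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


# Stand-ins for the contents of 'inp' and 'example'.
sorted_words = sorted(["jui", "hi"])
example_lines = [
    "hi      wert    123\n",
    "jui     fgrt    345\n",
    "blabla  dfr     233\n",
]

# Keep the lines whose first column is found by the binary search.
kept = [
    line for line in example_lines
    if line.split() and contains(sorted_words, line.split()[0])
]
print(kept)
```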

answered 2012-05-24T17:20:12.893
