python - Python を使用してリスト内の文字列一致数を検索する

Question

分析のためにテキストファイルを読み取ります。各単語はリストに追加され、ID が与えられます。

#!/usr/bin/python3
with fi as myfile:
  for line in myfile:
    for item in line.split(' '):
      db[0].append(id_+1)
      db[2].append(item)
      ...more stuff

次に、リスト内の各単語を検索して一致するものを見つけ、カウントをsim1として保存します。一致が見つかった場合、次の単語が連続する単語と一致するかどうかをテストし、そのカウントをsim2として保存します。sim3についても同様です。私のコードは次のようになります:

for i in range(id_-3):
  sim1=0
  sim2=0
  sim3=0
  for j in range(id_-3):
    if i==j:  continue;
    if db[2][i] == db[2][j]:
      sim1 += 1
      if db[2][i+1] == db[2][j+1]:
        sim2 += 1
        if db[2][i+2] == db[2][j+2]:
          sim3 += 1
  db[3].append(sim1)
  db[4].append(sim2)
  db[5].append(sim3)

これは機能しますが、遅すぎます。Python の方が高速な検索方法を提供すると思いますが、私はまだ Py の初心者です!

score 2 · Accepted Answer

アルゴリズムの遅さは主に、len(db[2]) 回反復する外部ループ内に含まれる len(db[2]) 回反復する内部ループがあるという事実に起因します。これは、内部コードが len(db[2])^2 回実行されていることを意味します。ファイルが大きく、たとえば 5000 語を解析している場合、コードは 5000^2 = 25,000,000 回実行されます!

したがって、問題を解決するための攻撃の角度は、その内部ループのコストを排除または大幅に削減する方法を見つけることです。以下は、 len(db[2]) を 1 回だけ反復する必要があるソリューションの例です。次に、はるかに小さな項目セットを反復する 2 番目の別のループを実行します。2 番目の反復内にいくつかの内部ループがありますが、それらの実行回数はさらに少なく、コストはほとんど取るに足らないものです。

約 48kb のテキストファイルを使用して、あなたのアルゴリズムと私のアルゴリズムの時間を計りました。あなたのアルゴリズムは私のコンピューターで平均約 14 秒、私のアルゴリズムは平均 0.6 秒でした。したがって、その内側のループを取り除くことで、アルゴリズムは 23 倍以上高速になりました。また、比較をテキストではなく数値間で行うように変更したり、append() の使用を避けるためにストレージ配列を最初からフルサイズで作成したりするなど、その他の小さな最適化もいくつか行いました。Append() を使用すると、インタープリターは必要に応じて配列のサイズを動的に増やしますが、これは遅くなります。

from collections import defaultdict

# Create zero-filled sim1, sim2, sim3 arrays to avoid append() overhead
len_ = len(db[2]) - 2
for _ in range(3):
    db.append([0] * len_)

# Create dictionary, containing d['word'] = [count, [indexes]]
# Do just one full iteration, and make good use of it by calculating
# sim1 (as 'count') and storing an array of number indexes for each word,
# allowing for a very efficient loop coming up...
d = defaultdict(lambda: [0, []])
for index, word in enumerate(db[2]):
    if index < len_:
        # Accumulate sim1
        d[word][0] += 1 
    # Store all db[2] indexes where this word exists
    d[word][1].append(index) 

# Now loop only through words which occur more than once (smaller loop)
for word, (count, indexes) in d.iteritems():
    if count > 1:
        # Place the sim1 values into the db[3] array
        for i in indexes:
            if i < len_:
                db[3][i] = count - 1
                # Look for sim2 matches by using index numbers
                next_word = db[2][i+1]
                for next_word_index in d[next_word][1]:
                    if next_word_index - 1 != i and next_word_index - 1 in indexes:
                        # Accumulate sim2 value in db[4]
                        db[4][i] += 1
                        # Look for sim3 matches
                        third_word = db[2][i+2]
                        if third_word == db[2][next_word_index + 1]:
                            # Accumulate sim3 value in db[5]
                            db[5][i] += 1

score -2 · Accepted Answer

Yep, you're performaing a string compare. That's really slow. What you want is to compile your string as a regular pattern. :)

Have a look onto the libary re from python. Python: re

python - Python を使用してリスト内の文字列一致数を検索する

2 に答える 2

Related

Reference