python - Faster way to remove stop words in Python

Question

I am trying to remove stop words from a string of text:

    from nltk.corpus import stopwords
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

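I am processing about 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I am thinking of using a regex with something like re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I would also be happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') in set(), but it made no difference.

Thank you.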

score 103 · Accepted Answer

Try caching the stop words object, as shown below. Constructing it every time the function is called seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative time  Function
    10000   7.723            words.py:7(testFuncOld)
    10000   0.140            words.py:11(testFuncNew)

So caching the stop word list gives roughly a 55x speedup on this run (7.723 s vs. 0.140 s cumulative).
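The same caching idea can be wrapped up so that the list is loaded at most once, wherever the function is called from. This is only a minimal sketch: `load_stop_words()` is a hypothetical loader standing in for stopwords.words("english"), and it returns a tiny hard-coded set so the example runs without the NLTK data.

```python
from functools import lru_cache

# Counts how often the (hypothetical) loader body actually runs.
load_calls = 0


@lru_cache(maxsize=None)
def load_stop_words():
    """Stand-in for stopwords.words("english"); the body runs only once."""
    global load_calls
    load_calls += 1
    # Assumption: a tiny illustrative set, not the real NLTK list.
    return frozenset({"the", "a", "an", "is", "of"})


def remove_stop_words(text):
    stop_words = load_stop_words()  # served from the cache after the first call
    return " ".join([word for word in text.split() if word not in stop_words])


cleaned = [remove_stop_words(t) for t in ["hello bye the the hi", "the end of a story"]]
print(cleaned)     # ['hello bye hi', 'end story']
print(load_calls)  # 1
```

A module-level variable, as in the answer above, achieves the same thing; the decorator just keeps the loading logic in one place.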

score 5 · Answer

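First, you are rebuilding the stop word list for every string. Build it once. A set is indeed a good fit here, since membership tests on a set take constant time on average.

    forbidden_words = set(stopwords.words('english'))

Then, get rid of the [] inside join. Use a generator instead.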

Replace

    ' '.join([x for x in ['a', 'b', 'c']])

with

    ' '.join(x for x in ['a', 'b', 'c'])
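To see how much each of these two changes helps on a given machine, a small timing sketch along these lines could be used. The stop word list here is a short illustrative stand-in for stopwords.words('english'), and the function names are made up for this example. Timings depend on the Python version and on the size of the real list, so none are quoted.

```python
import timeit

# Assumption: a small illustrative list standing in for stopwords.words('english').
stop_list = ["the", "a", "an", "is", "of", "and", "to", "in", "it", "that"]
stop_set = set(stop_list)  # built once, reused for every string

text = "hello bye the the hi"


def list_lookup_list_join():
    return " ".join([w for w in text.split() if w not in stop_list])


def set_lookup_list_join():
    return " ".join([w for w in text.split() if w not in stop_set])


def set_lookup_generator_join():
    return " ".join(w for w in text.split() if w not in stop_set)


variants = (list_lookup_list_join, set_lookup_list_join, set_lookup_generator_join)

# All variants must produce the same cleaned string.
results = [fn() for fn in variants]

for fn in variants:
    seconds = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Measuring the list and generator forms of join separately is worthwhile, because the difference between them may be small or even go the other way.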

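The next thing to address would be making .split() yield values instead of returning a list. ~~I believe a regex would be a good replacement here.~~ See this thread for why s.split() is actually fast.

Lastly, do a job like this (removing stop words from 6 million strings) in parallel. That is a whole different topic.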
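As a rough illustration of the parallel idea, the strings could be spread over worker processes with the standard library's multiprocessing.Pool. This is a sketch under assumptions, not a tuned solution: `FORBIDDEN_WORDS` is a tiny stand-in for set(stopwords.words('english')), and the `workers` and `chunksize` values are arbitrary placeholders.

```python
from multiprocessing import Pool

# Assumption: a tiny illustrative set standing in for set(stopwords.words('english')).
FORBIDDEN_WORDS = frozenset({"the", "a", "an", "is", "of"})


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in FORBIDDEN_WORDS)


def remove_stop_words_parallel(texts, workers=2, chunksize=1000):
    # Each worker process ends up with its own copy of the set
    # (inherited or re-imported, depending on the start method).
    with Pool(processes=workers) as pool:
        # A larger chunksize cuts inter-process overhead for many short strings.
        return pool.map(remove_stop_words, texts, chunksize=chunksize)


if __name__ == "__main__":
    sample = ["hello bye the the hi", "the end of a story"] * 3
    print(remove_stop_words_parallel(sample))
```

Because every string has to be pickled and sent to a worker, this only pays off when the per-string work outweighs that overhead, so it is worth timing against the single-process version first.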

score 0 · Answer

Try this: it avoids the loop and uses a regex to remove the stop words instead.

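    import re
    from nltk.corpus import stopwords

    # Build the stop word list and compile the pattern once, outside any loop.
    cachedStopWords = stopwords.words("english")
    pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')

    text = 'hello bye the the hi'
    text = pattern.sub('', text)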
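A slightly more defensive variant of the same pattern might look like the sketch below. It assumes a short illustrative list in place of stopwords.words("english"), escapes each word with re.escape, and strips the trailing space that is left when the last word is removed. The `remove_stop_words` helper name is made up for this example.

```python
import re

# Assumption: a small illustrative list standing in for stopwords.words("english").
stop_words = ["the", "a", "an", "is", "of"]

# re.escape guards against stop words that contain regex metacharacters;
# (?:...) groups the alternatives without capturing them.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b\s*")


def remove_stop_words(text):
    # strip() drops the space left behind when the last word is removed.
    return pattern.sub("", text).strip()


print(remove_stop_words("hello bye the the hi"))  # hello bye hi
print(remove_stop_words("hello of the"))          # hello
```

Note that \b treats punctuation such as hyphens and apostrophes as word boundaries, so this can remove part of a token (for example the "a" in "a-b") that the split() version would leave intact.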
