python - この文字列トークン化関数をさらにプロファイリングして最適化するにはどうすればよいですか？

Question

ソースコードを見る方が簡単な場合は、私の長い説明をスキップしてください。

そこで、テキストの文字列をトークン化する関数を作成しました。最も単純なケースでは、のような文字列を受け取り、It's a beautiful morningトークンのリストを返します。前の例では、出力はになります['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning']。

これは、関数の最初の2行で実現されます。

separators = dict.fromkeys(whitespace + punctuation, True)
tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]

ここで注意すべきことは、It'sgetがに分割されること["It", "'", "s"]です。ほとんどの場合、これは問題ではありませんが、問題になる場合もあります。このため、stop_words「トークン化されていない」文字列のセットを取得するkwargを追加しました。例えば：

>>> tokenize("It's a beautiful morning", stop_words=set("It's"))
>>> ["It's", , ' ', 'a', ' ', 'beautiful', ' ', 'morning']

この「トークン化解除」は、トークンのリストを移動するスライディングウィンドウによって機能します。以下のスキーマを検討してください。ウィンドウは次のように表されます[]

Iteration 1:  ['It', "'",] 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 2:  'It', ["'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 3:  'It', "'", ['s', ' ',] 'a', ' ', 'beautiful', ' ', 'morning'

各反復で、ウィンドウに含まれる文字列が結合され、の内容と照合されますstop_words。ウィンドウがトークンリストの最後に到達し、一致するものが見つからない場合、ウィンドウのサイズは1ずつ増加します。したがって、次のようになります。

Iteration 9:  ['It', "'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'

ここに一致するものがあるので、ウィンドウ全体が単一の要素に置き換えられます：そのコンテンツが結合されます。したがって、反復9の終わりに、次のようになります。

"It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning'

ここで、この新しいトークンが隣接するトークンと組み合わされてストップワードを形成する場合に備えて、最初からやり直す必要があります。アルゴリズムはウィンドウサイズを2に戻し、続行します。 ウィンドウサイズがトークンリストの長さに等しい反復の終了時に、プロセス全体が停止します。

この再帰は、私のアルゴリズムの非効率性の原因です。トークン化がほとんどない小さな文字列の場合、非常に高速に動作します。ただし、計算時間は、トークン化解除の数と元の文字列の全長に応じて指数関数的に増加するようです。

関数の完全なソースコードは次のとおりです。

from itertools import groupby, tee, izip
from string import punctuation, whitespace

def tokenize(phrase, stop_words=None):
    separators = dict.fromkeys(whitespace + punctuation, True)
    tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]

    if stop_words:
        assert isinstance(stop_words, set), 'stop_words must be a set'
        window = 2  # Iterating over single tokens is useless
        while window <= len(tokens):
            # "sliding window" over token list
            iters = tee(tokens, window)
            for i, offset in izip(iters, xrange(window)):
                for _ in xrange(offset):
                    next(i, None)

            # Join each window and check if it's in `stop_words`
            for offset, tkgrp in enumerate(izip(*iters)):
                tk = ''.join(tkgrp)
                if tk in stop_words:
                    pre = tokens[0: offset]
                    post = tokens[offset + window + 1::]
                    tokens = pre + [tk] + post
                    window = 1  # will be incremented after breaking from loop
                    break

            window += 1

    return tokens

そして、ここにいくつかの難しい数字があります（いずれにせよ、私ができる最善のことです）。

>>> import cProfile
>>> strn = "it's a beautiful morning."
>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> cProfile.run('tokenize(strn * 100, ignore=ignore)')
cProfile.run('tokenize(strn * 100, ignore=ignore)')
         57534203 function calls in 15.737 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.405   10.405   15.737   15.737 <ipython-input-140-6ef74347708e>:1(tokenize)
        1    0.000    0.000   15.737   15.737 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method fromkeys}
      899    0.037    0.000    0.037    0.000 {itertools.tee}
      900    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   365450    1.459    0.000    1.459    0.000 {method 'join' of 'str' objects}
 57166950    3.836    0.000    3.836    0.000 {next}

このことから、実行時間の大部分は関数のスコープ内で行われていることがわかりました。前述のように、の絶え間ないリセットが非効率の原因であると私は思うwindowが、これをさらに診断する方法がわからない。

私の質問は次のとおりです。

windowこの関数をさらにプロファイリングして、それが実際にそのリセットが長い実行時間の原因であるかどうかを確認するにはどうすればよいですか？
パフォーマンスを向上させるために何ができますか？

よろしくお願いします！

score 1 · Accepted Answer

問題を誤解したかもしれませんが、分割する前に無視された単語を検索するだけで問題が解決するようです。

def tokenize(phrase, stop_words=()):
    stop_words = '|'.join(re.escape(x) + r'\b' for x in stop_words)
    other = '\s+|\w+|[^\s\w]+'
    regex = stop_words + '|' + other if stop_words else other
    return re.findall(regex, phrase)

Michael Andersonが指摘し\bているように、単語の一部が一致しないように追加する必要があります

編集：新しい正規表現は空白を句読点から分離します。

score 0 · Accepted Answer

私は正規表現に投票します！

トークンリストから句読点を除外することを気にしない場合は、次のことができます

import re
text = '''It's a beautiful morning''' 
tokens = re.split(text, ' ')

あなたにあげる

["It's", 'a', 'beautiful', 'morning']

すべての句読点を削除したい場合は、次のことができます

tokens = re.split(r'\W+', text)

戻る

[「イット」、「ス」、「ア」、「ビューティフル」、「モーニング」]

tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)

python - この文字列トークン化関数をさらにプロファイリングして最適化するにはどうすればよいですか？

2 に答える 2

Related

Reference