python - いくつかのワイルドカードを含む大きなリストでのメンバーシップテスト

Question

リストに特別なカテゴリが含まれている場合、そのフレーズが大規模な (650k) リストに含まれているかどうかをテストするにはどうすればよいですか?

たとえば、フレーズ["he", "had", "the", "nerve"]がリストに含まれているかどうかをテストしたいとします。ありますが、その下["he", "had", "!DETERMINER", "nerve"]に"!DETERMINER"は、複数の選択肢を含むワードクラスの名前があります(a, an, the)。私は約 350 の単語クラスを持っていますが、そのうちのいくつかは非常に長いので、1 つ (または複数) の単語クラスを持つリスト内の各項目を列挙することは現実的ではないと思います。

リストをゆっくりと処理する代わりに、これらのフレーズのセットを使用したいのですが、ワードクラスの可変性に対処する方法がわかりません。この比較を毎回何十万回も行う必要があるため、速度は非常に重要です。

score 1 · Accepted Answer

pjwerneck の提案と同様に、ツリー (またはより具体的にはtrie ) を使用してリストを部分的に格納できますが、それを拡張してカテゴリを特別に扱うことができます。

# phrase_trie.py

from collections import defaultdict

CATEGORIES = {"!DETERMINER": set(["a","an","the"]),
              "!VERB": set(["walked","talked","had"])}

def get_category(word):
    for name,words in CATEGORIES.items():
        if word in words:
            return name
    return None

class PhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(PhraseTrie)
        self.categories = defaultdict(PhraseTrie)

    def insert(self, phrase):
        if not phrase: # nothing to insert
            return

        this=phrase[0]
        rest=phrase[1:]

        if this in CATEGORIES: # it's a category name
            self.categories[this].insert(rest)
        else:
            self.children[this].insert(rest)

    def contains(self, phrase):
        if not phrase:
            return True # the empty phrase is in everything

        this=phrase[0]
        rest=phrase[1:]

        test = False

        # the `if not test` are because if the phrase satisfies one of the
        # previous tests we don't need to bother searching more

        # allow search for ["!DETERMINER", "cat"]
        if this in self.categories: 
            test = self.categories[this].contains(rest)

        # the word is literally contained
        if not test and this in self.children:
            test = self.children[this].contains(rest)

        if not test:
            # check for the word being in a category class like "a" in
            # "!DETERMINER"
            cat = get_category(this)
            if cat in self.categories:
                test = self.categories[cat].contains(rest)
        return test

    def __str__(self):
        return '(%s,%s)' % (dict(self.children), dict(self.categories))
    def __repr__(self):
        return str(self)

if __name__ == '__main__':
    words = PhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    words.insert(["he", "had", "the", "evren"])
    words.insert(["she", "!VERB", "the", "nerve"])
    words.insert(["no","categories","here"])

    for phrase in ("he had the nerve",
                   "he had the evren",
                   "she had the nerve",
                   "no categories here",
                   "he didn't have the nerve",
                   "she had the nerve more"):
        print '%25s =>' % phrase, words.contains(phrase.split())

実行中python phrase_trie.py:

         he had the nerve => True
         he had the evren => True
        she had the nerve => True
       no categories here => True
 he didn't have the nerve => False
   she had the nerve more => False

コードに関するいくつかのポイント:

を使用するのdefaultdictは、を呼び出す前にそのサブトライが存在するかどうかを確認する必要がないようにするためinsertです。必要に応じて自動的に作成および初期化されます。
への呼び出しが多い場合はget_category、高速化のために逆引き辞書を作成する価値があるかもしれません。(または、より良い方法として、の呼び出しをメモしてget_category、一般的な単語の検索が高速になるようにしますが、検索しない単語を格納するメモリを無駄にしないようにします。)
このコードは、各単語が 1 つのカテゴリにのみ含まれていることを前提としています。(そうでない場合、唯一の変更は、リストと、このリストをループget_categoryする関連セクションを返すことです。)PhraseTrie

score 0 · Accepted Answer

まず、2 つdictのを作成します。

partOfSpeech = {'a':'!DETERMINER', 'an':'!DETERMINER', 'the':'!DETERMINER'}
words = {'!DETERMINER': set(['a', 'an', 'the'])}

これにより、少なくとも速度が向上するはずです。これによりどれだけ高速化されるか見てみましょう。不十分な場合は、コメントを残してください。改善を試みます (または、SO コミュニティの他の人が私のソリューションを改善したり、より良いソリューションを提供したりできるかもしれません)。

score 0 · Accepted Answer

速度が重要で、単語クラスを処理する必要がある場合は、フレーズと単語クラスのリストを保存する代わりに、子の深さがフレーズ内の位置になるように、単語の木として保存することを検討する必要があります。そうすれば、各レベルを簡単に照会でき、上に移動するにつれて検索が絞り込まれます。単語が見つからない場合は、そのフレーズがリストにないことがわかります。

これは非常に単純な実装であり、例にすぎません。このようにネストされた dict として実装することもできますが、データが大きく動的である場合は、おそらくデータベースを使用する必要があります。

tree = {'he':
        {'had':
         {'the': {'nerve': {}, 'power': {}, 'gift': {}},
          'a': {'car': {}, 'bike': {}},
          'an': {'airplane': {}, 'awful': {'smell': {}}}
          }
         }
        }


def find_phrase(phrase, root):    
    if not phrase:
        return True

    try:
        next = root[phrase[0]]
        return find_phrase(phrase[1:], next)
    except KeyError:
        return False

    return False


assert find_phrase(['he', 'had', 'the', 'nerve'], tree) is True
assert find_phrase(['he', 'had', 'the', 'power'], tree) is True
assert find_phrase(['he', 'had', 'the', 'car'], tree) is False
assert find_phrase(['he', 'had', 'an', 'awful', 'smell'], tree) is True
assert find_phrase(['he', 'had', 'an', 'awful', 'wife'], tree) is False

python - いくつかのワイルドカードを含む大きなリストでのメンバーシップ テスト

3 に答える 3

Related

Reference

python - いくつかのワイルドカードを含む大きなリストでのメンバーシップテスト