python - nltk custom tokenizer and tagger

Question

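Here is my requirement. I want to tokenize and tag a paragraph so that I can achieve the following:

Dates and times in the paragraph should be identified and tagged as DATE and TIME
Known phrases in the paragraph should be identified and tagged as CUSTOM
The remaining content should be tokenized and tagged by nltk's default word_tokenize and pos_tag functions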

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tagged and tokenized as follows, given that the custom phrase is "I am not interested":

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.

score 7 · Accepted Answer

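The proper answer is to compile a large dataset tagged in the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here: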

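import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# Day, then optionally a month and a year, so that every prefix of a date
# ("5th", "5th November", "5th November 2010") matches while tokens are added.
# The "..." stands for the remaining month names, left elided as in the original.
DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []            # tokens of the date candidate collected so far
    date_found = False

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:              # current token ended the date
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                             # token i is re-examined next
            date_found = False
        elif date_found and i == len(tagged) - 1:   # date runs to the end of input
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []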

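Todo: extend the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that contain only an ordinal number.)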
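One way the Todo could be approached is sketched below. This is not the answer's code but a separate, minimal variant under stated assumptions: it works on an already tagged list of (word, tag) pairs, so it needs no nltk models. The function name merge_spans, the max_date_len parameter, the full month list, and the hand-written tags for the tokens inside the date and the phrase are all hypothetical choices made for illustration. At each position it tries the longest span first, and because the month is required, a lone "5th" is never tagged as a date.

```python
import re

# Assumption: full English month names only (the answer's regex elides the list).
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Day with optional ordinal suffix, a month name, and an optional four-digit year.
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)


def merge_spans(tagged, custom_phrases, max_date_len=3):
    """Collapse dates and known phrases in a list of (word, tag) pairs."""
    # Phrases are compared token by token; splitting on whitespace is a
    # simplification that ignores punctuation inside a phrase.
    phrases = {tuple(p.split()) for p in custom_phrases}
    longest = max([max_date_len] + [len(p) for p in phrases])
    words = [w for w, _ in tagged]
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate span first so the longest match wins.
        for n in range(min(longest, len(words) - i), 0, -1):
            span = words[i:i + n]
            if tuple(span) in phrases:
                out.append((" ".join(span), "CUSTOM"))
                break
            if n <= max_date_len and DATE.fullmatch(" ".join(span)):
                out.append((" ".join(span), "DATE"))
                break
        else:
            out.append(tagged[i])   # no span matched: keep the original pair
            n = 1
        i += n
    return out


# Hand-written tags standing in for pos_tag(word_tokenize(s)) output.
tagged = [("They", "PRP"), ("all", "VBP"), ("like", "IN"), ("to", "TO"),
          ("go", "VB"), ("there", "RB"), ("on", "IN"), ("5th", "CD"),
          ("November", "NNP"), ("2010", "CD"), (",", ","), ("but", "CC"),
          ("I", "PRP"), ("am", "VBP"), ("not", "RB"), ("interested", "JJ"),
          (".", ".")]
result = merge_spans(tagged, ["I am not interested"])
```

For this input, result contains ('5th November 2010', 'DATE') and ('I am not interested', 'CUSTOM') as single entries, with every other pair passed through unchanged. Trying every span length at every position is quadratic in the worst case, which is usually acceptable for sentence-sized input.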

score 2 · Accepted Answer

To achieve what you want, you will probably need to do chunking with nltk.RegexpParser.

See: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
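As a rough illustration of that suggestion, the sketch below chunks a date with nltk.RegexpParser. It assumes nltk is installed, and the input is tagged by hand so that no tagger model has to be downloaded; the tags shown (CD for "5th" and "2010", NNP for "November") and the DATE grammar rule are assumptions, and real pos_tag output may differ, in which case the rule would need adjusting.

```python
from nltk import RegexpParser, Tree

# Assumed chunk rule: number, proper noun, number (e.g. "5th November 2010").
grammar = r"""
  DATE: {<CD><NNP><CD>}
"""
parser = RegexpParser(grammar)

# Hand-tagged fragment; the tags are illustrative, not real pos_tag output.
tagged = [("on", "IN"), ("5th", "CD"), ("November", "NNP"),
          ("2010", "CD"), (",", ",")]
tree = parser.parse(tagged)

# Flatten the chunk tree back into (text, tag) pairs, one pair per chunk.
flattened = []
for node in tree:
    if isinstance(node, Tree):
        text = " ".join(word for word, _ in node.leaves())
        flattened.append((text, node.label()))
    else:
        flattened.append(node)
```

Here flattened holds ('5th November 2010', 'DATE') between the untouched ('on', 'IN') and (',', ',') pairs. Because the rule is written over POS tags rather than words, it also groups sequences that are not dates, so a check on the words themselves would still be needed.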
