python - Parsing text to get proper nouns (names and organizations) - python nltk

Question

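I am trying to extract proper nouns, such as personal names and organization names, from very small chunks of text like SMS messages. The basic parsers available with nltk (see "Finding Proper Nouns using NLTK WordNet") can get the nouns, but the problem comes with proper nouns that do not start with a capital letter. In text like the following, a name such as sumit is tagged NN rather than NNP, so it is not recognized as a proper noun, while the capitalized Samit is: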

>>> from nltk import pos_tag
>>> sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night @ around 8 pm last nite"
>>> tagged_sent = pos_tag(sentence.split())
>>> print(tagged_sent)
[('i', 'PRP'), ('spoke', 'VBP'), ('with', 'IN'), ('sumit', 'NN'), ('and', 'CC'), ('rajesh', 'JJ'), ('and', 'CC'), ('Samit', 'NNP'), ('about', 'IN'), ('the', 'DT'), ('gridlock', 'NN'), ('situation', 'NN'), ('last', 'JJ'), ('night', 'NN'), ('@', 'IN'), ('around', 'IN'), ('8', 'CD'), ('pm', 'NN'), ('last', 'JJ'), ('nite', 'NN')]

score 9 · Accepted Answer

There is a better way to extract the names of people and organizations:

from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
from nltk.tree import Tree

# `sentence` is the string from the question
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(sentence)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)

# Named-entity chunks are subtrees; plain tokens are (word, tag) tuples
nes = [' '.join(word for word, tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, Tree)]

However, every Named Entity Recognizer makes errors. If you really do not want to miss any proper names, you can use a dictionary of proper names and check whether each name is in it.
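The dictionary check mentioned above could be sketched roughly as follows. This is only an illustration, not part of NLTK: the `KNOWN_NAMES` set and the `find_known_names` helper are hypothetical names, and a real list of names would have to come from some external source. The lookup is case-insensitive, so lowercase names such as sumit are still found.

```python
# Hypothetical name dictionary (assumption); in practice it would be
# loaded from a list of personal and organization names.
KNOWN_NAMES = {"sumit", "rajesh", "samit"}

# Punctuation stripped from the edges of each token before the lookup.
PUNCTUATION = ".,!?;:\"'()"


def find_known_names(sentence, known_names=KNOWN_NAMES):
    """Return the tokens whose lowercased form appears in known_names."""
    found = []
    for token in sentence.split():
        word = token.strip(PUNCTUATION)
        # Compare in lowercase so capitalization does not matter.
        if word.lower() in known_names:
            found.append(word)
    return found


sentence = ("i spoke with sumit and rajesh and Samit about the gridlock "
            "situation last night @ around 8 pm last nite")
print(find_known_names(sentence))  # ['sumit', 'rajesh', 'Samit']
```

The result could be merged with the output of ne_chunk, so that names the recognizer misses are still picked up by the dictionary.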

score 0 · Accepted Answer

Try this code:

import nltk

def get_entities(qry="who is Mahatma Gandhi"):
    tokens = nltk.tokenize.word_tokenize(qry)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary=False)
    print(sentt)
    person = []
    # t.label() replaces the t.node attribute of older NLTK releases
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf)
    print("person=", person)

You can get the names of persons, organizations, and locations with the help of the ne_chunk() function. Hope it helps. Thanks.
