python - Tagging Spanish words using an NLTK corpus

Question

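I am trying to learn how to tag Spanish words using NLTK.

The nltk book makes it quite easy to tag English words by following its examples. Because I am new to nltk and to language processing in general, I am confused about how to proceed.

I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and saw nothing suggesting that I could. I feel like I am missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus (by "manually" I mean tokenizing my sentence and running it against the corpus), or am I off the mark entirely? Thank you.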

score 19 · Accepted Answer

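First you need to read the tagged sentences from a corpus. NLTK provides a nice interface that spares you from dealing with the different formats of the different corpora: you simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml.

Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.

Then you can use the tagger to tag the sentences you want. Here is some example code: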

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Read the corpus into a list, 
# each entry in the list is one sentence.
cess_sents = cess.tagged_sents()

# Train the unigram tagger
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# Tagger reads a list of tokens.
uni_tag.tag(sentence.split(" "))

# Split corpus into training and testing set.
train = int(len(cess_sents)*90/100) # 90%

# Train a bigram tagger with only training data.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% held out for testing.
# (Newer NLTK releases rename evaluate() to accuracy().)
bi_tag.evaluate(cess_sents[train:])

# Using the tagger.
bi_tag.tag(sentence.split(" "))
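
To make concrete what a unigram tagger learns, here is a minimal pure-Python sketch, not NLTK's implementation: it records the most frequent tag for each word, tags unseen words with a fallback value, and measures accuracy on held-out sentences. The function names, the toy sentences, and the short tags (DET, NOUN, VERB) are made up for illustration and are not the cess_esp tagset.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens, default=None):
    """Tag tokens with the learned table; unseen words get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]


def accuracy(model, gold_sents, default=None):
    """Fraction of gold tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, predicted), (_, gold) in zip(tag(model, words, default), sent):
            total += 1
            correct += predicted == gold
    return correct / total if total else 0.0


# Toy data with hypothetical short tags (not the cess_esp tagset).
toy = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN")],
]

model = train_unigram(toy[:2])                     # train on two sentences
tagged = tag(model, ["el", "gato", "duerme"])      # "duerme" was never seen
held_out_accuracy = accuracy(model, toy[2:], default="NOUN")
```

Without a fallback, an unseen word such as "duerme" comes back tagged None, which is also what NLTK's UnigramTagger returns when no backoff tagger is given. NLTK's n-gram taggers accept a backoff argument for chaining a simpler tagger behind a more specific one.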

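Training a tagger on a large corpus can take a significant amount of time. Instead of training a tagger every time you need one, it is convenient to save the trained tagger to a file for later reuse.

See the "Storing Taggers" section of http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html.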
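
The save-and-reuse idea can be sketched with the standard pickle module. The helper below is a generic illustration, not code from the NLTK book: `load_or_train` and its parameters are hypothetical names, and `train_fn` stands for any zero-argument function that returns a picklable trained tagger.

```python
import os
import pickle


def load_or_train(path, train_fn):
    """Return the pickled object at `path`, training and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as infile:
            return pickle.load(infile)
    model = train_fn()
    with open(path, "wb") as outfile:
        pickle.dump(model, outfile, pickle.HIGHEST_PROTOCOL)
    return model
```

With NLTK, `train_fn` could be something like `lambda: UnigramTagger(cess.tagged_sents())`, so that training runs only on the first call and later calls read the file. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.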

score 7 · Accepted Answer

Building on the tutorial in the previous answer, here is a more object-oriented approach, taken from spaghetti-tagger: https://github.com/alvations/spaghetti-tagger

# -*- coding: utf-8 -*-

from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

try:
    from cPickle import dump, load  # Python 2
except ImportError:
    from pickle import dump, load   # Python 3

def loadtagger(taggerfilename):
    # Load a pickled tagger from disk.
    with open(taggerfilename, 'rb') as infile:
        return load(infile)

def traintag(corpusname, corpus):
    # Function to save tagger (pickled with the highest protocol).
    def savetagger(tagfilename, tagger):
        with open(tagfilename, 'wb') as outfile:
            dump(tagger, outfile, -1)
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print("Tagger trained with " + corpusname +
          " using UnigramTagger and BigramTagger.")

# Function to unchunk corpus: multiword tokens joined with "_" are split
# into separate tokens, and the tags are dropped.
def unchunk(corpus):
    nomwe_corpus = []
    for i in corpus:
        nomwe = " ".join([j[0].replace("_", " ") for j in i])
        nomwe_corpus.append(nomwe.split())
    return nomwe_corpus

class cesstag():
    def __init__(self, mwe=True):
        self.mwe = mwe
        # Train tagger if it's used for the first time.
        try:
            loadtagger('cess_unigram.tagger').tag(['estoy'])
            loadtagger('cess_bigram.tagger').tag(['estoy'])
        except IOError:
            print("*** First-time use of cess tagger ***")
            print("Training tagger ...")
            from nltk.corpus import cess_esp as cess
            cess_sents = cess.tagged_sents()
            traintag('cess', cess_sents)
            # Trains the tagger with no MWE: the unchunked tokens are
            # re-tagged with the unigram tagger saved just above.
            cess_nomwe = unchunk(cess.tagged_sents())
            tagged_cess_nomwe = batch_pos_tag(cess_nomwe)
            traintag('cess_nomwe', tagged_cess_nomwe)
            print("")
        # Load tagger.
        if self.mwe:
            self.uni = loadtagger('cess_unigram.tagger')
            self.bi = loadtagger('cess_bigram.tagger')
        else:
            self.uni = loadtagger('cess_nomwe_unigram.tagger')
            self.bi = loadtagger('cess_nomwe_bigram.tagger')

def pos_tag(tokens, mmwe=True):
    tagger = cesstag(mmwe)
    return tagger.uni.tag(tokens)

def batch_pos_tag(sentences, mmwe=True):
    tagger = cesstag(mmwe)
    # tag_sents() is the NLTK 3 name for batch_tag().
    return tagger.uni.tag_sents(sentences)

tagger = cesstag()
print(tagger.uni.tag('Mi colega me ayuda a programar cosas .'.split()))
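
The role of `unchunk` in the script above is easier to see on toy data. The sketch below restates the function on its own; the sample sentence and its tags are hypothetical, chosen only to have the (word, tag) shape that `tagged_sents()` returns.

```python
def unchunk(corpus):
    """Split underscore-joined multiword tokens and drop the tags."""
    nomwe_corpus = []
    for sent in corpus:
        words = " ".join(word.replace("_", " ") for word, _tag in sent)
        nomwe_corpus.append(words.split())
    return nomwe_corpus


# Hypothetical tagged sentence containing one multiword token.
toy_tagged = [[("a_pesar_de", "sps00"), ("todo", "pi0ms000"), (".", "Fp")]]
tokens = unchunk(toy_tagged)
```

Because the result is a list of plain token lists with no tags, the script has to tag these sentences again (via `batch_pos_tag`) before it can train the "nomwe" taggers on them.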

score 1 · Accepted Answer

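The following script gives a quick way to get a "bag of words" from Spanish sentences. Note that to do this properly you should tokenize the sentences before tagging, so that 'religiosas.' is split into two tokens, 'religiosas' and '.'.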

#-*- coding: utf8 -*-

# about the tagger: http://nlp.stanford.edu/software/tagger.shtml 
# about the tagset: nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

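import nltk

try:
    # Newer NLTK releases name the class StanfordPOSTagger.
    from nltk.tag.stanford import StanfordPOSTagger as POSTagger
except ImportError:
    from nltk.tag.stanford import POSTagger

spanish_postagger = POSTagger('models/spanish.tagger', 'stanford-postagger.jar', encoding='utf8')

def isNoun(tag):
    # Assumed helper: in this tagset noun tags start with "n"
    # (nc = common noun, np = proper noun).
    return tag.startswith('n')

sentences = ['El copal se usa principalmente para sahumar en distintas ocasiones como lo son las fiestas religiosas.','Las flores, hojas y frutos se usan para aliviar la tos y también se emplea como sedante.']

for sent in sentences:

    # Whitespace splitting leaves punctuation attached to the words.
    words = sent.split()
    tagged_words = spanish_postagger.tag(words)

    nouns = []

    for (word, tag) in tagged_words:

        print(word + ' ' + tag)
        if isNoun(tag): nouns.append(word)

    print(nouns)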

This gives:

El da0000
copal nc0s000
se p0000000
usa vmip000
principalmente rg
para sp000
sahumar vmn0000
en sp000
distintas di0000
ocasiones nc0p000
como cs
lo pp000000
son vsip000
las da0000
fiestas nc0p000
religiosas. np00000
[u'copal', u'ocasiones', u'fiestas', u'religiosas.']
Las da0000
flores, np00000
hojas nc0p000
y cc
frutos nc0p000
se p0000000
usan vmip000
para sp000
aliviar vmn0000
la da0000
tos nc0s000
y cc
también rg
se p0000000
emplea vmip000
como cs
sedante. nc0s000
[u'flores,', u'hojas', u'frutos', u'tos', u'sedante.']
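
In the output above, 'religiosas.' and 'flores,' keep their punctuation and are tagged as proper nouns (np00000). A minimal regex-based tokenizer, sketched here with only the standard library, separates punctuation before tagging. The names `TOKEN_RE` and `simple_tokenize` are hypothetical, and the pattern is a rough approximation rather than a full Spanish tokenizer.

```python
import re

# \w+ matches runs of letters and digits (Unicode-aware in Python 3, so
# "también" stays whole); [^\w\s] matches one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(sentence)


tokens = simple_tokenize("Las flores, hojas y frutos se usan para aliviar la tos.")
```

Passing `simple_tokenize(sent)` to the tagger instead of `sent.split()` lets it see 'flores' and ',' as separate tokens. NLTK's own tokenizers are likely the more robust choice for real text.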
