python - Stemming and lemmatization in Python

Translated from: https://stackoverflow.com/questions/24837811 2014-07-19T07:17:59.183

Viewed 1418 times

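I have checked all the other threads and tried several of the solutions. I am having trouble with the Porter stemmer. I am trying to strip affixes, but the Porter stemmer reduces words to some odd-looking forms.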

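I need to find the sentences that contain certain words, and I am using TextBlob for that. Below is the code I am using. I pulled the text from this link: http://www.nltk.org/book/ch03.html. I then ran PorterStemmer and WordNetLemmatizer over the words. WordNetLemmatizer only reduces plurals to their singular form.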

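import mechanize
import nltk
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# Hypothetical search terms; the snippet does not show how search_words is defined
search_words = set(['stem', 'lemma'])
lmtzr = WordNetLemmatizer()

url = 'http://www.nltk.org/book/ch03.html'
# Fetch the page with a browser-like User-agent, ignoring robots.txt
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Chrome')]
html = br.open(url).read()
raw = nltk.clean_html(html)  # available in NLTK 2.x only; removed in NLTK 3
tokens = nltk.wordpunct_tokenize(raw)
# Lemmatize each token; the default part of speech is noun
lemmas = [lmtzr.lemmatize(tok) for tok in tokens]
text = nltk.Text(lemmas)
sents = ' '.join([s.lower() for s in text])
blob = TextBlob(sents)
# Keep the sentences that share at least one word with search_words
matches = [str(s) for s in blob.sentences if search_words & set(s.words)]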
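
One possible way to make the sentence search tolerant of odd-looking stems is to normalize both the search words and the sentence words with the same function, and to return the original sentences rather than the stemmed text. The sketch below uses only the standard library. `toy_stem`, `find_sentences` and `sample` are hypothetical names, and `toy_stem` is a deliberately crude suffix stripper, not the Porter algorithm. In practice something like NLTK's `PorterStemmer().stem` could presumably be passed as `normalize` instead.

```python
import re


def toy_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): lowercase, drop a few endings."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def find_sentences(text, search_words, normalize=toy_stem):
    """Return the original sentences containing any search word after normalization."""
    keys = {normalize(w) for w in search_words}
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+", sentence)
        # Stems are used only as matching keys, so they need not be real words
        if keys & {normalize(w) for w in words}:
            hits.append(sentence)
    return hits


sample = "She searches the corpus. The cat sat. He is searching for words."
print(find_sentences(sample, ["search"]))
# ['She searches the corpus.', 'He is searching for words.']
```

Note that `toy_stem("corpus")` gives `"corpu"`, which is not a dictionary word. That does not affect the search, because the stem only serves as a key and the untouched sentence is what gets returned.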

