You don't need a Python wrapper for a Java library; nltk has Snowball. :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
Stemming won't always give you exact roots, but it's a great starting point.
Here is an example of using stems. I build a dictionary of stem: (word, count),
choosing the shortest word seen for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}
Code example (text taken from http://en.wikipedia.org/wiki/Dance):
import re
from nltk.stem import SnowballStemmer as SS
text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
# extract words, lowercased
words = [word.lower() for word in re.findall(r'\w+', text)]
stemmer = SS('english')
counts = dict()
# count stems and keep the shortest word seen for each stem
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest, count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest, count + 1)
    else:
        counts[stem] = (word, 1)
# convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root, wordcount in counts.items()]
# trick to sort output by count (descending) and word (alphabetically)
output.sort(key=lambda x: (-x[1], x[0]))
for item in output:
    # a single parenthesized argument prints the same on Python 2 and 3
    print('%s:%d (Root: %s)' % item)
Output:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
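The counting loop above can also be packaged as a function that takes the stemming function as an argument. The sketch below is only an illustration: `count_by_stem`, `KNOWN_STEMS` and `lookup_stem` are hypothetical names, and the lookup table stands in for `stemmer.stem` so the snippet runs without nltk. Its stems are copied from the interpreter session above, and 'dances' is assumed to stem to 'danc', as the dictionary example implies.

```python
def count_by_stem(words, stem):
    """Map each stem to (shortest word seen, number of occurrences)."""
    counts = {}
    for word in words:
        key = stem(word)
        if key in counts:
            shortest, count = counts[key]
            # Keep the shortest surface form as the display word.
            if len(word) < len(shortest):
                shortest = word
            counts[key] = (shortest, count + 1)
        else:
            counts[key] = (word, 1)
    return counts


# Stand-in for stemmer.stem (assumption, see the note above).
KNOWN_STEMS = {'dance': 'danc', 'danced': 'danc', 'dancing': 'danc',
               'dances': 'danc', 'dancer': 'dancer'}


def lookup_stem(word):
    # Unknown words are returned unchanged.
    return KNOWN_STEMS.get(word, word)


result = count_by_stem(['dancing', 'danced', 'dances', 'dance', 'dancer'],
                       lookup_stem)
print(result == {'danc': ('dance', 4), 'dancer': ('dancer', 1)})  # True
```

With nltk installed, passing `SS('english').stem` instead of `lookup_stem` should give the same grouping for these five words.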
For your specific needs, I wouldn't recommend lemmatization:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
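I wouldn't recommend substrings, because they will always fail at some point, and often fail miserably.
- fixed length: the pseudo-words 'dancitization' and 'dancendence' match 'dance' at 4 and 5 characters respectively.
- ratio: a low ratio returns false matches (like the ones above)
- ratio: a high ratio doesn't match enough (e.g. 'running')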
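To make the failure modes in the list above concrete, here is a rough sketch of both substring approaches. The helper names `prefix_match` and `ratio_match` are hypothetical, and the ratio is assumed to mean "shared prefix length divided by the length of the longer word"; other definitions are possible and would shift the numbers, though probably not the trade-off.

```python
def prefix_match(a, b, n):
    """Fixed length: two words match if their first n characters agree."""
    return a[:n] == b[:n]


def ratio_match(a, b, ratio):
    """Ratio: the shared prefix must cover at least `ratio` of the longer word.

    This definition of the ratio is an assumption made for illustration.
    """
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared >= ratio * max(len(a), len(b))


# Fixed length: both pseudo-words are accepted as matches for 'dance'.
print(prefix_match('dance', 'dancitization', 4))  # True
print(prefix_match('dance', 'dancendence', 5))    # True

# Low ratio: 'run'/'running' matches, but so does the fake 'dancendence'.
print(ratio_match('run', 'running', 0.4))         # True
print(ratio_match('dance', 'dancendence', 0.4))   # True

# Higher ratio: the fake is rejected, but 'run'/'running' is lost as well.
print(ratio_match('dance', 'dancendence', 0.5))   # False
print(ratio_match('run', 'running', 0.5))         # False
```

Under this assumed definition, no threshold accepts 'run'/'running' while rejecting 'dance'/'dancendence', which is the trade-off the list describes.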
But with stemming you get:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
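That is an impressive no-match result for the stem 'danc': none of the made-up words stems to it. Even considering that 'dancer' doesn't stem to 'danc', the accuracy is pretty high overall.
I hope this helps you get started.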