python - Probability count/frequency of related words?

Question

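I'm looking for a way to generate a numerical probability value for single words that share a common root word/meaning.

Users will generate content using words like "dancer", "dancing", and "danced".

If "dancer" is submitted 30 times and "dancing" 5 times, I need just one value, "dance:35", that catches all of them.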

But when users also submit words like "accordance", it should not affect my "dance" count. Instead it should add to a separate count, along with words like "according" and "accordingly".

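Also, I don't have a predefined list of root words to look out for. I'll need to create it dynamically based on the user-generated content.

So my question is: what is probably the best way to pull this off? I'm sure there is no perfect solution, but I figure someone here can come up with a better way than I can.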

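My idea so far is to assume that most meaningful words have at least 3 or 4 letters. So for every word I encounter that is longer than 4 letters, I trim it down to 4 ("dancer" becomes "danc") and check my list of words to see whether I've encountered it before. If so, I increment its count; if not, I add it to the list. Repeat.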

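I see there are some similar questions on here, but I haven't found an answer that both considers roots and can be implemented in Python. The answers seem to cover one or the other.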

score 3 · Accepted Answer

You don't need a Python wrapper for a Java library; nltk has Snowball. :)

>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'

Stemming won't always give you exact roots, but it's a great start.

The following is an example of using stems. I'm building a dictionary of stem: (word, count), choosing the shortest word possible for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}

Code example (text taken from http://en.wikipedia.org/wiki/Dance):

import re
from nltk.stem import SnowballStemmer as SS

text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]

stemmer = SS('english')
counts = dict()

#count stems and extract shortest words possible
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest,count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest,count+1)
    else:
        counts[stem]=(word,1)

#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root,wordcount in counts.items()]
#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1],x[0]))
for item in output:
    print('%s:%d (Root: %s)' % item)

Output:

dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
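
If you want to reuse the counting loop above, here is a minimal sketch that wraps it in a function. The names count_by_stem, min_length, and toy_stem are my own and purely illustrative. The function accepts any callable as the stemmer, so you would normally pass SnowballStemmer('english').stem; toy_stem is only a crude stand-in so the snippet runs without nltk. The min_length filter is an assumption borrowed from the question's idea that meaningful words have at least 3 or 4 letters.

```python
def count_by_stem(words, stem, min_length=3):
    """Return {stem: (shortest_word, count)} for words of at least min_length letters."""
    counts = {}
    for word in words:
        word = word.lower()
        if len(word) < min_length:
            continue  # skip very short words such as "of" or "as"
        key = stem(word)
        shortest, count = counts.get(key, (word, 0))
        if len(word) < len(shortest):
            shortest = word
        counts[key] = (shortest, count + 1)
    return counts


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ["dancing"] * 5 + ["danced"] * 2 + ["dance"] * 3 + ["of"] * 4
result = count_by_stem(words, toy_stem)
for key, (word, count) in sorted(result.items()):
    print('%s:%d (Root: %s)' % (word, count, key))
```

With a real stemmer plugged in, the grouping follows whatever that stemmer returns, so "dancer" would still be counted separately from "danc".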

I don't recommend lemmatization for your specific needs:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'

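Substrings are not recommended because they will always fail at a certain point, and often fail miserably.

fixed length: the pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
ratio: a low ratio will return false matches (like the ones above)
ratio: a high ratio will not match enough (e.g. 'running')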
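
To make the fixed-length failure concrete, here is a small sketch of the truncate-to-4 idea from the question. The helper name prefix_counts and the word list are my own, chosen only for illustration; "account" is added as an extra unrelated word.

```python
from collections import Counter


def prefix_counts(words, length=4):
    """Count words grouped by their first `length` characters (hypothetical helper)."""
    return Counter(word.lower()[:length] for word in words)


words = ["dance", "dancer", "dancing", "dancitization", "dancendence",
         "accordance", "according", "account"]
counts = prefix_counts(words)
for prefix, count in counts.most_common():
    print('%s:%d' % (prefix, count))
# danc:5
# acco:3
```

Both pseudo-words land in the "danc" bucket, and "account" is lumped together with "accordance" and "according".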

But with stemming you get:

>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'

That's an impressive no-match result for the stem "danc". Even considering that "dancer" does not stem to "danc", the accuracy is pretty high overall.

I hope this helps you get started.

score 3 · Accepted Answer

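What you're looking for is also known as the stems of the words (a more technical notion than the linguistic "root"). You're right in assuming that there is no perfect solution: every approach will either have imperfect analysis or lack coverage. Basically, the best way to go is with a word list that contains stems, or with a stemming algorithm. Check the first answer here to get a Python-based solution:

How do I do word Stemming or Lemmatization?

I am using Snowball in all my Java-based projects and it works perfectly for my purposes (it is very fast, too, and covers a wide range of languages). It also seems to have a Python wrapper: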

http://snowball.tartarus.org/download.php
