python - Symmetric word matrix in Python using nltk

Question

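I am trying to create a symmetric word matrix from a text document.

Example: text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

I have tokenized the text document using nltk. Now I want to count how many times each word appears in the same sentence as the other words. From the text above, I would like to create the following matrix: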

        Barbara good    friends Benny   bad
Barbara 2       1       1       1       0
good    1       1       0       0       0
friends 1       0       1       1       0
Benny   1       0       1       2       1
bad     0       0       0       1       1

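Note that the diagonal is the frequency of the word, since Barbara appears in a sentence together with Barbara as often as there are Barbaras. I would prefer not to overcount the words, but that is not a big issue if avoiding it makes the code too complicated.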

score 7 · Accepted Answer

First, tokenize the text, iterate over each sentence, iterate over all pairwise combinations of the words in each sentence, and store the counts in a nested dict.

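from collections import defaultdict

import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

# The tokenizers need nltk's Punkt tokenizer data (see nltk.download).
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

# Nested dict: sparse_matrix[word1][word2] holds the co-occurrence count.
sparse_matrix = defaultdict(lambda: defaultdict(lambda: 0))

for sent in sent_tokenize(text):
    words = word_tokenize(sent)
    # Count every ordered pair of tokens in the sentence, including (w, w).
    for word1 in words:
        for word2 in words:
            sparse_matrix[word1][word2] += 1

print(sparse_matrix)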
>> defaultdict(<function <lambda> at 0x7f46bc3587d0>, {
'good': defaultdict(<function <lambda> at 0x3504320>, 
    {'is': 1, 'good': 1, 'Barbara': 1, '.': 1}), 
'friends': defaultdict(<function <lambda> at 0x3504410>, 
    {'friends': 1, 'is': 1, 'Benny': 1, '.': 1, 'Barbara': 1, 'with': 1}), etc..

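This is essentially like a matrix, in that we can index sparse_matrix['good']['Barbara'] and get the number 1, or index sparse_matrix['bad']['Barbara'] and get 0. However, we are not actually storing counts for words that never co-occurred: the 0 is generated by the defaultdict only when you ask for it. This can save a lot of memory in this kind of work. If you need a dense matrix for linear algebra or some other computational reason, you can get it like this: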

# Give each word a fixed row/column index (sorted for reproducibility).
# Using hash(word) % size instead can map different words to the same index.
vocab = sorted(sparse_matrix)
word_to_index = {word: i for i, word in enumerate(vocab)}
lexicon_size = len(vocab)
dense_matrix = np.zeros((lexicon_size, lexicon_size))

for word1, row in sparse_matrix.items():
    for word2, count in row.items():
        dense_matrix[word_to_index[word1]][word_to_index[word2]] = count

print(vocab)
print(dense_matrix)
>>
['.', 'Barbara', 'Benny', 'bad', 'friends', 'good', 'is', 'with']
[[3. 2. 2. 1. 1. 1. 3. 1.]
 [2. 2. 1. 0. 1. 1. 2. 1.]
 [2. 1. 2. 1. 1. 0. 2. 1.]
 [1. 0. 1. 1. 0. 0. 1. 0.]
 [1. 1. 1. 0. 1. 0. 1. 1.]
 [1. 1. 0. 0. 0. 1. 1. 0.]
 [3. 2. 2. 1. 1. 1. 3. 1.]
 [1. 1. 1. 0. 1. 0. 1. 1.]]

For other ways to deal with matrix sparsity, I recommend looking at http://docs.scipy.org/doc/scipy/reference/sparse.html.
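
As a lightweight alternative that needs only the standard library, the pair counts can also be kept in a single collections.Counter keyed by (word1, word2) tuples. The sketch below is an illustration rather than part of the answer above: it assumes a simple regular-expression tokenizer in place of nltk, and the helper name cooccurrence_counts is made up for this example. It counts each word at most once per sentence, so the diagonal holds the number of sentences that contain the word.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(text):
    """Count, for every ordered word pair, the sentences containing both words.

    Assumption: sentences end with '.', '!' or '?', and words are runs of
    letters, digits or underscores. This stands in for the nltk tokenizers.
    """
    counts = Counter()
    for sentence in re.split(r"[.!?]", text):
        # A set keeps each word once per sentence, which avoids overcounting.
        words = set(re.findall(r"\w+", sentence))
        for pair in product(words, repeat=2):
            counts[pair] += 1
    return counts


text = "Barbara is good. Barbara is friends with Benny. Benny is bad."
counts = cooccurrence_counts(text)

print(counts[("Barbara", "Barbara")])  # 2
print(counts[("Barbara", "Benny")])    # 1
print(counts[("bad", "Barbara")])      # 0

# Unlike a defaultdict, a Counter does not store a key on a failed lookup.
print(("bad", "Barbara") in counts)    # False
```

Because a missing pair is reported as 0 without being inserted, repeated lookups do not grow the structure.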

score 3 · Accepted Answer

I would first set up something like the following. You may want to add some kind of tokenization, although none was needed for your example.

text = """Barbara is good. Barbara is friends with Benny. Benny is bad."""
allwords = text.replace('.','').split(' ')
word_to_index = {}
index_to_word = {}
index = 0
for word in allwords:
    if word not in word_to_index:
        word_to_index[word] = index
        index_to_word[index] = word
        index += 1
word_count = index

>>> index_to_word
{0: 'Barbara',
 1: 'is',
 2: 'good',
 3: 'friends',
 4: 'with',
 5: 'Benny',
 6: 'bad'}

>>> word_to_index
{'Barbara': 0,
 'Benny': 5,
 'bad': 6,
 'friends': 3,
 'good': 2,
 'is': 1,
 'with': 4}

Then declare a matrix of the appropriate size (word_count x word_count), possibly using numpy like this:

import numpy
matrix = numpy.zeros((word_count, word_count))

Or simply a nested list:

matrix = [None,]*word_count
for i in range(word_count):
    matrix[i] = [0,]*word_count

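Note that this is tricky: something like matrix = [[0]*word_count]*word_count will not work, because it creates a list of 7 references to the same inner list (for example, if you try that code and then run matrix[0][1] = 1, you will find that matrix[1][1], matrix[2][1], etc. have also changed to 1).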
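
The aliasing problem can be seen directly in a small sketch. Here word_count is set to 3 only to keep the output short, and the names aliased and matrix are chosen for illustration. A list comprehension is a common alternative to the explicit loop above, since it builds a fresh inner list for each row.

```python
word_count = 3

# Pitfall: the outer * copies references, so every row is the same list.
aliased = [[0] * word_count] * word_count
aliased[0][1] = 1
print(aliased)                   # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(aliased[0] is aliased[1])  # True

# A list comprehension creates an independent inner list per row.
matrix = [[0] * word_count for _ in range(word_count)]
matrix[0][1] = 1
print(matrix)                    # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
```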

Then you just need to iterate over the sentences.

sentences = text.split('.')
for sent in sentences:
    for word1 in sent.split(' '):
        if word1 not in word_to_index:
            continue
        for word2 in sent.split(' '):
            if word2 not in word_to_index:
                continue
            matrix[word_to_index[word1]][word_to_index[word2]] += 1

Then you end up with:

>>> matrix

[[2, 2, 1, 1, 1, 1, 0],
 [2, 3, 1, 1, 1, 2, 1],
 [1, 1, 1, 0, 0, 0, 0],
 [1, 1, 0, 1, 1, 1, 0],
 [1, 1, 0, 1, 1, 1, 0],
 [1, 2, 0, 1, 1, 2, 1],
 [0, 1, 0, 0, 0, 1, 1]]

Or, if you are curious about, say, how often 'Benny' and 'bad' occur together, you can look up matrix[word_to_index['Benny']][word_to_index['bad']].
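
To get closer to the table in the question, which leaves out words such as 'is' and 'with', the same approach can be restricted to a chosen vocabulary. This is only a sketch under that assumption: the vocab list below is picked by hand for the sample text, and each word is counted at most once per sentence so that the diagonal holds the number of sentences containing the word.

```python
text = "Barbara is good. Barbara is friends with Benny. Benny is bad."

# Assumption: only these words are wanted as rows and columns.
vocab = ["Barbara", "good", "friends", "Benny", "bad"]
word_to_index = {word: i for i, word in enumerate(vocab)}
matrix = [[0] * len(vocab) for _ in vocab]

for sent in text.split("."):
    # Keep each vocabulary word once per sentence to avoid overcounting.
    present = {w for w in sent.split() if w in word_to_index}
    for word1 in present:
        for word2 in present:
            matrix[word_to_index[word1]][word_to_index[word2]] += 1

# Print a labelled table, one row per vocabulary word.
print(" " * 8 + "".join(f"{w:<8}" for w in vocab))
for word, row in zip(vocab, matrix):
    print(f"{word:<8}" + "".join(f"{n:<8}" for n in row))
```

Every pair is visited in both orders, so the resulting matrix is symmetric by construction.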
