python - Python NLTK collocations for tagged text

Question

I don't know whether this is possible, but I thought I would ask just in case. Suppose I have an example dataset in the form "body | tags":

"I went to the store and bought some bread" | shopping food

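I am wondering whether there is a way to use NLTK collocations to count how many times a body word and a tag word co-occur in the dataset. One example would be ("bread","food",598), where "bread" is the body word, "food" is the tag word, and 598 is the number of times they co-occur in the dataset.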

score 0 · Accepted Answer

You can do this without using NLTK:

from collections import Counter
from itertools import product

documents = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

# Keep only the body of each line (the text before the '|'), without quotes.
bodies = [line.split('|')[0].strip('" ') for line in documents.split('\n')]

collocations = Counter()

for body in bodies:
    tokens = body.split()
    # Get all ordered word pairs in the sentence with product().
    # NOTE: this also pairs every token with itself, so we subtract
    #       one (w, w) count per token occurrence.
    pairs = Counter(product(tokens, tokens)) - Counter((w, w) for w in tokens)
    collocations += pairs


for pair, count in collocations.items():
    print(pair, count)
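The snippet above only looks at the body text and discards the tags. For the body-word/tag-word counts the question asks about, a minimal sketch could look like the following. It assumes one record per line, a single '|' separating body from tags, and space-separated tags; the sample lines and the helper name body_tag_counts are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data in the question's "body | tags" format.
lines = [
    '"I went to the store and bought some bread" | shopping food',
    '"fresh bread and cheese" | food',
]

def body_tag_counts(lines):
    """Count how often each (body word, tag word) pair shares a line."""
    counts = Counter()
    for line in lines:
        body, _, tags = line.partition('|')
        # set() counts each pair at most once per line; drop it to
        # count every occurrence of a repeated body word instead.
        body_words = set(body.strip().strip('"').lower().split())
        tag_words = set(tags.split())
        for word in body_words:
            for tag in tag_words:
                counts[(word, tag)] += 1
    return counts

counts = body_tag_counts(lines)
for (word, tag), n in counts.most_common(3):
    print((word, tag, n))
```

Each printed tuple has the ("bread","food",598) shape from the question. Whether a word repeated within one body should count once or several times is a choice you have to make for your data.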

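You will then run into the question of how to count collocations of the same word within a sentence, for example:

bar bar black sheep is not a real sheep

What is the collocation count of ('bar','bar')? Is it 2 or 1? The product-based code above returns 2 for this sentence, because the first bar is collocated with the second bar and the second bar is collocated with the first.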
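If you would rather count each pair of positions once, so that ('bar','bar') comes out as 1, one option is itertools.combinations instead of product. This is a sketch of that alternative convention under the assumption that word order within a pair does not matter; the variable name unordered is just for illustration.

```python
from collections import Counter
from itertools import combinations

tokens = "bar bar black sheep is not a real sheep".split()

# Each pair of positions is visited exactly once, so two occurrences
# of "bar" yield a single ('bar', 'bar') pair. Sorting each pair makes
# (a, b) and (b, a) share the same key.
unordered = Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))

print(unordered[('bar', 'bar')])      # 1
print(unordered[('black', 'sheep')])  # 2, since "sheep" appears twice
```

With this convention no self-pair subtraction is needed, because combinations never pairs a position with itself.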
