python - Computing N-grams using Python

Question

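I needed to compute the unigrams, bigrams, and trigrams for a text file containing text like the following:

"Cystic fibrosis affects 30,000 children and young adults in the US alone. Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code: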

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list

# test code
for i in range(5):
    print(N_Gram(i+1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in nList:
    print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces n-grams of the characters within a word, whereas I want n-grams between words, such as CYSTIC and FIBROSIS, or CYSTIC FIBROSIS. Can someone help me work out how to do this?

score 40 · Accepted Answer

Assuming the input is a string of space-separated words, such as x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want the n-grams joined back into strings, you can call it like this:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Finally, this does not summarize the n-grams into totals, so for an input such as 'a a a a' you need to count them up in a dict:

text = 'a a a a'
grams = {}  # maps each n-gram string to its count
for g in (' '.join(x) for x in ngrams(text, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1

Putting it all together into one final function gives:

def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
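
For the original question (unigram, bigram, and trigram counts over running text), the same counting idea can be sketched with collections.Counter from the standard library. The regex tokenizer, the lowercasing, and the name count_ngrams are assumptions made for illustration; they are not part of the answer above.

```python
import re
from collections import Counter

def count_ngrams(text, n):
    """Count word-level n-grams in text (hypothetical helper)."""
    # Assumption: a token is a run of letters, digits, or apostrophes, lowercased.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n tokens across the list and count each joined window.
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "Cystic fibrosis affects children. Cystic fibrosis sufferers cough."
for n in (1, 2, 3):
    print(n, count_ngrams(sample, n).most_common(2))
```

Note that this sketch lets n-grams run across sentence boundaries (for example "children cystic"); splitting into sentences first would avoid that if it matters for your use.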

score 27 · Accepted Answer

Use NLTK (the Natural Language Toolkit): use its functions to tokenize (split) your text into a list of words, then find the bigrams and trigrams.

import nltk
words = nltk.word_tokenize(my_text)        # my_text holds the text to analyse
my_bigrams = list(nltk.bigrams(words))     # list() because NLTK 3 returns a generator
my_trigrams = list(nltk.trigrams(words))

score 11 · Accepted Answer

There is one more interesting module for Python, scikit-learn. The code below retrieves all the n-grams in a given range of sizes.

from sklearn.feature_extraction.text import CountVectorizer
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))

The output is:

['this', 'is', 'foo', 'bar', 'sentences', 'and', 'want', 'to', 'ngramize', 'it', 'this is', 'is foo', 'foo bar', 'bar sentences', 'sentences and', 'and want', 'want to', 'to ngramize', 'ngramize it', 'this is foo', 'is foo bar', 'foo bar sentences', 'bar sentences and', 'sentences and want', 'and want to', 'want to ngramize', 'to ngramize it', 'this is foo bar', 'is foo bar sentences', 'foo bar sentences and', 'bar sentences and want', 'sentences and want to', 'and want to ngramize', 'want to ngramize it', 'this is foo bar sentences', 'is foo bar sentences and', 'foo bar sentences and want', 'bar sentences and want to', 'sentences and want to ngramize', 'and want to ngramize it', 'this is foo bar sentences and', 'is foo bar sentences and want', 'foo bar sentences and want to', 'bar sentences and want to ngramize', 'sentences and want to ngramize it']

This gives all the n-grams for sizes 1 through 6. It uses the CountVectorizer class; the original answer linked to its documentation.
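
Notice that the output above contains neither 'a' nor 'i': CountVectorizer's default token pattern keeps only tokens of two or more characters, and the text is lowercased first. A rough standard-library sketch of the same range-of-sizes idea, assuming those two defaults, might look like this. The name word_ngrams is hypothetical, and this is not scikit-learn's implementation.

```python
import re

def word_ngrams(text, min_n, max_n):
    """Return all word n-grams for min_n <= n <= max_n (hypothetical helper)."""
    # Assumption: mimic CountVectorizer's defaults, i.e. lowercase the text and
    # keep only tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    # Emit all n-grams of the smallest size first, then the next size, and so on.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

text = "this is a foo bar sentences and i want to ngramize it"
grams = word_ngrams(text, 1, 6)
print(len(grams))
print(grams[:3], grams[-1])
```

If single-character words matter for your counts, the tokenizing regex is the place to change it.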

score 3 · Accepted Answer

Using collections.deque:

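from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    # prime the window with the first n words
    window = deque(islice(it, n), maxlen=n)
    if len(window) < n:
        return  # fewer than n words: there are no n-grams to yield
    yield tuple(window)
    for item in it:
        # appending to a full deque drops the oldest word automatically
        window.append(item)
        yield tuple(window)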

...or you can do it in one line as a list comprehension:

n = 2
message = "Hello, how are you?".split()
# + 1 so that the final n-gram is included
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
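
Another common idiom avoids the index arithmetic altogether by zipping the token list against shifted copies of itself; zip stops at the shortest slice, so the bounds take care of themselves. This is a minimal sketch, and the name zip_ngrams is hypothetical.

```python
def zip_ngrams(tokens, n):
    """Return n-grams as tuples by zipping n shifted views of tokens."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] line up so that each zipped
    # tuple holds n consecutive items; zip stops at the shortest slice.
    return list(zip(*(tokens[i:] for i in range(n))))

message = "Hello, how are you?".split()
print(zip_ngrams(message, 2))
# [('Hello,', 'how'), ('how', 'are'), ('are', 'you?')]
```

The slices copy the list n times, so for very long inputs the deque-based generator is lighter on memory.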

score 2 · Accepted Answer

nltk has native support for n-grams.

'n' is the n-gram size, e.g. n=3 gives trigrams.

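from nltk import ngrams

def ngramize(texts, n):
    # texts: an iterable of token sequences (e.g. lists of words);
    # passing raw strings would yield character n-grams instead
    output = []
    for text in texts:
        output.extend(ngrams(text, n))
    return output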
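
One thing to watch with a helper like this: a function that slides a window over an arbitrary sequence yields character n-grams when handed a raw string and word n-grams when handed a list of tokens, which is the same issue the question ran into. The sketch below illustrates the difference with a plain-Python stand-in; sliding_ngrams is a hypothetical name and is not part of NLTK.

```python
def sliding_ngrams(sequence, n):
    """Yield each window of n consecutive items as a tuple (stand-in, not NLTK)."""
    for i in range(len(sequence) - n + 1):
        yield tuple(sequence[i:i + n])

text = "cystic fibrosis affects children"
print(list(sliding_ngrams(text, 2))[:3])      # windows over characters
print(list(sliding_ngrams(text.split(), 2)))  # windows over words
```

Tokenizing before building n-grams is therefore the step that decides what unit the n-grams are made of.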

score 1 · Accepted Answer

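The post is old, but I thought I would add my answer here so that most of the n-gram creation approaches can be found in one place.

There is a Python library called TextBlob. It creates n-grams very easily, much like NLTK.

Below is a code snippet, with its output, to make this easier to follow.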

from textblob import TextBlob

sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)

The output is as follows:

unigrams
[WordList(['This']),
 WordList(['is']),
 WordList(['to']),
 WordList(['show']),
 WordList(['the']),
 WordList(['usage']),
 WordList(['of']),
 WordList(['Text']),
 WordList(['Blob']),
 WordList(['in']),
 WordList(['Python'])]

bigrams
[WordList(['This', 'is']),
 WordList(['is', 'to']),
 WordList(['to', 'show']),
 WordList(['show', 'the']),
 WordList(['the', 'usage']),
 WordList(['usage', 'of']),
 WordList(['of', 'Text']),
 WordList(['Text', 'Blob']),
 WordList(['Blob', 'in']),
 WordList(['in', 'Python'])]

trigrams
[WordList(['This', 'is', 'to']),
 WordList(['is', 'to', 'show']),
 WordList(['to', 'show', 'the']),
 WordList(['show', 'the', 'usage']),
 WordList(['the', 'usage', 'of']),
 WordList(['usage', 'of', 'Text']),
 WordList(['of', 'Text', 'Blob']),
 WordList(['Text', 'Blob', 'in']),
 WordList(['Blob', 'in', 'Python'])]

It is as simple as that.

TextBlob can do more than this. For details, see its documentation - https://textblob.readthedocs.io/en/dev/
