python - Counting bigrams (pairs of two words) in a file using Python

Question

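I want to count the number of occurrences of every bigram (pair of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way to do it. I tried applying a count method to the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.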

For example, let's say I want to count the bigrams in a file a.txt, which has the following content:

"the quick person did not realize his speed and the quick person bumped "

For the above file, the set of bigrams and their counts would be:

(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1

I have come across an example that uses Python's Counter object to count unigrams (single words). It also uses a regex approach.

The example goes like this:

>>> # Count every word in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())

The output of the above code is:

[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
 ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]

I was wondering whether the Counter object can be used to get a count of bigrams. Approaches other than the Counter object or regex would also be appreciated.

score 53 · Accepted Answer

Some itertools magic:

>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
...    "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
  ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
  ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
  ('realize', 'his'): 1})

Bonus

Get the frequency of any n-gram:

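from itertools import tee, islice

def ngrams(lst, n):
  # Yield successive n-tuples from any iterable, advancing one item at a time.
  tlst = lst
  while True:
    a, b = tee(tlst)          # a looks ahead n items; b keeps the current position
    l = tuple(islice(a, n))
    if len(l) == n:
      yield l
      next(b)                 # move the start of the window forward by one item
      tlst = b
    else:
      break                   # fewer than n items remain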

>>> Counter(ngrams(words, 3))

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
  ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
  ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
  ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
  ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

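This works with lazy iterables and generators too. So you can write a generator that reads a file line by line and yields words, then pass it to ngrams so that it is consumed lazily, without reading the whole file into memory.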
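The line-by-line idea described above might look like the following minimal sketch. The names `iter_words`, `sliding_ngrams` and `count_ngrams` are hypothetical and do not come from the answer. `sliding_ngrams` swaps the `tee`-based loop for a `collections.deque` window, which should yield the same tuples. It assumes a UTF-8 text file, and n-grams may span line boundaries because the words arrive as one continuous stream.

```python
import re
from collections import Counter, deque

def iter_words(path):
    """Lazily yield words from a text file, one line at a time."""
    with open(path, encoding="utf-8") as f:  # assumption: the file is UTF-8
        for line in f:
            # re.findall only ever sees a single line, so memory use stays small
            yield from re.findall(r"\w+", line)

def sliding_ngrams(iterable, n):
    """Yield n-tuples of consecutive items from any iterable."""
    window = deque(maxlen=n)  # the oldest item drops out automatically
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

def count_ngrams(path, n=2):
    """Count n-grams in a file without loading the whole file into memory."""
    return Counter(sliding_ngrams(iter_words(path), n))
```

Calling `count_ngrams('a.txt', 2)` would then return a Counter of bigrams. The Counter itself still holds every distinct n-gram, so memory grows with the vocabulary rather than with the file size.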

score 14 · Accepted Answer

How about zip()?

import re
from collections import Counter
with open('a.txt') as f:  # close the file once it has been read
    words = re.findall(r'\w+', f.read())
print(Counter(zip(words, words[1:])))

score 5 · Accepted Answer

You can simply use Counter for any n-gram, like so:

from collections import Counter
from nltk.util import ngrams 

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
         ('did', 'not'): 1,
         ('his', 'speed'): 1,
         ('not', 'realize'): 1,
         ('person', 'bumped'): 1,
         ('person', 'did'): 1,
         ('quick', 'person'): 2,
         ('realize', 'his'): 1,
         ('speed', 'and'): 1,
         ('the', 'quick'): 2})

For 3-grams, just change n_gram to 3:

n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
         ('did', 'not', 'realize'): 1,
         ('his', 'speed', 'and'): 1,
         ('not', 'realize', 'his'): 1,
         ('person', 'did', 'not'): 1,
         ('quick', 'person', 'bumped'): 1,
         ('quick', 'person', 'did'): 1,
         ('realize', 'his', 'speed'): 1,
         ('speed', 'and', 'the'): 1,
         ('the', 'quick', 'person'): 2})
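If NLTK is not installed, a similar result can be sketched with the standard library alone. The helper name `simple_ngrams` is hypothetical. It assumes the tokens are already in a list, because it slices the list n times, so it is not suited to lazy input.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return an iterator of n-tuples of consecutive tokens from a list."""
    # zip stops at the shortest slice, so incomplete trailing windows are dropped
    return zip(*(tokens[i:] for i in range(n)))

text = "the quick person did not realize his speed and the quick person bumped "
bigram_counts = Counter(simple_ngrams(text.split(), 2))
trigram_counts = Counter(simple_ngrams(text.split(), 3))
```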

score 1 · Accepted Answer

It has been a long time since I asked this question. Drawing on the answers, I came up with my own solution, which I would like to share:

    import regex
    bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)

This gives all bigrams that are not interrupted by punctuation.
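The third-party regex module is needed for overlapped=True. As a rough standard-library alternative, a zero-width lookahead with a capture group can find overlapping matches with re alone. This is only a sketch: the function name `overlapping_bigrams` is hypothetical, and it is assumed, not verified against regex, to return the same pairs for this pattern.

```python
import re
from collections import Counter

def overlapping_bigrams(text):
    """Return every pair of words separated by exactly one whitespace character."""
    # The lookahead consumes no characters, so after each match the search
    # resumes at the next word boundary and overlapping pairs are found.
    return re.findall(r"\b(?=(\w+\s\w+))", text)

sample = "the quick person, did not"
pairs = overlapping_bigrams(sample)
print(Counter(pairs))
```

In the sample, "person, did" is skipped because a comma sits between the two words, while "quick person" is still reported.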

score 0 · Accepted Answer

You can use CountVectorizer from scikit-learn (pip install scikit-learn) to generate bigrams (or, more generally, any n-gram).

Example (tested with Python 3.6.7 and scikit-learn 0.24.2):

import sklearn.feature_extraction.text

ngram_size = 2
train_set = ['the quick person did not realize his speed and the quick person bumped']

vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))

Output:

>>> print('ngram: {0}\n'.format(ngram)) # Shows the bi-gram count
ngram:   (0, 0) 1
  (0, 1)        1
  (0, 2)        1
  (0, 3)        1
  (0, 4)        1
  (0, 5)        1
  (0, 6)        2
  (0, 7)        1
  (0, 8)        1
  (0, 9)        2

>>> print('ngram.shape: {0}'.format(ngram.shape))
ngram.shape: (1, 10)
>>> print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
vectorizer.vocabulary_: {'the quick': 9, 'quick person': 6, 'person did': 5, 'did not': 1, 
'not realize': 3, 'realize his': 7, 'his speed': 2, 'speed and': 8, 'and the': 0, 
'person bumped': 4}
