python - How to find the count of a word in a string?

Question

I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only prints characters:

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

I want to know how to find the word count.

score 42 · Accepted Answer

If you want to find the count of an individual word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
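
As a minimal sketch, here are both approaches applied to the string from the question. It assumes Python 3, and the names hello_count and word_count are only illustrative:

```python
from collections import Counter

input_string = "Hello I am going to I with Hello am"

# Count a single word (note: str.count matches substrings, not whole words)
hello_count = input_string.count("Hello")

# Tally every whitespace-separated word in one pass
word_count = Counter(input_string.split())

print(hello_count)         # 2
print(word_count["am"])    # 2
print(word_count["to"])    # 1
```

Both calls are case-sensitive, so "Hello" and "hello" would be counted separately unless the string is lowercased first.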

score 6 · Accepted Answer

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

score 5 · Accepted Answer

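from collections import Counter
import re

def countWords(text):
    # Lowercase the text, then tally runs of word characters and apostrophes
    return Counter(re.findall(r"[\w']+", text.lower()))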

Using re.findall is more versatile than split, because otherwise you cannot take into account contractions such as "don't" and "I'll".
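
One way to see the difference is to run both tokenizers on text that contains punctuation and a contraction. This is a small sketch assuming Python 3; the sample sentence is made up for illustration:

```python
from collections import Counter
import re

text = "Don't panic, I said: don't!"

# split() only breaks on whitespace, so punctuation stays glued to words
split_counts = Counter(text.lower().split())

# The pattern keeps letters, digits, underscores and apostrophes together
regex_counts = Counter(re.findall(r"[\w']+", text.lower()))

print(split_counts["don't"])   # 1, because "don't!" is a different token
print(regex_counts["don't"])   # 2
print(regex_counts["panic"])   # 1
```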

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to make many of these queries, this does the O(N) work only once, rather than O(N*#queries) work.
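
A short sketch of that idea, assuming Python 3: the Counter is built in a single pass, and each later query is just a dictionary lookup. The list of query words is hypothetical.

```python
from collections import Counter
import re

text = "Hello I am going to I with hello am"

# One O(N) pass over the text builds the table of counts
counts = Counter(re.findall(r"[\w']+", text.lower()))

# Each query is now a dictionary lookup; a Counter returns 0 for missing keys
for query in ("hello", "am", "going", "python"):
    print(query, counts[query])
```

By contrast, calling text.count(word) for every query rescans the whole string each time.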

score 3 · Accepted Answer

A vector of word occurrence counts is called a bag-of-words.

scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             min_df = 0,          \
                             max_features = 50) 

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print count, tag

Output:

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?
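
Note that "I" does not appear in the CountVectorizer output: by default it lowercases the input and keeps only tokens of at least two word characters. The sketch below is a rough pure-Python approximation of that counting step, assuming Python 3. It is not scikit-learn's actual implementation, and the names vocab and vector are illustrative.

```python
from collections import Counter
import re

text = "Hello I am going to I with hello am"

# Approximate CountVectorizer's defaults: lowercase the input and keep
# only tokens made of at least two word characters (so "i" is dropped)
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

vocab = sorted(counts)                 # vocabulary in alphabetical order
vector = [counts[w] for w in vocab]    # the bag-of-words vector

for tag, count in zip(vocab, vector):
    print(count, tag)
```

This prints the same five lines as the scikit-learn example for this input.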

score 2 · Accepted Answer

Considering Hello and hello as the same word, irrespective of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

score 2 · Accepted Answer

Here is an alternative, case-insensitive approach:

sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

It matches by converting both the string and the target to lowercase.

ps: This also takes care of the "am ham".count("am") == 2 problem with str.count() pointed out by @DSM below :)
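
To make the str.count() pitfall concrete, here is a small sketch assuming Python 3. The variable names are illustrative:

```python
s = "am ham"
target = "am"

# str.count counts substring occurrences, so the "am" inside "ham" is included
substring_hits = s.count(target)

# Comparing whole tokens counts only the standalone word
word_hits = sum(1 for w in s.lower().split() if w == target.lower())

print(substring_hits)  # 2
print(word_hits)       # 1
```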
