python - How to find, count, and store words?

Question

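I am trying to identify and count specific words. I need to store the count for each identifier.

For example,

risk risk risk free interest rate

asterisk risk risk

market risk risk [risk

The document contains the words above, and I need to count "risk" but not "asterisk". I also need "[risk" to be counted as "risk". This is what I have so far. However, it returns counts for "asterisk" and "[risk" as well as "risk". I don't need the count for "asterisk", but I do need "[risk". I tried using regular expressions but keep getting errors. Also, I am a Python beginner. If anyone has any ideas, please help me!! ^^ Thanks.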

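from collections import defaultdict
word_dict = defaultdict(int)

# mylist is the list of text lines to scan
for line in mylist:
    words = line.lower().split()
    for word in words:
        word_dict[word] += 1

# substring test: this also matches 'asterisk' and '[risk'
for word in word_dict:
    if 'risk' in word:
        print(word, word_dict[word])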

score 2 · Accepted Answer

Try regular expressions again. Match the string 'risk' surrounded by word boundaries:

import re
re.findall(r'\brisk\b', 'risk risk') ## 2 matches
re.findall(r'\brisk\b', 'risk risk riskrisk') ## 2 matches
re.findall(r'\brisk\b', 'risk risk riskrisk [risk') ## 3 matches
re.findall(r'\brisk\b', 'risk risk riskrisk [risk asterisk') ## 3 matches
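To turn individual `re.findall` calls into stored counts, the same pattern could be applied line by line. This is a minimal sketch, assuming the text is available as a list of strings. The function `count_risk_per_line` and the sample `lines` are hypothetical names that do not appear in the question, and `re.IGNORECASE` is added on the assumption that capitalised forms such as "Risk" should count too.

```python
import re

# Word-boundary pattern from the answer; IGNORECASE is an added assumption.
RISK_RE = re.compile(r'\brisk\b', re.IGNORECASE)

def count_risk_per_line(lines):
    """Return a dict mapping each line's index to its number of 'risk' matches."""
    return {i: len(RISK_RE.findall(line)) for i, line in enumerate(lines)}

# Hypothetical sample data modelled on the question's examples.
lines = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

counts = count_risk_per_line(lines)
print(counts)                # per-line counts, keyed by line index
print(sum(counts.values()))  # total across all lines
```

Keying the dictionary by line index is only one option. If each document has its own identifier, that identifier could be used as the key instead.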

score 1 · Accepted Answer

I would take a pipelined approach. That is, transform the text before adding words to the dictionary, so that the counts come out right.

from collections import defaultdict

word_dict = defaultdict(int)  # missing keys start at 0, so += 1 works

for line in mylist:
    words = line.strip().lower().split() # the strip gets rid of new lines
    for word in words:
        # the strip here will strip away any surrounding punctuation.
        # add any other symbols to the string that you need
        # the key insight here, is you get rid of extra stuff BEFORE inserting
        # into the dictionary
        word_dict[word.strip('[/@#$%')] += 1

for word in word_dict:
    print(word, word_dict[word])

# to just see the count for risk:
print(word_dict['risk'])

The fact that this also counts the word "asterisk" is not a problem, since "asterisk" is stored under its own key and the count for "risk" stays correct.
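If listing the symbols by hand becomes tedious, one possible variation is to strip every ASCII punctuation character using the standard library's `string.punctuation`. This is a sketch under that assumption, not the answer's exact method. The function name `count_words` and the `sample` data are hypothetical.

```python
import string
from collections import defaultdict

def count_words(lines):
    """Count lower-cased words after stripping surrounding punctuation."""
    word_counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            # Remove leading and trailing punctuation such as '[', ',' or ')'.
            cleaned = word.strip(string.punctuation)
            if cleaned:  # skip tokens that consisted of punctuation only
                word_counts[cleaned] += 1
    return word_counts

# Hypothetical sample lines.
sample = ["Risk, risk. [risk", "asterisk (risk)"]
result = count_words(sample)
print(result["risk"], result["asterisk"])
```

Because the cleaning happens before the dictionary insert, "[risk" and "(risk)" land under the same key as "risk", while "asterisk" keeps a separate key.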

score 0 · Accepted Answer

I think you need to define more strictly which criteria count as `risk` and which do not. In any case, I would use a `Counter`:

from collections import Counter
c = Counter()
# yourfile is the path to your text file
with open(yourfile) as f:
    for line in f:
        c.update(line.split())  # add this line's word counts to c

At this point, you need to write a function that decides whether a word counts as "risk":

def is_risk(word):
    w = word.lower()
    return 'risk' in w and w!='asterisk'

Then just add up the elements corresponding to those keys:

sum( c[k] for k in c if is_risk(k) )
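Put together, the pieces above could look like the following sketch. It assumes the lines are already in memory rather than read from a file, and the sample `lines` are hypothetical. Note that the substring test in `is_risk` would also accept words such as "risky", so the rule may need tightening depending on what should count.

```python
from collections import Counter

def is_risk(word):
    # Same rule as the answer: contains 'risk' but is not exactly 'asterisk'.
    w = word.lower()
    return 'risk' in w and w != 'asterisk'

# Hypothetical in-memory lines standing in for the file contents.
lines = [
    "risk risk risk free interest rate",
    "asterisk risk risk",
    "market risk risk [risk",
]

c = Counter()
for line in lines:
    c.update(line.split())  # accumulate word counts line by line

# Sum the counts of every key that qualifies as 'risk'.
total = sum(c[k] for k in c if is_risk(k))
print(total)
```

Keeping the raw `Counter` separate from the `is_risk` rule means the rule can be changed later without recounting the text.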

score 0 · Accepted Answer

You can try this snippet:

import shlex

words = shlex.split("risk risk risk free interest rate")
# count only exact matches of "risk" or "[risk"
word_count = len([word for word in words if word == "risk" or word == "[risk"])
print(word_count)

score -2 · Accepted Answer

So you count:

'\n' + risk + '\n'
'\n' + risk + ' '
' ' + risk + '\n'
' ' + risk + ' '
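The four combinations above could be expressed as a single regular expression with lookarounds. This is a sketch and not part of the original answer. It differs slightly from the four patterns in that it also accepts "risk" at the very start or end of the text and next to tabs. It does not count "[risk", which the question also wants, so punctuation would still need stripping first. The name `WHOLE_TOKEN_RE` and the sample `text` are hypothetical.

```python
import re

# 'risk' that is neither preceded nor followed by a non-whitespace character.
WHOLE_TOKEN_RE = re.compile(r'(?<!\S)risk(?!\S)')

# Hypothetical sample text with newlines and spaces as separators.
text = "risk risk\nasterisk risk\n[risk risk"
matches = WHOLE_TOKEN_RE.findall(text)
print(len(matches))
```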
