python - How can I add words from a text file to dictionaries depending on the speaker's name?

Question

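I have a text file containing the script of Act 1 of the play Romeo and Juliet, and I want to count how many times each person says each word.

Here is the text: http://pastebin.com/X0gaxAPK

Three people speak in the text: Gregory, Sampson, and Abraham.

Basically, I want to create three different dictionaries, one for each of the three speakers (if that is the best way to do it), fill each dictionary with the words that person says, and count how many times they say each word over the whole script.

How would I go about doing this? I think I can work out the word count, but I am a little unsure how to separate who says what and put the words into three dictionaries, one per person.

My output should look something like this (the numbers are not correct, it is just an example):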

Gregory - 
25: the
15: a
5: from
3: while
1: hello
etc

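Here, each number is how many times that speaker says the word in the file.

Right now I have code that reads the text file, strips the punctuation, and compiles the text into a list. I also don't want to use any external modules; I would like to do it the old-fashioned way so that I learn. Thanks.

You don't need to post exact code. Just explain what I need to do and hopefully I can figure it out. I'm using Python 3.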

score 1 · Accepted Answer

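You don't want to strip the punctuation right away. A name followed by a colon on its own line tells you where one person's speech starts and where the previous one ends. This matters because it tells you which dictionary the words of a given speech should go into. You will probably want some kind of if-else that adds to a different dictionary depending on who is currently speaking.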
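A minimal sketch of that idea, using only plain dictionaries and no imports. It assumes each speaker's name sits on its own line followed by a colon and that a blank line ends the speech; the sample text, the PUNCTUATION string, and the name count_words are made up for illustration and may need adjusting to the real file.

```python
# Hypothetical sample in the assumed layout; the real script would be
# read from the text file instead.
SAMPLE = """Sampson:
Gregory, on my word, we'll not carry coals.

Gregory:
No, for then we should be colliers.

Sampson:
I mean, an we be in choler, we'll draw.
"""

# Characters stripped from both ends of each word (an assumed set).
PUNCTUATION = ".,;:!?\"()[]-"

def count_words(lines):
    gregory, sampson, abraham = {}, {}, {}
    current = None  # dictionary of whoever is speaking right now
    for line in lines:
        line = line.strip()
        if not line:
            current = None          # blank line: the speech is over
        elif line == "Gregory:":
            current = gregory
        elif line == "Sampson:":
            current = sampson
        elif line == "Abraham:":
            current = abraham
        elif current is not None:
            for word in line.split():
                word = word.strip(PUNCTUATION).lower()
                if word:
                    current[word] = current.get(word, 0) + 1
    return gregory, sampson, abraham

gregory, sampson, abraham = count_words(SAMPLE.splitlines())
print(sampson["we'll"])  # 2
```

Punctuation is only stripped after the speaker line has been recognised, which is why the colon is still available as a marker.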

score 1 · Accepted Answer

import collections
import string
c = collections.defaultdict(collections.Counter)
speaker = None

with open('/tmp/spam.txt') as f:
  for line in f:
    if not line.strip():
      # we're on an empty line, the last guy has finished blabbing
      speaker = None
      continue
    if line.count(' ') == 0 and line.strip().endswith(':'):
      # a new guy is talking now, you might want to refine this event
      speaker = line.strip()[:-1]
      continue
    c[speaker].update(x.strip(string.punctuation).lower() for x in line.split())

Example output:

In [1]: run /tmp/spam.py

In [2]: c.keys()
Out[2]: [None, 'Abraham', 'Gregory', 'Sampson']

In [3]: c['Gregory'].most_common(10)
Out[3]: 
[('the', 7),
 ('thou', 6),
 ('to', 6),
 ('of', 4),
 ('and', 4),
 ('art', 3),
 ('is', 3),
 ('it', 3),
 ('no', 3),
 ('i', 3)]
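To print a speaker's counts in the "count: word" layout from the question, most_common() already returns the pairs sorted by descending count. A small sketch, where the helper name format_counts and the sample counter are assumptions for illustration:

```python
import collections

def format_counts(name, counter):
    """Return lines such as '7: the', most frequent word first."""
    lines = [name + " -"]
    for word, count in counter.most_common():
        lines.append("{}: {}".format(count, word))
    return "\n".join(lines)

# Hypothetical counts; with the code above this would be c['Gregory'].
sample = collections.Counter({"the": 7, "thou": 6, "of": 4})
print(format_counts("Gregory", sample))
```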

score 1 · Accepted Answer

Here is a naive implementation:

from collections import defaultdict

import nltk  # third-party; word_tokenize also needs NLTK's tokenizer data installed

def is_dialogue(line):
    # Add more rules to check if the
    # line is a dialogue or not.
    # Lines containing square brackets (e.g. stage directions) are skipped.
    return len(line) > 0 and '[' not in line and ']' not in line

def get_dialogues(filename, people_list):
    dialogues = defaultdict(list)
    # Speaker lines look like 'Sampson:'. Build a set rather than a lazy
    # map object so the membership test works on every line in Python 3.
    people = {name + ':' for name in people_list}
    current_person = None
    with open(filename) as fin:
        for line in fin:
            current_line = line.strip()
            if current_line in people:
                # Note: the dictionary keys keep the trailing colon.
                current_person = current_line
            elif current_person is not None and is_dialogue(current_line):
                dialogues[current_person].append(current_line)
    return dialogues

def get_word_counts(dialogues):
    word_counts = defaultdict(dict)
    for (person, dialogue_list) in dialogues.items():
        word_count = defaultdict(int)
        for dialogue in dialogue_list:
            for word in nltk.tokenize.word_tokenize(dialogue):
                word_count[word] += 1
        word_counts[person] = word_count
    return word_counts

if __name__ == '__main__':
    dialogues = get_dialogues('script.txt', ['Sampson', 'Gregory', 'Abraham'])
    word_counts = get_word_counts(dialogues)
    print(word_counts)
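Since the question asks for no external modules, the nltk.tokenize.word_tokenize call in get_word_counts could be swapped for a small standard-library tokenizer. This is a substitute technique, not an equivalent of NLTK's tokenizer (which, for instance, emits punctuation marks as separate tokens); the name simple_tokenize is hypothetical.

```python
import string

def simple_tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that were only punctuation, e.g. "--"
            words.append(word)
    return words

print(simple_tokenize("Draw thy tool! Here comes two -- of the house."))
```

Because only the ends of each token are stripped, contractions such as "we'll" stay in one piece, while a leading apostrophe (as in "'tis") is removed.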
