python - Python program that finds the most frequently used words in a .txt file; must print each word and its count

Question

So far I have a function to replace the countChars function:

def countWords(lines):
  wordDict = {}
  for line in lines:
    wordList = lines.split()
    for word in wordList:
      if word in wordDict: wordDict[word] += 1
      else: wordDict[word] = 1
  return wordDict

But when I run the program it spits out this abomination (this is just a sample; there are about two pages of words, each with a huge number next to it):

before 1478
battle-field 1478
as 1478
any 1478
altogether 1478
all 1478
ago 1478
advanced. 1478
add 1478
above 1478

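Obviously this means the code is sound enough to run, but I'm not getting what I want out of it. It needs to print how many times each word appears in the file (gb.txt, which is the Gettysburg Address). Obviously, each word in the file does not appear exactly 1478 times.

I'm pretty new to programming, so I'm kind of stumped.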

from __future__ import division

inputFileName = 'gb.txt'

def readfile(fname):
  f = open(fname, 'r')
  s = f.read()
  f.close()
  return s.lower()

def countChars(t):
  charDict = {}
  for char in t:
    if char in charDict: charDict[char] += 1
    else: charDict[char] = 1
  return charDict

def findMostCommon(charDict):
  mostFreq = ''
  mostFreqCount = 0
  for k in charDict:
    if charDict[k] > mostFreqCount:
      mostFreqCount = charDict[k]
      mostFreq = k
  return mostFreq

def printCounts(charDict):
  for k in charDict:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printAlphabetically(charDict):
  keyList = charDict.keys()
  keyList.sort()
  for k in keyList:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printByFreq(charDict):
  aList = []
  for k in charDict:
    aList.append([charDict[k], k])
  aList.sort()     #Sort into ascending order
  aList.reverse()  #Put in descending order
  for item in aList:
    #First, handle some chars that don't show up very well when they print
    if item[1] == '\n': print '\\n', item[0]  #newline
    elif item[1] == ' ': print 'space', item[0]
    elif item[1] == '\t': print '\\t', item[0] #tab
    else: print item[1], item[0]  #Normal character - print it with its count

def main():
  text = readfile(inputFileName)
  charCounts = countChars(text)
  mostCommon = findMostCommon(charCounts)
  #print mostCommon + ':', charCounts[mostCommon]
  #printCounts(charCounts)
  #printAlphabetically(charCounts)
  printByFreq(charCounts)

main()

score 25 · Accepted Answer

If you need to count the words in a passage, I recommend using regular expressions.

Let's start with a simple example:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string) #This finds words in the document

Result:

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']

Note that "Is" and "is" are treated as two different words. My guess is that you want to count them as the same word, so you can uppercase all the words before counting.

from collections import Counter

cap_words = [word.upper() for word in words] #capitalizes all the words

word_counts = Counter(cap_words) #counts the number each time a word appears

Result:

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})

Good so far?

Now we do exactly the same thing as above, this time reading the text from a file:

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)
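
Since the question asks for the most frequent word, the counting steps above can be finished off with `Counter.most_common`. The following is only a minimal Python 3 sketch: the function names `count_words` and `most_common_in_file` and the sample string are illustrative and not part of the original answer.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter of case-insensitive word frequencies in text."""
    words = re.findall(r"\w+", text)
    return Counter(word.upper() for word in words)

def most_common_in_file(path, n=1):
    """Read a text file and return its n most frequent (word, count) pairs."""
    with open(path) as f:
        return count_words(f.read()).most_common(n)

# Small in-memory check that does not need a file on disk
sample = "Wow! Is this true? Really!?!? This is crazy!"
top_word, top_count = count_words(sample).most_common(1)[0]
print(top_word, top_count)
```

For the file in the question, `most_common_in_file('gb.txt', 10)` would return the ten most frequent words together with their counts.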

score 17 · Accepted Answer

With powerful tools at your disposal, this program is really a four-liner:

import collections
import re

with open(yourfile) as f:
    text = f.read()

words = re.compile(r"[\w']+", re.U).findall(text)   # re.U == re.UNICODE
counts = collections.Counter(words)

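The regular expression finds all words regardless of adjacent punctuation (but counts apostrophes as part of a word).

A Counter works almost like a dictionary, but you can also do things like counts.most_common(10), add counts together, and so on. See help(Counter).

I would also suggest not writing printBy... functions, since only functions without side effects are easy to reuse.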
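
As a small illustration of those two Counter features, here is a Python 3 sketch using made-up word lists (the variable names are illustrative only):

```python
from collections import Counter

# Two separate tallies, e.g. from two different passages
first = Counter(["a", "b", "a"])
second = Counter(["a", "c"])

# Adding Counters sums the counts key by key
combined = first + second

# most_common(n) returns the n highest (item, count) pairs
top = combined.most_common(1)
print(top)  # [('a', 3)]
```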

def countsSortedAlphabetically(counter, **kw):
    return sorted(counter.items(), **kw)

#def countsSortedNumerically(counter, **kw):
#    return sorted(counter.items(), key=lambda x:x[1], **kw)
#### use counter.most_common(n) instead

# `from pprint import pprint as pp` is also useful
def printByLine(tuples):
    print( '\n'.join(' '.join(map(str,t)) for t in tuples) )

Demo:

>>> words = Counter(['test','is','a','test'])
>>> printByLine( countsSortedAlphabetically(words, reverse=True) )
test 2
is 1
a 1

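Edited in response to Mateusz Konieczny's comment: replaced [a-zA-Z'] with [\w']. According to the Python docs, the character class \w "Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched." (...but apparently it does not match apostrophes...) However, \w includes _ and 0-9, so if you don't want those and you are not working with Unicode, you can use [a-zA-Z']; if you are working with Unicode, you would need a negative assertion or something similar to subtract [0-9_] from the \w character class.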
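
One way to do that subtraction, sketched here for Python 3 (where str patterns are Unicode-aware by default), is a negated character class rather than a lookaround assertion: `[^\W\d_]` matches any `\w` character that is neither a digit nor an underscore. The pattern name and helper function below are illustrative, not from the answer.

```python
import re

# [^\W\d_] = "a \w character that is not a digit and not an underscore".
# The alternation additionally accepts apostrophes inside words.
LETTERS_AND_APOSTROPHES = re.compile(r"(?:[^\W\d_]|')+")

def find_letter_words(text):
    """Return runs of letters/apostrophes, skipping digits and underscores."""
    return LETTERS_AND_APOSTROPHES.findall(text)

print(find_letter_words("don't stop_123 caf\u00e9"))
```

With this pattern an underscore or digit acts as a word separator, so `stop_123` contributes only `stop`.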

score 3 · Accepted Answer

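~~You have a simple typo: words where you want word.~~

~~Edit: It looks like you edited the source. Please use copy and paste to get it right the first time.~~

Edit 2: Apparently you're not the only one prone to typos. The real problem is that you have lines where you want line. I apologize for accusing you of editing the source.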
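
To illustrate the fix, here is a minimal Python 3 sketch of the corrected loop (the name `count_words` and the sample text are illustrative, not from the question). It assumes the function receives a list of lines, for example from `text.splitlines()`. The question's `readfile` returns a single string, and iterating over a string yields one character at a time, which is presumably why every count was inflated to 1478: the whole text was split and tallied once per loop iteration.

```python
def count_words(lines):
    """Count whitespace-separated words across an iterable of lines."""
    word_counts = {}
    for line in lines:
        # Split the current line, not the whole input
        for word in line.split():
            word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

text = "four score and seven\nyears ago our fathers\nfour"
print(count_words(text.splitlines()))
```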

score 2 · Accepted Answer

It's not as elegant as ninjagecko's, but here is a possible solution:

from collections import defaultdict

dicto = defaultdict(int)

with open('yourfile.txt') as f:
    for line in f:
        s_line = line.rstrip().split(',') #assuming ',' is the delimiter
        for ele in s_line:
            dicto[ele] += 1

#dicto contains words as keys, word counts as values

for k,v in dicto.iteritems():
    print k,v
