python - Converting word frequencies into a graphical histogram in Python

Question

Thanks to Pavel Anossov, this is what I have so far. I am trying to convert the word frequencies in the output into asterisks.

from collections import Counter

def candidateWord():
    with open("sample.txt", 'r') as f:
        text = f.read()
    # Lowercase, split on whitespace, and strip punctuation and digits from both ends of each token.
    words = [w.strip('!,.?1234567890-=@#$%^&*()_+') for w in text.lower().split()]
    counter = Counter(words)

    # Print "word count" pairs, most frequent first.
    print("\n".join("{} {}".format(*p) for p in counter.most_common()))

candidateWord()

This is my current output:

how 3
i 2
am 2
are 2
you 2
good 1
hbjkdfd 1

The formula I want to use is this: if the most frequent word occurs M times and the current word occurs N times, the number of asterisks printed is:

(50 * N) / M
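
As a quick check of the formula against the sample output in the question (where M = 3 for "how"), here is a minimal sketch. The function name asterisk_count and its width parameter are hypothetical, and it assumes the fractional part should simply be dropped, which the question does not state.

```python
def asterisk_count(n, m, width=50):
    """Scale frequency n against the maximum frequency m to at most `width` asterisks.

    Assumption: the fractional part is dropped (integer division).
    """
    return (width * n) // m

# Frequencies taken from the sample output in the question; M is 3.
for word, n in [("how", 3), ("i", 2), ("good", 1)]:
    print(word, "*" * asterisk_count(n, 3))
```

With these numbers the most frequent word gets the full 50 asterisks, a word seen twice gets 33, and a word seen once gets 16.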

score 0 · Accepted Answer

To avoid having to align the words, I would put the asterisks on the left.

...
counter = Counter(words)
max_freq = counter.most_common()[0][1]
for word, freq in sorted(counter.most_common(), key=lambda p: (-p[1], p[0])):
    number_of_asterisks = (50 * freq ) // max_freq     # (50 * N) / M
    asterisks = '*' * number_of_asterisks        # the (50*N)/M asterisks
    print('{:>50} {}'.format(asterisks, word))

In the format string, :>50 means "left-pad with spaces to 50 characters".
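
To see the padding in isolation, here is a small sketch that uses a 10-character field instead of 50 so the result stays narrow. The variable name line is only for illustration.

```python
# Right-align three asterisks in a 10-character field, then append the word.
line = '{:>10} {}'.format('***', 'word')

# repr() makes the leading spaces visible: seven spaces, then '*** word'.
print(repr(line))
```

The string is padded on the left, so the asterisks end at the same column on every row and the words start at the same column.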

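counter.most_common() returns a list of (word, frequency) pairs sorted by frequency.
counter.most_common()[0][1] is the second element of the first pair, i.e. the maximum frequency.
We loop over counter.most_common(), sorted first by descending frequency and then by word.
number_of_asterisks is calculated with your formula. Integer division (//) is used to get an integer result.
'*' is repeated number_of_asterisks times and the result is stored in asterisks.
asterisks and word are printed. The asterisks are right-aligned in a 50-character-wide column.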
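
The steps above can be put together in a self-contained sketch that works on a string instead of sample.txt. The function name histogram_lines, the width parameter, and the SAMPLE text are assumptions made for illustration. Skipping tokens that become empty after stripping is also an addition that is not in the original code.

```python
from collections import Counter

# Hypothetical input, chosen to reproduce the word counts shown in the question.
SAMPLE = "How are you? How am I? I am good. How are you, hbjkdfd"

def histogram_lines(text, width=50):
    """Return one right-aligned asterisk bar per word, most frequent first."""
    words = [w.strip('!,.?1234567890-=@#$%^&*()_+') for w in text.lower().split()]
    # Assumption: drop tokens that are empty after stripping (e.g. "123").
    counter = Counter(w for w in words if w)
    if not counter:
        return []
    max_freq = counter.most_common(1)[0][1]
    lines = []
    # Sort by descending frequency, then alphabetically by word.
    for word, freq in sorted(counter.items(), key=lambda p: (-p[1], p[0])):
        asterisks = '*' * ((width * freq) // max_freq)  # (50 * N) // M
        lines.append('{:>{w}} {}'.format(asterisks, word, w=width))
    return lines

for line in histogram_lines(SAMPLE):
    print(line)
```

Returning the lines instead of printing them inside the function keeps the formatting easy to check separately from the file handling.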

score 0 · Accepted Answer

Code:

from collections import Counter

def candidateWord():
    with open("sample.txt", 'r') as f:
        text = f.read()
    words = [w.strip('!,.?1234567890-=@#$%^&*()_+') for w in text.lower().split()]
    counter = Counter(words)

    # I added the code below...
    columns = 80         # total width of the plot in characters
    n_occurrences = 10   # number of most common words to plot
    to_plot = counter.most_common(n_occurrences)
    labels, values = zip(*to_plot)
    label_width = max(map(len, labels))
    data_width = columns - label_width - 1   # one column is taken by the '|' separator
    plot_format = '{:%d}|{:%d}' % (label_width, data_width)
    max_value = float(max(values))
    for label, value in zip(labels, values):
        # Scale each count so that the most common word fills the whole data column.
        v = int(value / max_value * data_width)
        print(plot_format.format(label, '*' * v))

candidateWord()

Output:

the |***************************************************************************
and |**********************************************                             
of  |******************************************                                 
to  |***************************                                                
a   |************************                                                   
in  |********************                                                       
that|******************                                                         
i   |****************                                                           
was |*************                                                              
it  |**********
