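I have a method that counts the frequency of words in a string. I manually list the words that should be removed. What I've found is that for short strings "the" is removed, but for a long string like the one below the method still outputs "the". Any ideas on why this happens and how to fix it?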

def count_words(string)
    words = string.downcase.split(' ')

    delete_list = ['the']
    delete_list.each do |del|
        words.delete_at(words.index(del))
    end

    frequency = Hash.new(0)
    words.each do |word|
        frequency[word.downcase] += 1
    end

    return frequency.sort_by {|k,v| v}.reverse
end

puts count_words('Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
sales metrics often do not reflect the contributions of the role, which demonstrates that line management is out of touch of what the individual contributors role really does
middle management does not care about the career of his/her directs, 90% of the time management competes directly with their people, or takes credit for their work
lots of back stabbing going on
Microsoft changes the organization or commitment or comp model, faster than the average deal cycle, making it next to near impossible to develop momentum in role or a rhythm of success
execs promote themselves in years when they freeze employees merit increases
only way to advance is to step on your peers/colleagues and take credit for work you had no impact on, beat your chest loud enough and you get "visibility" you need to advance
visibility is not based on performance by enlarge, it is based on being in your manager\'s swim lane for advancement
I have observed people get promoted in years when they did not meet their quota, nor did the earn the highest performance on the team, they kissed their way to the promotion
Advice to Senior Management 1, get back to risk taking and teaming, less politics please, you are killing the company
2, set realistic commitments and stick to them for multiple years, stop changing the game faster than your people can react
3, stop over engineering commitments and over segmenting the company, people are not willing to collaborate or be corporate citizens
4, too many empty suits in middle management, keep flattening out the company and getting rid of middle managers that run reports all day, get back to a culture where managers also sell and drives wins
5, keep your word microsoft, you said stability, but you keep tinkering with the org too much for any changes to take affect A great Culture
Limitless opportunities
Supportive Management team who are passionate about people
A company that really does want you to have a good work life balance and backs it up with policies that enable you to manage how and where you work.
Cons Support resources are constrained
Can be overly competitve and hard to get noticed
Sales rewards are definitely prioritised and marketing cuts are always prioritised.
Consumer organisation is still far from ideal.
Advice to Senior Management Focus on getting the internal organisation simplified to improve performance and increase empowerment.
Get some REAL consumer focus and invest for the long term
Start connecting with people, focussing on telling stories rather than selling products.')
2 Answers

This is a common problem when analyzing web pages for SEO. Here's a simple version of the kind of thing I've written:

require 'pp'

STOP_WORDS = %w[a and of the]

def count_words(string)

  word_count = string
    .downcase
    .gsub(/[^a-z ]+/, '')
    .split
    .group_by{ |w| w }

  STOP_WORDS.each do |stop_word|
    word_count.delete(stop_word)
  end

  word_count
    .map{ |k,v| [k, v.size]}
    .sort_by{ |k, c| [-c, k] }
end

pp count_words(<<EOT)
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.
EOT

I deliberately truncated the sample data for readability.

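On that subject, when you need to pass in a lot of text you can use a heredoc (<<) to keep the code nicely formatted. Alternatively, put an __END__ marker at the end of the script, place the text after it, and read that trailing block through the special IO object DATA: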

pp count_words(DATA.read)

__END__
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.

In either case, the code outputs:

[["are", 1],
 ["benefits", 1],
 ["buildingstart", 1],
 ["compensation", 1],
 ["connecting", 1],
 ["cons", 1],
 ["control", 1],
 ["empty", 1],
 ["fair", 1],
 ["focussing", 1],
 ["gates", 1],
 ["gotten", 1],
 ["great", 1],
 ["have", 1],
 ["left", 1],
 ["little", 1],
 ["management", 1],
 ["middle", 1],
 ["off", 1],
 ["on", 1],
 ["out", 1],
 ["people", 1],
 ["products", 1],
 ["pros", 1],
 ["rather", 1],
 ["reasonable", 1],
 ["risk", 1],
 ["selling", 1],
 ["since", 1],
 ["stories", 1],
 ["suits", 1],
 ["takingpolitics", 1],
 ["telling", 1],
 ["than", 1],
 ["time", 1],
 ["very", 1],
 ["vision", 1],
 ["void", 1],
 ["with", 1]]

gsub(/[^a-z ]+/, '') strips out anything that isn't a letter or a space. Enumerable's group_by does the heavy lifting, and Enumerable's sort_by makes it easy to sort by descending count and then alphabetically by word.
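
One side effect worth knowing: the character class [^a-z ] also matches newlines, so "building.." at the end of one line and "Start" at the beginning of the next end up as the single token buildingstart. A possible variant, offered as a sketch rather than as the method used above, replaces each run of non-letters with a space instead. The helper name tokenize and the sample text are made up for illustration:

```ruby
# Hypothetical helper: turn every run of non-letters (including newlines)
# into a single space so words on adjacent lines stay separate.
def tokenize(string)
  string.downcase.gsub(/[^a-z]+/, ' ').split
end

text = "left the building..\nStart connecting"

# Original pattern: punctuation and the newline are deleted outright.
p text.downcase.gsub(/[^a-z ]+/, '').split
# => ["left", "the", "buildingstart", "connecting"]

# Variant: the same characters become a separator instead.
p tokenize(text)
# => ["left", "the", "building", "start", "connecting"]
```

The trade-off is that apostrophes also become separators, so a word like "manager's" splits into "manager" and "s".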

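When removing the stop words, I work on the hash rather than the array. That's because it's usually faster to iterate over the STOP_WORDS list than over the words in the corpus; a large corpus will very likely contain more words than there are stop words.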
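
To make that trade-off concrete, here is a rough sketch of both approaches on a tiny made-up word list. Filtering the array checks every word in the corpus against the stop list, while deleting from the hash does one lookup per stop word. Both should give the same counts:

```ruby
# Made-up sample data for illustration only.
stop_words = %w[a and of the]
words = %w[the end of the road and the start of a path]

# Approach 1: scan the whole corpus, checking each word against the stop list.
filtered = words.reject { |w| stop_words.include?(w) }
counts_from_array = filtered.group_by { |w| w }.map { |k, v| [k, v.size] }.to_h

# Approach 2: group first, then delete one hash key per stop word.
grouped = words.group_by { |w| w }
stop_words.each { |sw| grouped.delete(sw) }
counts_from_hash = grouped.map { |k, v| [k, v.size] }.to_h

p counts_from_array == counts_from_hash
# => true
```

On a list this small the difference is not measurable; it only starts to matter when the corpus is much larger than the stop list.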

answered 2013-04-10T03:26:11.567
Just use words.delete("the"). All you need to do is give it the key.

A simpler version of your program would look like this:

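def count_words(string)
  # Count every word first; Hash.new(0) makes missing keys start at 0.
  words = string.downcase.split(' ').each_with_object(Hash.new(0)) { |w,o| o[w] += 1 }

  delete_list = ['the']

  # Hash#delete removes the key, and with it that word's whole count.
  delete_list.each { |del| words.delete(del) }

  # Sort by count, highest first.
  words.sort_by {|k,v| v}.reverse
end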
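
As for why the original code still printed "the" on the long string: Array#index returns only the position of the first match, so delete_at(words.index(del)) removes a single occurrence. A short string containing one "the" looks fine, while a long one containing several does not. A minimal sketch using a made-up sentence:

```ruby
# Hypothetical sample input, not taken from the question.
words = 'the cat saw the dog chase the bird'.split

# Array#index finds only the first match, so delete_at removes one "the".
first_only = words.dup
first_only.delete_at(first_only.index('the'))

# Array#delete removes every element equal to its argument.
all_gone = words.dup
all_gone.delete('the')

p first_only.count('the')
# => 2
p all_gone.count('the')
# => 0
```

Note also that if a word in the delete list never appears, words.index returns nil and delete_at(nil) raises a TypeError, whereas delete simply returns nil.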
answered 2013-04-10T01:57:54.760