ruby - Simple filtering of common words from a text description

Question

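Words like "a", "the", "best", "kind", and so on. I'm sure there is a good way to achieve this.

To clarify, I'm looking for:

The simplest solution that can be implemented, preferably in Ruby.
A high tolerance for errors is fine.
If a library of common phrases is what I need, I'm perfectly happy with that as well.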

score 2 · Accepted Answer

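These common words are known as "stop words". There is a similar Stack Overflow question about this: "Stop words" list for English?

To summarize:

If you have a large amount of text to process, it is worth gathering statistics on word frequency in that particular data set and using the most frequent words as your stop word list. (That your examples include "kind" suggests you may have a rather unusual data set, with lots of colloquial expressions such as "kind of", so you probably need to do this.)
Since you say you don't care much about errors, it may be enough to simply use a list of English stop words that someone else has compiled.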
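The first suggestion above can be sketched roughly as follows. This is a minimal illustration, not part of the original answer: `frequent_words` and its `limit` parameter are hypothetical names, and the tokenizing regex assumes plain English text.

```ruby
# Hypothetical helper: returns the `limit` most frequent words in `text`,
# which could then serve as a data-set-specific stop word list.
def frequent_words(text, limit = 10)
  counts = Hash.new(0)
  # Lowercase first so "The" and "the" are counted together.
  text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  # Sort by descending count, breaking ties alphabetically.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

sample = "The cat sat on the mat. The dog sat on the log."
p frequent_words(sample, 3)  # prints ["the", "on", "sat"]
```

On a real corpus the top of this list would still need a quick manual review, since frequent words are not always words you want to discard.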

Once you have the list, you can easily filter text by putting those words into a hash in your program and checking each word against it.

score 1 · Accepted Answer

  Common = %w{ a and or to the is in be }
Uncommon = %{
  To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
}.split /\b/
ignore_me, result = {}, []
  Common.each { |w| ignore_me[w.downcase] = :Common          }
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join

 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come

score 1 · Accepted Answer

This is a variation on DigitalRoss's answer.

str=<<EOF
To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')

Also related: What's the fastest way to check if a word from one string is in another string?

score 0 · Accepted Answer

If you have an array of the words to remove, named stop_words, then this expression gives you the result:

description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

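If you want to preserve the non-word characters between each word, then:

# (\W*) rather than (\W+), so the final word is kept even when nothing follows it
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join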
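Note that `stop_words.include? word` on an array is case-sensitive and scans the whole array for every word. A possible variation, sketched here under the assumption that matching should ignore case, keeps the stop words in a `Set`; `remove_stop_words` is a hypothetical name and the word list is only illustrative.

```ruby
require 'set'

# Hypothetical helper: drops stop words case-insensitively.
# `stop_words` is expected to be a Set of lowercase words.
def remove_stop_words(description, stop_words)
  description.scan(/\w+/)
             .reject { |word| stop_words.include?(word.downcase) }
             .join(' ')
end

stop_words = Set.new(%w[a and or to the is in be])  # illustrative list
puts remove_stop_words("To be, or not to be: that is the question", stop_words)
# prints: not that question
```

Set lookups take roughly constant time, which matters mainly when the stop word list or the text is large.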

score 0 · Accepted Answer

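Hold on: before you take out stop words (also known as noise words or junk words), you need to do some research. Index size and processing resources are not the only issues. A lot depends on whether end users will be typing queries or you will be working with long automated queries.

All search log analysis shows that people tend to type one to three words per query. When that is all a search has to work with, you can't afford to lose anything. For example, a collection might have the word "copyright" in many documents, making it very common, but if the word is not in the index it is impossible to do exact phrase searches or proximity-based relevance ranking. In addition, there are perfectly legitimate reasons to search for the most common words. People may be looking for "The Who" or, worse, "The The".

So, while there are technical issues to consider and removing stop words is one solution, it may not be the appropriate solution for the overall problem you are trying to solve.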
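One way to soften the problem described above is to strip stop words only from longer queries and never return an empty query. The sketch below is an illustration under stated assumptions, not something the answer prescribes: `normalize_query` is a hypothetical name, and the `min_words` threshold of 4 is an arbitrary choice prompted by the "one to three words" observation.

```ruby
require 'set'

# Hypothetical helper: leaves short queries untouched, and falls back to the
# original words if removing stop words would leave nothing to search for.
def normalize_query(query, stop_words, min_words = 4)
  words = query.scan(/\w+/)
  return words.join(' ') if words.size < min_words

  kept = words.reject { |word| stop_words.include?(word.downcase) }
  kept.empty? ? words.join(' ') : kept.join(' ')
end

stop_words = Set.new(%w[a the of who])  # illustrative list
puts normalize_query("The Who", stop_words)                     # prints: The Who
puts normalize_query("the best kind of the pizza", stop_words)  # prints: best kind pizza
```

This does not address the indexing side (phrase search and proximity ranking still need the words in the index); it only covers the query-side behavior.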
