ruby - Prevent or remove duplicates in a text scraper?

Question

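I have code that parses the text files in a folder and saves a predefined number of words before and after certain search terms.

For example, it searches for words such as "date" and "year". If both are found in the same sentence, that sentence is saved twice. In addition, if the same word is used several times in a sentence, the sentence is saved once per occurrence.

As a result, the scraper saves a large amount of unnecessary duplicate text.

I can think of two possible solutions:

If the next search match lies within the padding of the previous group of words, it is not saved.
For example, if the group of 7 words around a search match is also part of the previous group, it is not saved, or it is removed afterwards.

Everything I have tried so far has failed completely: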

#helper
def indices text, index, word
    padding = 200
    bottom_i = index - padding < 0 ? 0 : index - padding
    top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
    return bottom_i, top_i
end

#script
base_text = File.open("base.txt", 'w')
Dir.mkdir("summaries") unless File.exist?("summaries")
Dir.chdir("summaries")

Dir.glob("*.txt").each do |textfile|
    whole_file = File.open(textfile, 'r').read
    puts "Currently summarizing " + textfile + "..."
    curr_i = 0
    str = nil
    whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
      if i_match = whole_file.index(match, curr_i)
        top_bottom = indices(whole_file, i_match, match)
        base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
        curr_i += i_match
      end
    end
    puts "Done summarizing " + textfile + "."
end
base_text.close

score 0 · Accepted Answer

Hopefully something better than the following:

whole_file.scan(Regexp.union(/firstword/, /secondword/)).each do |match|
  if i_match = whole_file.index(match, curr_i)
    top_bottom = indices(whole_file, i_match, match)
    base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
    curr_i += i_match + 50
  end
end
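
The skip-ahead idea can also be expressed without a running `curr_i`, by reading each match position from `Regexp.last_match` and remembering where the last saved excerpt ended. This is only a minimal sketch: the helper name `excerpts_without_overlap`, the sample text, and the padding of 15 characters are assumptions made for illustration and are not part of the original script.

```ruby
# Hypothetical helper (not from the original script): returns one excerpt per
# match, skipping any match that is already fully contained in the previous excerpt.
def excerpts_without_overlap(text, pattern, padding)
  excerpts = []
  last_end = -1 # index of the last character already saved
  text.scan(pattern) do
    m = Regexp.last_match
    # Skip this match if the previous excerpt already covers it completely.
    next if m.end(0) - 1 <= last_end

    from = [m.begin(0) - padding, 0].max
    to = [m.end(0) + padding, text.length].min
    excerpts << text[from...to]
    last_end = to - 1
  end
  excerpts
end

sample = "The date and the year are both here. Much later, another year appears."
p excerpts_without_overlap(sample, Regexp.union(/date/, /year/), 15)
```

With this sample, the first "year" falls inside the excerpt saved for "date" and is skipped, so only two excerpts are returned.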

score 0 · Accepted Answer

Why not keep track of what you are looking for:

search_words = %w( year date etc )

Then downcase the string you are searching and start an index:

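def summarize(str, search_words, delta = 200)
  search_str = str.downcase
  ind = 0

Then find the smallest index offset of any search word in search_str, save the text from (ind + offset - delta) to (ind + offset + delta), cut everything up to that point out of search_str, and continue the while loop. Something like:

  matches = []
  # offset is the position of the nearest search word in what is left of search_str
  while (offset = search_words.map { |w| search_str.index(w) }.compact.min)
    ind += offset                  # absolute position of the match in str
    start = [ind - delta, 0].max   # clamp: a negative start would count from the end
    matches.push str[start, ind + delta - start]
    search_str = search_str[(offset + delta)..-1] || ""  # skip the text just saved
    ind += delta                   # keep ind aligned with the start of search_str
  end
  matches
end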
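
A different technique from the one in this answer is to merge overlapping ranges instead of skipping ahead: compute the padded range of every match first, then join ranges that overlap so each stretch of text is emitted once. The sketch below assumes character-based padding; `merged_excerpts` and the sample text are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper (not from the answer above): merges the padded ranges of
# all matches so that overlapping excerpts are emitted only once.
def merged_excerpts(text, pattern, padding)
  ranges = [] # list of [from, to) pairs, in ascending order
  text.scan(pattern) do
    m = Regexp.last_match
    from = [m.begin(0) - padding, 0].max
    to = [m.end(0) + padding, text.length].min
    if ranges.any? && from <= ranges.last[1]
      # Overlaps the previous range: extend it instead of adding a duplicate.
      ranges.last[1] = [ranges.last[1], to].max
    else
      ranges << [from, to]
    end
  end
  ranges.map { |from, to| text[from...to] }
end

sample = "The date and the year are both here. Much later, another year appears."
p merged_excerpts(sample, Regexp.union(/date/, /year/), 15)
```

Because `scan` yields matches from left to right, comparing each new range only with the last stored one is enough. Unlike skipping, merging keeps the full padding after the last match in a cluster, at the cost of longer excerpts.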
