ruby - Fastest way to find a specific word in an XHTML document

Question

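What would be the fastest way to do this?

I have many html documents that may (or may not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" together with the lines that follow it.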

score 1 · Accepted Answer

Maybe something along these lines:

require 'rubygems'
require 'nokogiri'

def find_instructions(doc)
  doc.xpath('//body//text()').each do |text|
    # String is not Enumerable in Ruby 1.9+, so split into lines first
    instructions = text.content.lines.select do |line|
      # flip-flop matches all sections starting with
      # "Instructions" and ending with an empty line
      true if (line =~ /Instructions/)..(line =~ /^$/)
    end
    return instructions unless instructions.empty?
  end
  return []
end

puts find_instructions(Nokogiri::HTML(DATA.read))


__END__
<html>
<head>
  <title>Instructions</title>
</head>
<body>
lorem
ipsum
<p>
lorem
ipsum
<p>
lorem
ipsum
<p>
Instructions
- Browse stackoverflow
- Answer questions
- ???
- Profit

More
<p>
lorem
ipsum
</body>
</html>
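
For readers who find the flip-flop hard to follow, here is a minimal sketch of the same "from Instructions to the next blank line" idea using only the standard library. The helper name instruction_block and the sample string are made up for illustration. Unlike the flip-flop version, it returns only the first such block and leaves out the terminating blank line.

```ruby
# Hypothetical helper: returns the lines from the first one containing
# "Instructions" up to (but not including) the next blank line.
def instruction_block(text)
  text.lines.map(&:chomp)
      .drop_while { |line| line !~ /Instructions/ } # skip everything before the marker
      .take_while { |line| !line.strip.empty? }     # stop at the first blank line
end

# Illustrative input, standing in for the content of one text node.
sample = "lorem\nipsum\nInstructions\n- Browse stackoverflow\n- Answer questions\n\nMore\n"
puts instruction_block(sample)
```

When the word never appears, drop_while discards every line and the result is an empty array, so callers can check for empty? just as in the code above.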

score 0 · Accepted Answer

You could start by testing whether the document matches at all.

if File.read('docname.html') =~ /Instructions/
  # Parse to remove the instructions.
end

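I'd then recommend using Hpricot to extract the part you want. How hard that is depends on how the html is structured. If you need more specific help, please post more details about the structure.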
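
Since the question mentions many documents, the cheap text check can be run over all of them before any parsing. This is only a sketch: the helper name files_with_instructions and the pages/*.html glob are assumptions, not something from the question.

```ruby
# Hypothetical helper: given file paths, return those whose raw text
# mentions "Instructions", without building a DOM for any of them.
def files_with_instructions(paths)
  paths.select { |path| File.read(path).include?('Instructions') }
end

# Usage sketch; the 'pages/*.html' layout is assumed for illustration.
candidates = files_with_instructions(Dir.glob('pages/*.html'))
puts candidates
```

A plain substring test also fires on a match inside the title or an attribute, so it only narrows the set of files worth handing to a real parser.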

score 0 · Accepted Answer

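This isn't the most "correct" way, but it will work in most cases. Use a regular expression to search the string: ruby regex

The regex you want is something like /instructions([^<]+)/. This assumes the instructions text ends with a < character.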
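
A minimal sketch of that regex in use. The helper name instructions_text and the sample markup are made up, and the pattern is written with a capital I to match the word as it appears in the question (adding the i flag would make it case-insensitive).

```ruby
# Hypothetical helper: returns the text that follows "Instructions" up to
# the next "<", or nil when no such text exists.
def instructions_text(html)
  match = html.match(/Instructions([^<]+)/)
  match && match[1].strip
end

# Illustrative input only.
html = "<p>lorem</p><p>Instructions\n- Browse stackoverflow\n- Answer questions\n</p>"
puts instructions_text(html)
```

Because [^<]+ needs at least one character before the next tag, an occurrence immediately followed by a tag (such as one inside a title element) is skipped, and the search moves on to the next one.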
