ruby - Finding all elements before an h2 element with hpricot/nokogiri

Question

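I'm trying to parse a Wiktionary entry to retrieve all of the English definitions. I can retrieve all of the definitions, but the problem is that some of them are in other languages. What I'd like to do is somehow retrieve only the HTML block that contains the English definitions. I've found that, when there are entries for other languages, I can get the header that follows the English definitions like this: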

header = (doc/"h2")[3]

So I'd like to search only the elements that come before this header element. I thought this might be possible with header.preceding_siblings(), but that doesn't seem to work. Any suggestions?

score 2 · Accepted Answer

You can make use of the visitor pattern with Nokogiri. This code removes everything from the h2 of the other language's definition onward.

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

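# URI.open (from open-uri) is needed on Ruby 3+, where Kernel#open no longer accepts URLs.
# The HTML parser is used because the page is not guaranteed to be well-formed XML.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.

doc.root.accept(Visitor.new(node))  # Removes all page contents starting from node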

score 1 · Accepted Answer

The following code uses Hpricot.
It gets the text from the English header (h2) up to the next header (h2), or up to the footer if there are no other languages.

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"

score 1 · Accepted Answer

For Nokogiri:

doc = Nokogiri::HTML(code)
stop_node = doc.css('h2')[3]
doc.traverse do |node|
  break if node == stop_node
  # else, do whatever, e.g. `puts node.name`
end

This iterates over all of the nodes that come before the stop_node assigned on the second line.
