ruby - Nokogiri text node contents

Question

Is there a clean way to get the contents of a text node with Nokogiri? Right now I'm using

some_node.at_xpath( "//whatever" ).first.content

That seems really verbose for just getting the text.

score 15 · Accepted Answer

Do you want only the text?

doc.search('//text()').map(&:text)

You probably don't want all the whitespace and noise, though. If you want only the text nodes that contain a word character:

doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}

Edit: It looks like you only wanted the text content of a single node:

some_node.at_xpath( "//whatever" ).text

score 9 · Accepted Answer

Just look for text nodes:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>This is a text node </p>
<p> This is another text node</p>
</body>
</html>
EOT

doc.search('//text()').each do |t|
  t.replace(t.content.strip)
end

puts doc.to_html

Which outputs:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>This is a text node</p>
<p>This is another text node</p>
</body></html>

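By the way, your code example doesn't work. at_xpath( "//whatever" ).first is redundant and will fail. at_xpath finds only the first occurrence and returns a Node. first would be superfluous at that point even if it worked, but it won't work, because Node doesn't have a first method.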

I have <data><foo>bar</foo></bar>, how do I get at the "bar" text without doing doc.xpath_at( "//data/foo" ).children.first.content?

Assuming doc contains the parsed DOM:

doc.to_xml # => "<?xml version=\"1.0\"?>\n<data>\n  <foo>bar</foo>\n</data>\n"

Get the first occurrence:

doc.at('foo').text       # => "bar"
doc.at('//foo').text     # => "bar"
doc.at('/data/foo').text # => "bar"

Get all occurrences and take the first one:

doc.search('foo').first.text      # => "bar"
doc.search('//foo').first.text    # => "bar"
doc.search('data foo').first.text # => "bar"
