html - Downloading HTML text with Ruby

Question

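I'm trying to build a histogram of the letters (a, b, c, and so on) on a given web page. I plan to use a hash for the histogram itself. However, I'm having a bit of trouble actually fetching the HTML.

My current code: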

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

# String#each was removed in Ruby 1.9; each_line iterates line by line.
page_content.each_line do |line|
    puts line
end

This does a fine job of fetching the HTML. However, it fetches everything. For www.stackoverflow.com, it returns:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that this was the right page, I don't want the HTML tags. I'm only trying to get "Object Moved" and "This document may be found here".

Is there a reasonably easy way to do this?

score 2 · Accepted Answer

When you require 'open-uri', you don't need to redefine open with Net::HTTP.

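require 'open-uri'

# On Ruby 2.5 and later, call URI.open; as of Ruby 3.0, Kernel#open no longer opens URLs.
page_content = URI.open('http://www.stackoverflow.com').read

# Default every count to 0 so each character can simply be incremented.
histogram = Hash.new(0)
page_content.each_char do |c|
  histogram[c] += 1
end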

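Note: this does not strip the <tags> in the HTML document, so <html><body>x!</body></html> will give { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use something like Nokogiri (which you said is not available), or some kind of regular expression (such as the one in Dru's answer).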
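
To make the difference concrete, here is a minimal sketch that strips tags before counting. The helper name text_histogram is made up for illustration, and the regex is only a rough approximation of tag stripping, not a real HTML parser.

```ruby
# Hypothetical helper: remove anything that looks like a tag, then count
# each remaining character.
def text_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, "")   # simple tag-stripping regex (an approximation)
  histogram = Hash.new(0)              # missing keys start at 0
  text.each_char { |c| histogram[c] += 1 }
  histogram
end

p text_histogram("<html><body>x!</body></html>")
```

For the sample document in the note, only 'x' and '!' are left to count, each once.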

score 1 · Accepted Answer

See the "Following Redirection" section of the Net::HTTP documentation.
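
As a rough sketch of what following a redirect can look like with Net::HTTP: the method name fetch_body and the default limit of 10 are assumptions chosen for illustration, not something taken from the question. The idea is to re-request the Location header until a success response arrives or the limit runs out.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the body of the page, following up to
# `limit` redirects.
def fetch_body(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(uri_str, response['location']).to_s
    fetch_body(next_url, limit - 1)
  else
    response.value # raises an exception for non-2xx responses
  end
end

# Example call (needs network access):
# page_content = fetch_body('http://www.stackoverflow.com')
```

With something like this, the "Object Moved" page from the question would be followed to the page it points at instead of being returned as the content.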

score 1 · Accepted Answer

Stripping HTML tags without Nokogiri:

puts page_content.gsub(/<\/?[^>]*>/, "")
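
One caveat with a tag-only regex is that the contents of script and style elements are left behind as text. A possible workaround, sketched below under the assumption that the markup is reasonably well formed, is to drop those blocks first. The name strip_html is hypothetical, and this is still not a substitute for a real parser.

```ruby
# Hypothetical helper: remove script/style blocks (including their contents),
# then remove the remaining tags.
def strip_html(html)
  html
    .gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, "") # whole script/style blocks
    .gsub(/<\/?[^>]*>/, "")                          # any remaining tags
end

puts strip_html("<style>p{}</style><p>Hi</p><script>var a = 1;</script>")
```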

http://codesnippets.joyent.com/posts/show/615
