ruby - How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?

Question

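I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

Retrieves an HTML page (via net/http)
Creates a Nokogiri document from response.body
Extracts some information and sends it back in the response. The response is supposed to be UTF-8 encoded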

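The problem appeared when I tried to read sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The problem is that the result of the encoding conversion is not correct, even though no errors are thrown.

With net/http, response.body.encoding gives ASCII-8BIT, which cannot be converted to UTF-8.

If I do Nokogiri::HTML(response.body) and use CSS selectors to get certain content from the page (the content of the title tag, for example), I get a string that returns WINDOWS-1256 when I call string.encoding. I call string.encode("utf-8") on it and send the response using the result, but the response is still not correct.

Any suggestions or ideas about what is wrong with my approach?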

score 27 · Accepted Answer

This happens because Net::HTTP does not handle encodings correctly. See http://bugs.ruby-lang.org/issues/2567

Instead of parsing the whole response.body, you can parse response['content-type'], which contains the charset.

Then use force_encoding() to set the right encoding.

For example, response.body.force_encoding("UTF-8") if the site is served in UTF-8.
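A minimal sketch of the steps above, assuming the server declares a charset in its Content-Type header. The helper names charset_from_content_type and body_as_utf8 are hypothetical and not part of Net::HTTP; content_type stands in for response['content-type'] and raw_body for response.body, so the example runs without a network request.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value,
# e.g. "text/html; charset=windows-1256". Returns nil when none is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*["']?([^\s;"']+)/i)
  match && match[1]
end

# Hypothetical helper: label the raw bytes with their real encoding, then
# transcode them. force_encoding only changes the label on the string;
# encode is what actually converts the bytes to UTF-8.
def body_as_utf8(raw_body, charset)
  raw_body.dup.force_encoding(charset).encode("UTF-8")
end

# Simulated response: an Arabic word as windows-1256 bytes. Array#pack
# returns an ASCII-8BIT string, like the body Net::HTTP hands back.
raw_body = [0xE3, 0xD1, 0xCD, 0xC8, 0xC7].pack("C*")
content_type = "text/html; charset=windows-1256"

charset = charset_from_content_type(content_type)
utf8 = body_as_utf8(raw_body, charset)
puts utf8.encoding
```

Calling encode("UTF-8") directly on the ASCII-8BIT body would raise a conversion error for these bytes, which is why the label has to be set first.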

score 3 · Accepted Answer

I found that the following code works for me:

def document
  if @document.nil? && response
    @document = if document_encoding
                  Nokogiri::HTML(response.body.force_encoding(document_encoding).encode('utf-8'),nil, 'utf-8')
                else
                  Nokogiri::HTML(response.body)
                end
  end
  @document
end

def document_encoding
  return @document_encoding if @document_encoding
  response.type_params.each_pair do |k,v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    #document.css("meta[http-equiv=Content-Type]").each do |n|
    #  attr = n.get_attribute("content")
    #  @document_encoding = attr.slice(/charset=[a-z0-9\-_]+/i).split("=")[1].upcase if attr
    #end
    @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["'](.*)["']/i && $1 =~ /charset=(.+)/i && $1.upcase
  end
  @document_encoding
end
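The markup fallback can also be tried in isolation. The following is a simplified sketch rather than the exact pattern used above: charset_from_meta is a hypothetical helper, and it assumes the charset name consists only of ASCII letters, digits, hyphens and underscores. It also accepts the shorter <meta charset="..."> form.

```ruby
# Hypothetical helper: find a charset declared in the markup itself, either
# <meta http-equiv="Content-Type" content="...; charset=X"> or <meta charset="X">.
# The pattern is pure ASCII, so it can scan the raw, still-unlabelled bytes.
def charset_from_meta(raw_html)
  match = raw_html.match(/<meta[^>]*charset\s*=\s*["']?([a-z0-9_\-]+)/i)
  match && match[1].upcase
end

# Sample markup tagged ASCII-8BIT, the way a Net::HTTP body arrives.
raw_html = '<html><head><meta http-equiv="Content-Type" ' \
           'content="text/html; charset=windows-1256"></head></html>'
raw_html = raw_html.dup.force_encoding("ASCII-8BIT")

detected = charset_from_meta(raw_html)
puts detected.inspect
```

Sniffing the markup is only a fallback for servers that send no charset in the Content-Type header, which matches the order used in document_encoding.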
