xml - httr - Parse XML as a document rather than as text, while specifying the encoding

Question

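I am trying to scrape a UTF-8 encoded website using the httr package, but apparently the package's content function only lets you specify the encoding when you parse the website as text. Unfortunately, I cannot parse it as text because I want to run xpath queries on it afterwards. Here is an example: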

library(XML)
library(httr)

page <- GET("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm")
test <- content(page, as = "parsed")
# Get a list of names, many of which contain non-standard characters
xpathSApply(test, "//img", xmlGetAttr, "alt") 

# This gives the correct encoding, but outputs a character vector, 
# on which I cannot use xpath queries
test <- content(page, as = "text", encoding = "utf-8")

Update:

# htmlParse returns a parsed document, but the non-standard characters are 
# not properly encoded, i.e. the result is the same whether or not I specify the 
# "encoding" argument
test <- htmlParse(page, encoding = "UTF-8")

# Non-standard characters in names still not properly encoded
xpathSApply(test, "//img", xmlGetAttr, "alt")

score 0 · Accepted Answer

Try:

 test <- htmlParse("http://ec.europa.eu/archives/commission_2004-2009/index_en.htm")
 res <- xpathSApply(test, "//img", xmlGetAttr, "alt")
 tail(res)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
 #[6] "français"

Using your code (the first and second versions):

 tail(res1)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
 #[6] "franÃ§ais"  

 tail(res2)
 #[1] "Slovakian"   "PDF"         "PDF"         "PDF"         "PDF - 66 KB"
 #[6] "franÃ§ais"
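
If you still want to fetch the page with httr, one workaround that is often suggested is to take the body as text with an explicit encoding and pass that string to htmlParse with asText = TRUE. The following is only a minimal sketch, assuming the page really is UTF-8. The inline HTML string is an invented stand-in for content(page, as = "text", encoding = "UTF-8") so that the example runs without network access:

```r
library(XML)

# Hypothetical stand-in for: html <- content(page, as = "text", encoding = "UTF-8")
# (\u00e7 is the character "ç"; the markup is invented for illustration)
html <- "<html><body><img src='a.gif' alt='fran\u00e7ais'/><img src='b.gif' alt='PDF'/></body></html>"

# Parse the string itself (not a file or URL) and state which encoding it uses
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# xpath queries work on the parsed document as usual
alts <- xpathSApply(doc, "//img", xmlGetAttr, "alt")
print(alts)
```

With the real response, the equivalent call would be test <- htmlParse(content(page, as = "text", encoding = "UTF-8"), asText = TRUE, encoding = "UTF-8"). I have not checked this against the page in the question, so treat it as something to try rather than a confirmed fix.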
