r - Scraping a complex HTML table into a data.frame in R

Question

I'm trying to load the Wikipedia data on US Supreme Court justices into R:

library(rvest)

html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])

[1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"             
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
[5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

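The problem is that the data is malformed. Instead of the name as it appears in the rendered HTML table ("James Wilson"), each name shows up twice: once as "Lastname, Firstname" and then again as "Firstname Lastname".

The reason is that each cell actually contains an invisible <span>: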

<td style="text-align:left;" class="">
    <span style="display:none" class="">Wilson, James</span>
    <a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
</td>

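The same is true of the columns containing numeric data. I'm guessing this extra markup is needed to make the HTML table sortable. However, I don't know how to remove these spans when creating a data.frame from the table in R.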

score 9 · Accepted Answer

Maybe like this:

library(XML)
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson"       "Jay, JohnJohn Jay†"             
# [3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."     
# [5] "Rutledge, JohnJohn Rutledge"     "Iredell, JamesJames Iredell"

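# Drop the hidden sort-key <span> from the second cell of every table row,
# then re-read the table from the modified document.
removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))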
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson"    "John Jay†"       "William Cushing" "John Blair, Jr." "John Rutledge"   "James Iredell"
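
The code above appears to depend on an older rvest whose parsed documents were XML-package objects. On rvest releases built on xml2, the same idea can probably be expressed with `xml_find_all()` and `xml_remove()`. What follows is a minimal sketch, not the answerer's code: it parses a small made-up table (`sample_html`, an assumption that mirrors the cell markup shown in the question) instead of the live page, and it uses `read_html()` in place of the older `html()`.

```r
library(rvest)
library(xml2)

# Hypothetical two-row table that imitates the cell structure from the
# question; it is NOT the live Wikipedia page.
sample_html <- '
<table>
  <tr><th>#</th><th>Judge</th></tr>
  <tr><td>1</td><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td>2</td><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

page <- read_html(sample_html)

# Locate every hidden sort-key span inside a table cell and remove it
# from the parsed document (this modifies `page` in place).
hidden <- xml_find_all(page, "//td/span[contains(@style, 'display:none')]")
xml_remove(hidden)

# With the spans gone, html_table() only sees the visible link text.
judges <- html_table(html_nodes(page, "table")[[1]])
judges$Judge
```

Because the XPath matches on the `display:none` style rather than on a column index, this sketch would also strip hidden sort keys from other columns, such as the numeric ones mentioned in the question. Whether the live page still uses that exact inline style is an assumption worth checking first.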

score 4 · Accepted Answer

You can use rvest:

library(rvest)

html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")%>%   
  html_nodes("span+ a") %>% 
  html_text()

It's not perfect, so you may want to refine the CSS selector, but it gets you fairly close.
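
One possible refinement is to scope the selector to table cells, so that links following a span elsewhere on the page are not picked up. The sketch below illustrates the difference on a made-up snippet (`sample_html` and its "Unrelated link" paragraph are assumptions, not content from the real page). It uses `read_html()`, which newer rvest versions provide in place of `html()`.

```r
library(rvest)

# Hypothetical markup: one unrelated span + link outside the table,
# plus two cells shaped like the ones in the question.
sample_html <- '
<p><span>See also</span><a href="/wiki/Other">Unrelated link</a></p>
<table>
  <tr><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

page <- read_html(sample_html)

# Unscoped selector: any <a> that directly follows a <span> sibling.
loose <- page %>% html_nodes("span + a") %>% html_text()

# Scoped selector: only such links whose span is a direct child of a <td>.
scoped <- page %>% html_nodes("td > span + a") %>% html_text()

loose
scoped
```

Here `loose` also contains the unrelated link, while `scoped` keeps only the names inside table cells. Note that this approach returns a character vector of names rather than a full data.frame.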
