r - Extracting a table specified by XPath

Question

I want to extract a table from the web page http://en.wikipedia.org/wiki/Brazil_national_football_team

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)
xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

The XPath is "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]".

Neither

xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

nor

xmltable <- xpathApply(xmltext, "//table[//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

returns the specified table. How should the XPath expression be written?
Please see the attached image.

score 1 · Accepted Answer

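You need to use .. in XPath to get a parent element: //table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../.. Each .. steps up one level, so starting from the a element the three steps go to th, then tr, then table.

To get a usable table, you can use XML::readHTMLTable: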

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)

## grep correct table
tableNode <- xpathApply(xmltext, "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..")[[1]]

## convert XMLNode into data.frame
concacafTable <- readHTMLTable(tableNode, header=FALSE, stringsAsFactors=FALSE)

## format table: remove the useless "Gold Cup" header (row 1) and use row 2 as the header
colnames(concacafTable) <- concacafTable[2, ]
concacafTable <- concacafTable[-c(1,2),]
concacafTable
#   Year       Round GP W D L GF GA
#3  1996  Runners-up  4 3 0 1 10  3
#4  1998 Third Place  5 2 2 1  6  2
#5  2003  Runners-up  5 3 0 2  6  4
#6 Total        3/11 14 8 2 4 22  9
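The parent-step idea can be tried offline on a small hand-written page. The sketch below is only an illustration: the HTML string, the id attributes and the variable names are made up for the example and are not taken from the Wikipedia page. The ancestor::table[1] variant is an alternative that is not part of the answer above; it avoids counting levels by hand.

```r
library(XML)

# Hypothetical miniature page with two tables; only the second one
# contains the link we are looking for. The ids are invented.
html <- '<html><body>
<table class="wikitable" id="other">
<tr><th><a title="FIFA World Cup">World Cup</a></th></tr>
<tr><td>1958</td></tr>
</table>
<table class="wikitable" id="gold">
<tr><th colspan="2"><a title="CONCACAF Gold Cup">Gold Cup</a></th></tr>
<tr><th>Year</th><th>Round</th></tr>
<tr><td>1996</td><td>Runners-up</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Three parent steps: a -> th -> tr -> table
viaParent <- getNodeSet(
  doc,
  "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../.."
)

# Alternative: nearest enclosing table, whatever the nesting depth
viaAncestor <- getNodeSet(
  doc,
  "//a[@title='CONCACAF Gold Cup']/ancestor::table[1]"
)

print(xmlName(viaParent[[1]]))
print(xmlGetAttr(viaParent[[1]], "id"))
print(xmlGetAttr(viaAncestor[[1]], "id"))
```

Both expressions should select the same table node here. The fixed chain of .. steps depends on the exact nesting of the page, so it needs adjusting if the markup changes.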

score 0 · Accepted Answer

While parsing this page I also found two things worth noting:

1. tbody is not recognized

tableNode <- xpathApply(xmltext, "//tbody")

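The page appears to contain many tbody elements, but none of them is found by the parser. A likely reason is that browsers insert tbody when they build the DOM, while the HTML source that htmlParse reads does not contain it, so any XPath that goes through tbody matches nothing.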
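A quick way to check this without touching the network is to parse a tiny table written without tbody and count the matches. This is a sketch under the assumption that the page source, like the made-up snippet here, has tr elements directly inside table.

```r
library(XML)

# Hypothetical table source without any tbody tag
html <- '<table><tr><th>Year</th></tr><tr><td>1996</td></tr></table>'
doc <- htmlParse(html, asText = TRUE)

# Count tbody elements and rows that sit directly under the table
nTbody <- length(getNodeSet(doc, "//tbody"))
nRows  <- length(getNodeSet(doc, "//table/tr"))

print(nTbody)
print(nRows)
```

If the tbody count is zero while the rows are found directly under table, the expression should be written against table/tr rather than against what the browser inspector shows.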

2. Get the table directly, without going through the parent element

The following can work too:

tableNode <- xpathApply(xmltext, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")
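To see why a predicate on table works while the expression from the question does not, the two can be compared on a small made-up page. The HTML string, the id attributes and the variable names below are assumptions for illustration only. The last expression, with a descendant search inside the predicate, is an extra variant that does not depend on whether tbody is present.

```r
library(XML)

# Hypothetical miniature page; ids are invented for the example
html <- '<html><body>
<table class="wikitable" id="other">
<tr><th><a title="FIFA World Cup">World Cup</a></th></tr>
</table>
<table class="wikitable" id="gold">
<tr><th><a title="CONCACAF Gold Cup">Gold Cup</a></th></tr>
<tr><td>1996</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Predicate with an explicit child path: table/tr/th/a
direct <- getNodeSet(
  doc,
  "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]"
)

# Expression from the question: requires a tbody that the source lacks
viaTbody <- getNodeSet(
  doc,
  "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]"
)

# Looser predicate: any descendant link with the wanted title
anyDepth <- getNodeSet(doc, "//table[.//a[@title='CONCACAF Gold Cup']]")

print(length(direct))
print(length(viaTbody))
print(xmlGetAttr(anyDepth[[1]], "id"))
```

The predicate form returns the table node itself, so no .. steps are needed afterwards.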
