xml - Scraping hierarchical data

Question

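I am trying to get a list of department stores by continent/country from the global list of department stores. I run the following code to get the continents first. It turns out that the XML hierarchy is flat: the countries that belong to a continent are not child nodes of that continent's node.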

> url<-"http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc = htmlTreeParse(url, useInternalNodes = T)
> nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']")
> # For Africa
> xmlChildren(nodeNames[[1]])
$a
<a href="/wiki/Africa" title="Africa">Africa</a> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"        
> xmlSize(nodeNames[[1]])
[1] 1

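I know I can get the countries with another getNodeSet command, but I wanted to make sure I am not missing something. Is there a smarter way to get all the data within each continent, and then within each country, in one go?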

score 1 · Accepted Answer

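Using XPath, you can combine several paths with the | separator. I use that to get the countries and the shops in the same list. Then I get a second list containing only the countries, and I use this second list to split the first one.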

url<-"http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
library(XML)
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

## Here I use the combined xpath 
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li|
                                   //*[@id="mw-content-text"]/h3',xmlValue)
cont.shops <- do.call(rbind,cont.shops)                 ## from list to one-column character matrix


head(cont.shops)                  ## first element is country followed by shops
     [,1]              
[1,] "[edit] Â Tunisia"
[2,] "Magasin Général" 
[3,] "Mercure Market"  
[4,] "Promogro"        
[5,] "Geant"           
[6,] "Carrefour"       
## I get all the countries in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3',xmlValue)
contries <- do.call(rbind,contries)                     ## from list to one-column character matrix

head(contries)
     [,1]                   
[1,] "[edit] Â Tunisia"     
[2,] "[edit] Â Morocco"     
[3,] "[edit] Â Ghana"       
[4,] "[edit] Â Kenya"       
[5,] "[edit] Â Nigeria"     
[6,] "[edit] Â South Africa"
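
The ordering that makes this work can be checked on a small inline document. The snippet below is only a sketch: the HTML string and the id "content" are made up for illustration and are not the real Wikipedia markup. It assumes the XML package is installed and, as the output above suggests, that the union of two paths comes back in document order, so each heading is followed by its own list items.

```r
library(XML)

# Hypothetical miniature of the page layout: headings and lists are
# siblings under one container, not parent and child.
html <- '<html><body><div id="content">
  <h3>Tunisia</h3>
  <ul><li>Monoprix</li><li>Carrefour</li></ul>
  <h3>Morocco</h3>
  <ul><li>Alpha 55</li></ul>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Union of two paths: list items and headings in a single query.
mixed <- xpathSApply(doc,
                     '//*[@id="content"]/ul/li | //*[@id="content"]/h3',
                     xmlValue)

# Headings only, used later to find the group boundaries.
heads <- xpathSApply(doc, '//*[@id="content"]/h3', xmlValue)

print(mixed)
print(heads)
```

Although the li path is written first in the union, the headings still appear before their items in the result, which is what the splitting step relies on.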

Now I do some processing to split cont.shops by country.

dd <- which(cont.shops %in% contries)                   ## indices of the country headings
freq <- c(diff(dd),length(cont.shops)-tail(dd,1)+1)     ## group sizes: gaps between headings, plus the last group
contries.f <- rep(contries,freq)                        ## one country label per element of cont.shops


ll <- split(cont.shops,contries.f)
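
To see what each intermediate value holds, here is the same arithmetic on a toy vector. The names items, headers, idx, size, grp and groups are stand-ins introduced for this sketch, and the values are made up; only base R is needed.

```r
# Toy stand-ins for cont.shops and contries (made-up values).
items   <- c("Tunisia", "Monoprix", "Carrefour", "Morocco", "Alpha 55")
headers <- c("Tunisia", "Morocco")

# Positions of the headings inside the mixed vector.
idx <- which(items %in% headers)

# Size of each group: distance to the next heading, and for the last
# heading the distance to the end of the vector (inclusive).
size <- c(diff(idx), length(items) - tail(idx, 1) + 1)

# Repeat each heading once per element of its group, then split.
grp    <- rep(headers, size)
groups <- split(items, grp)

print(idx)
print(size)
print(groups)
```

Note that this assumes the first element of the mixed vector is a heading. If list items appear before the first heading, the sizes no longer add up to the length of the vector and split() will recycle the labels.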

You can check the result:

> ll[[contries[1]]]
[1] "[edit] Â Tunisia" "Magasin Général"  "Mercure Market"   "Promogro"         "Geant"           
[6] "Carrefour"        "Monoprix"        
> ll[[contries[2]]]
[1] "[edit] Â Morocco"                                                         
[2] "Alpha 55, one 6-story store in Casablanca"                                
[3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"
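
Each group above still starts with its own heading, including the "[edit]" prefix. As a possible variation (not the method used in this answer), a running counter built with cumsum() can replace the diff/rep step and drop the headings from the groups at the same time. The sketch below uses made-up, already clean strings and assumes every heading starts with "[edit]"; real data would also need the stray encoding characters removed.

```r
# Made-up mixed vector: headings carry an "[edit]" prefix, shops do not.
items <- c("[edit] Tunisia", "Monoprix", "Carrefour",
           "[edit] Morocco", "Alpha 55")

# Flag the headings, then number the groups with a running sum.
is_head <- grepl("^\\[edit\\]", items)
grp_id  <- cumsum(is_head)

# Clean heading labels, in order of appearance.
labels <- trimws(sub("^\\[edit\\]", "", items[is_head]))

# Keep only the shops and label each with its country; fixing the factor
# levels preserves page order instead of alphabetical order.
shops <- split(items[!is_head],
               factor(labels[grp_id[!is_head]], levels = labels))

print(shops)
```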
