r - R XPath: text after a sibling

Question

I want to select content from this HTML page:

doc <- htmlParse("http://eusoils.jrc.ec.europa.eu/ESDB_Archive/ESDBv3/legend/sg_attr.htm")

However, I am having trouble with special characters (namely the > and < signs), and the node sets I get back have different lengths. See here:

legs <- getNodeSet(doc, "//a")
leg_names <- sapply(legs, xmlGetAttr, "name")
leg_descr <- xpathSApply(doc, "//strong", xmlValue)

# not the same length??
cbind(leg_names, leg_descr)

# different length??
getNodeSet(doc, '//text()[following-sibling::a]')

and

# why is this not working?
getNodeSet(doc, '//a[@name="AGLIM1"]/text()[following-sibling::strong')

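In the end, I want to put every legend (the text after an a tag with a specific name) into a table with two columns: the first holds the value/symbol, the second the label.

For WRB-FULL it would look like this: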

     Value                  Label
        AB            Albeluvisol
      ABal       Alic Albeluvisol
      ABap   Abruptic Albeluvisol
      ABar     Arenic Albeluvisol
      ABau     Alumic Albeluvisol
     ABeun Endoeutric Albeluvisol
       ...        ...         ...

score 0 · Accepted Answer

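The document's formatting is inconsistent: some <a> elements are not followed by a <strong> element, so there are more of the former than of the latter. Binding the two vectors side by side therefore pairs names with the wrong descriptions: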

cbind( head(leg_names,8), head(leg_descr,8) )
#      [,1]            [,2]
# [1,] "AGLIM1"        "AGLIM1: Code of the most important limitation to agricultural use of the STU"                          
# [2,] "AGLIM2"        "AGLIM2: Code of a secondary limitation to agricultural use of the STU"                                 
# [3,] "BORDER_SOIL1M" "FAO85-FULL: Full Soil Code 1974 FAO"                                                                   
# [4,] "SOIL1M"        "FAO85-LEV1: Soil major group code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend" 
# [5,] "CFL"           "FAO85-LEV2: Second level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
# [6,] "CL"            "FAO85-LEV3: Third level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend" 
# [7,] "COUNTRY"       "FAO90-FULL:Full soil code of the STU from the 1990 FAO-UNESCO Soil Legend"                             
# [8,] "FAO85FU"       "FAO90-LEV1: Soil major group code of the STU from the 1990 FAO-UNESCO Soil Legend"

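The following-sibling approach looks more promising, but because some <a> elements are not immediately followed by a <strong> element, you may end up with the description of a different element.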

getNodeSet(doc, '//a[@name="AGLIM1"]/following-sibling::strong/text()')[[1]]
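One way to guard against picking up another element's description is to accept a <strong> only when it is the very first element sibling after the <a>. The sketch below uses a small made-up HTML fragment (not the real page, whose exact markup is assumed here) in which the hypothetical NOLABEL anchor has no <strong> directly after it:

```r
library(XML)

# Hypothetical miniature of the page; the real markup may differ.
# The second <a> is followed by <em>, not <strong>.
html <- '<html><body>
<p><a name="AGLIM1"></a><strong>AGLIM1: first description</strong></p>
<p><a name="NOLABEL"></a><em>no label here</em><a name="AGLIM2"></a><strong>AGLIM2: second description</strong></p>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

anchors <- getNodeSet(doc, "//a[@name]")
leg_names <- sapply(anchors, xmlGetAttr, "name")

# For each anchor, take the first following element sibling,
# and keep it only if it is a <strong>; otherwise record NA.
leg_descr <- sapply(anchors, function(a) {
  hit <- getNodeSet(a, "following-sibling::*[1][self::strong]")
  if (length(hit) == 0) NA_character_ else xmlValue(hit[[1]])
})

legend <- data.frame(name = leg_names, description = leg_descr,
                     stringsAsFactors = FALSE)
```

Because every anchor yields exactly one entry (a description or NA), the two columns always have the same length, and rows with NA can be dropped afterwards if they are not needed.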

Another approach is to forget about the markup and treat the file as plain text.

raw_data <- readLines("http://eusoils.jrc.ec.europa.eu/ESDB_Archive/ESDBv3/legend/sg_attr.htm")
library(stringr)
matches <- str_extract(raw_data, '<a .*<strong>.*')
matches <- matches[ ! is.na(matches) ]
result <- str_match(matches, '<a name="(.*?)".*<strong>(.*)</strong>')[,-1]
head(result)
     [,1]       [,2]                                                                                                    
[1,] "AGLIM1"   "AGLIM1: Code of the most important limitation to agricultural use of the STU"                          
[2,] "AGLIM2"   "AGLIM2: Code of a secondary limitation to agricultural use of the STU"                                 
[3,] "FAO85FU"  "FAO85-FULL: Full Soil Code 1974 FAO"                                                                   
[4,] "FAO85LV1" "FAO85-LEV1: Soil major group code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend" 
[5,] "FAO85LV2" "FAO85-LEV2: Second level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
[6,] "FAO85LV3" "FAO85-LEV3: Third level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
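If a data frame is more convenient than a character matrix, the two columns can be converted and the description split at its first colon. This is only a sketch in base R: the two sample rows are copied from the output above, and the column names (name, attribute, description) are made up for illustration.

```r
# Two sample rows copied from the head(result) output above
result <- rbind(
  c("AGLIM1",  "AGLIM1: Code of the most important limitation to agricultural use of the STU"),
  c("FAO85FU", "FAO85-FULL: Full Soil Code 1974 FAO")
)

legend <- data.frame(
  name        = result[, 1],
  # everything before the first colon
  attribute   = sub(":.*$", "", result[, 2]),
  # everything after the first colon, with surrounding whitespace removed
  description = trimws(sub("^[^:]*:", "", result[, 2])),
  stringsAsFactors = FALSE
)
```

Trimming after the split also handles entries such as "FAO90-FULL:Full soil code ..." where the page omits the space after the colon.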
