xml - R: Get links from a URL using the rvest package instead of the XML package

Question

I am using the XML package to get the links from this URL.

library(XML)

# Parse HTML URL
v1WebParse <- htmlParse(v1URL)
# Read the links and get the quotes of the companies from the href
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

This method is very efficient, but I would like to use rvest instead of XML. I tried html_nodes and html_attrs, but I could not get them to work.

score 4 · Accepted Answer

I know you are looking for an rvest answer, but here is another way using the XML package that might be more efficient than what you are doing.

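Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get the href links. Because it is a handler function, it collects the values as the document is read, which saves memory and improves efficiency.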

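library(XML)

links <- function(URL) 
{
    # Closure that accumulates href values while the document is parsed
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                # Called for every <a> node; xmlGetAttr() gives NULL if href is absent
                links <<- c(links, xmlGetAttr(node, "href"))
                node
             },
             links = function() links)
        }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}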

links("http://www.bvl.com.pe/includes/empresas_todas.dat")
#  [1] "/inf_corporativa71050_JAIME1CP1A.html"
#  [2] "/inf_corporativa10400_INTEGRC1.html"  
#  [3] "/inf_corporativa66100_ACESEGC1.html"  
#  [4] "/inf_corporativa71300_ADCOMEC1.html"  
#  [5] "/inf_corporativa10250_HABITAC1.html"  
#  [6] "/inf_corporativa77900_PARAMOC1.html"  
#  [7] "/inf_corporativa77935_PUCALAC1.html"  
#  [8] "/inf_corporativa77600_LAREDOC1.html"  
#  [9] "/inf_corporativa21000_AIBC1.html"     
#  ...
#  ...
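
The handler idea can also be tried without a network connection. The following is only a minimal sketch under stated assumptions: the inline page and the name collect_links are hypothetical and not part of the answer above, and asText = TRUE is passed so that the string is treated as HTML content rather than as a file name or URL.

```r
library(XML)

# Hypothetical inline page, used only so the sketch runs offline.
page <- '<html><body>
  <a href="/first.html">First</a>
  <a name="anchor-without-href">No link</a>
  <a href="/second.html">Second</a>
</body></html>'

# Hypothetical helper: same closure pattern as getLinks() in the answer.
collect_links <- function() {
  found <- character()
  list(
    # Called by the parser for every <a> node as it is read.
    a = function(node, ...) {
      # xmlGetAttr() returns NULL when href is absent, so c() adds nothing.
      found <<- c(found, xmlGetAttr(node, "href"))
      node
    },
    links = function() found
  )
}

h <- collect_links()
invisible(htmlTreeParse(page, asText = TRUE, handlers = h))
result <- h$links()
print(result)
```

The anchor without an href is skipped because appending NULL with c() leaves the vector unchanged, so no separate filtering step is needed in this sketch.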

score 2 · Accepted Answer

# Option 1
library(XML) # getHTMLLinks() is provided by the XML package
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat')

# Option 2
library(rvest)
library(pipeR) # %>>% will be faster than %>%
# read_html() replaces the deprecated html()
read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>%
  html_nodes("a") %>>%
  html_attr("href")
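
One possible reason the html_attrs attempt in the question did not work, assuming the goal was a plain vector of href values, is that html_attrs returns every attribute of every node as a list, while html_attr extracts a single named attribute. A minimal offline sketch of the difference follows; the inline page is hypothetical and stands in for the real URL.

```r
library(rvest)

# Hypothetical inline page, used only so the sketch runs offline.
page <- read_html('<html><body>
  <a href="/first.html" class="x">First</a>
  <a href="/second.html">Second</a>
</body></html>')

anchors <- html_nodes(page, "a")

# html_attr() extracts one named attribute:
# a character vector with one entry per node.
hrefs <- html_attr(anchors, "href")

# html_attrs() returns all attributes of each node:
# a list of named character vectors.
all_attrs <- html_attrs(anchors)

print(hrefs)
str(all_attrs)
```

If the list form is what you have, sapply(all_attrs, `[`, "href") would be one way to pull the href values back out, but html_attr is the more direct call.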

score 0 · Accepted Answer

Richard's answer works for HTTP pages, but not for the HTTPS page I needed (Wikipedia). I fetch the page with RCurl's getURL function first and then parse the result, as shown below.

library(RCurl)
library(XML)

links <- function(URL) 
{
  getLinks <- function() {
    links <- character()
    list(a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links)
  }
  h1 <- getLinks()
  # Download the page (getURL handles HTTPS), then parse the returned text
  xData <- getURL(URL)
  htmlTreeParse(xData, asText = TRUE, handlers = h1)
  h1$links()
}
