r - R: Trying to webscrape with unlist(xpathSApply( )) returns NULL

Question

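I am working with the following website: http://www.crowdrise.com/skollsechallenge

Specifically, this page lists 57 crowdfunding campaigns. Each campaign has text describing why it wants to raise money, the total amount raised so far, and its team members. Some campaigns also state a fundraising goal. I want to write R code that scrapes and organizes this information from each of the 57 campaign pages.

To build a table containing all of this information for each of the 57 campaigns, I first wrote code that extracts the name of each of the 57 campaigns.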

  #import packages
  library("RCurl")
  library("XML")
  library("stringr")

  url <- "http://www.crowdrise.com/skollSEchallenge"
  url.data <- readLines(url) 
  #the resulting url.data is a character vector with one element per line
  #remove carriage returns, tabs, and newlines
  url.data <- gsub('\r','', gsub('\t','', gsub('\n','', url.data)))  
  index.list <- grep("username:",url.data)
  #index.list is an integer vector holding the indexes of the url.data lines
  #that contain the name of each of the 57 campaigns
  length.index.list<-length(index.list)
  length.index.list
  vec <-vector ()

  #store the 57 usernames in one vector
    for(i in 1:length.index.list){
      username<-url.data[index.list[i]]
      real.username <- gsub("username:","",username)
      vec[i] <- c(real.username)
    }
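
For reference, the same extraction can be written without an explicit loop. This is only a minimal sketch: the sample lines below are hypothetical stand-ins for what `readLines(url)` returns, since the real page markup is not shown here, and `end.names` is an illustrative name.

```r
# Hypothetical lines standing in for the output of readLines(url)
url.data <- c("<script>",
              "\tusername: 'TeamAlpha',",
              "\tusername: 'TeamBeta',",
              "</script>")

# Strip carriage returns, newlines, and tabs in a single pass
url.data <- gsub("[\r\n\t]", "", url.data)

# value = TRUE returns the matching lines themselves rather than their indexes
matches <- grep("username:", url.data, value = TRUE, fixed = TRUE)

# sub() is vectorized, so the label can be removed from every line at once
vec <- sub("username:", "", matches, fixed = TRUE)

# Drop quotes, commas, and spaces, as the second loop does one gsub() at a time
end.names <- gsub("[', ]", "", vec)
print(end.names)
```

Because `grep()`, `sub()`, and `gsub()` all operate on whole character vectors, the result has one entry per matching line without any indexing.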

Next, I tried to write a loop that makes R visit each of the 57 campaign web pages and scrape it.

 # Extract all necessary paragraphs. Unlist flattens the list to 
 #create a character vector.

    data <- list()  # container filled by data[[i]] below
    for(i in 1:length(vec)){
    end.name<-gsub('\'','',vec[i])
    end.name<-gsub(',','',end.name)
    end.name<-gsub(' ','',end.name)
    user.address<-paste(c("http://www.crowdrise.com/skollSEchallenge/",
    end.name),collapse='') 
    user.url<-getURL(user.address)

    html <- htmlTreeParse(user.url, useInternalNodes = TRUE)
    website.donor<-unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
    website.title<-unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
    website.story<-unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
    website.fund<-unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))

    #(NOTE: doc.text<- readHTMLTable(webpage1) doesn't work 
    #due to the poor html structure of the website)
    # Replace all \n by spaces, and eliminate all \t
    website.donor <- gsub('\\n', ' ', website.donor)
    website.donor <- gsub('\\t','',website.donor)
    website.title <- gsub('\\n', ' ', website.title)
    website.title <- gsub('\\t','',website.title)
    website.story <- gsub('\\n', ' ', website.story)
    website.story <- gsub('\\t','',website.story)
    website.fund <- gsub('\\n', ' ', website.fund)
    website.fund <- gsub('\\t','',website.fund)

    ## all those tabs and spaces are just white spaces that we can trim
    website.title <- str_trim(website.title)
    website.fund   <- str_trim(website.fund)
    website.data<- cbind(website.title, website.story, website.fund, website.donor)
    data[[i]]<- website.data
    Sys.sleep(1)
   }
  data <- data.frame(do.call(rbind,data), stringsAsFactors=F)

The commands

   unlist(xpathSApply(html,'//div[@class="grid1-4 "]//h4', xmlValue))
   unlist(xpathSApply(html,'//div[@class="project_info"]',xmlValue))
   unlist(xpathSApply(html,'//div[@id="thestory"]',xmlValue))
   unlist(xpathSApply(html,'//div[@class="clearfix"]',xmlValue))

are returning NULL, and I don't understand why.

Why are they NULL, and how can I fix this?

Thank you,
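
One way to narrow this down is to reproduce the symptom offline. The sketch below is an assumption-laden illustration, not a diagnosis of the actual site: the HTML fragment is made up, and it only shows that an XPath test such as `@class="grid1-4 "` must match the whole attribute string exactly, so any difference in the real markup yields an empty result, which `unlist()` then turns into NULL.

```r
library(XML)

# Hypothetical fragment standing in for a campaign page (not the real markup)
page <- '<html><body>
  <div class="grid1-4 extra"><h4>Donor A</h4></div>
  <div id="thestory">Our story</div>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# Exact comparison: the attribute is "grid1-4 extra", not "grid1-4 ",
# so no node matches and the result is empty
exact <- xpathSApply(doc, '//div[@class="grid1-4 "]//h4', xmlValue)
print(length(exact))
print(unlist(exact))

# contains() matches on a substring of the class attribute instead
loose <- xpathSApply(doc, '//div[contains(@class, "grid1-4")]//h4', xmlValue)
print(loose)
```

If all four queries come back empty on the real pages, it may also be worth printing `nchar(user.url)` before parsing, to check that `getURL()` actually returned the page content.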
