xml - XPath on each document of an R Corpus

Question

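I have a corpus x in R, created from a directory using DirSource. Each document is a text file containing the full HTML of the relevant vBulletin forum web page. Since each page is a thread, every document contains multiple individual posts that I want to capture with XPath. The XPath seems to work, but I can't put all the captured nodes back into a corpus.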

If my corpus has 25 documents with an average of 4 posts each, the new corpus should have 100 documents. I suspect I need to loop and build a new corpus.

Here is my messy work so far. The source of any thread on www.vbulletin.org/forum/ is an example of the structure.

#for stepping through
xt <- x[[5]]
xpath <- "//div[contains(@id,'post_message')]"

getxpath <- function(xt,xpath){
  require(XML)

  #either parse
  doc <- htmlParse(file=xt)
  #doc <- htmlTreeParse(tolower(xt), asText = TRUE, useInternalNodes = TRUE)

  #don't know which to use
  #result <- xpathApply(doc,xpath,xmlValue)
  result <- xpathSApply(doc,xpath,xmlValue)

  #clean up
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=result))

  result <- c(result[1:length(result)])

  free(doc)

  #converts group of nodes into 1 data frame with numbers before separate posts
  #require(plyr)
  #xbythread <- ldply(.data=result,.fun=function(x){unlist(x)})

  #don't know what needs to be returned
  result <- Corpus(VectorSource(result))
  #result <- as.PlainTextDocument(result)

  return(result)
}

#call
x2 <- tm_map(x=x,FUN=getxpath,"//div[contains(@id,'post_message')]")

score 1 · Accepted Answer

I figured it out a while ago. htmlParse needs isURL=TRUE.

getxpath <- function(xt,xpath){
  require(XML);require(tm)
  #xt is the URL of the page to parse (the original referenced an undefined variable u)
  x <- htmlParse(file=xt,isURL=TRUE)
  resultvector <- xpathSApply(x,xpath,xmlValue)
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=resultvector))
  return(result)
}

res <- getxpath("http://url.com/board.html","//xpath")

To get all the files, use list.files to get the file list, Map / clusterMap with getxpath() to put the results for each file into a list, do.call to flatten that list into a vector, and Corpus(VectorSource(res)) to put them into a corpus.
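
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name extract_posts, the temporary directory threads_demo and the two tiny sample HTML files are all made up for the example, and the helper parses local files with htmlParse(file=path) rather than fetching URLs.

```r
library(XML)
library(tm)

# Hypothetical helper (not from the original answer): returns the
# whitespace-normalised text of every node matching `xpath` in one file.
extract_posts <- function(path, xpath) {
  doc <- htmlParse(file = path)
  on.exit(free(doc))                      # release the parsed document on exit
  posts <- xpathSApply(doc, xpath, xmlValue)
  gsub("\\s+", " ", as.character(posts))  # collapse newlines, tabs, repeated spaces
}

# Made-up sample data standing in for saved forum threads.
thread_dir <- file.path(tempdir(), "threads_demo")
dir.create(thread_dir, showWarnings = FALSE)
writeLines(
  "<html><body><div id='post_message_1'>First post</div><div id='post_message_2'>Second\n post</div></body></html>",
  file.path(thread_dir, "thread1.html")
)
writeLines(
  "<html><body><div id='post_message_3'>Third post</div></body></html>",
  file.path(thread_dir, "thread2.html")
)

# 1. list the files, 2. extract posts per file, 3. flatten, 4. build the corpus.
files <- list.files(thread_dir, pattern = "\\.html$", full.names = TRUE)
xpath <- "//div[contains(@id,'post_message')]"

posts_by_file <- Map(extract_posts, files, MoreArgs = list(xpath = xpath))
all_posts <- do.call(c, unname(posts_by_file))   # one character vector, one post per element
posts_corpus <- Corpus(VectorSource(all_posts))  # one corpus document per post
```

Because each post becomes its own element of all_posts before the corpus is built, the resulting corpus has one document per post instead of one per thread, which is what the question asked for.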
