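I am working through the topicmodels tutorial in R. Around page 12, the tutorial removes HTML tags and Greek letters from the abstracts, but the function it uses fails for me: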

R> library("XML")
R> remove_HTML_markup <- function(s) {
+ doc <- htmlTreeParse(s, asText = TRUE, trim = FALSE)
+ xmlValue(xmlRoot(doc))
+ }
R> remove_HTML_markup(JSS_papers[1,"description"])
Error: XML content does not seem to be XML, nor to identify a file name ...

JSS_papers stores the metadata for a collection of papers downloaded from a journal. The entry under the description tag is the article's abstract. This one contains no tags:

JSS_papers[1,"description"] = "The fit of a variogram model to spatially-distributed 
    data is often difficult to assess. A graphical diagnostic written in S-plus is   
    introduced that allows the user to determine both the general quality of the fit of a 
    variogram model, and to find specific pairs of locations that do not have measurements 
    that are consonant with the fitted variogram. It can help identify nonstationarity,    
    outliers, and poor variogram fit in general. Simulated data sets and a set of soil      
    nitrogen concentration data are examined using this graphical diagnostic."
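
Since the error appears on an abstract that has no markup at all, one possible workaround is to catch the parse error and return the string unchanged. The following is only a minimal sketch under that assumption. The name remove_HTML_markup_safe is hypothetical and is not part of the tutorial; only htmlTreeParse, xmlRoot, and xmlValue from the XML package and base R's tryCatch are used.

```r
library("XML")

# Hypothetical wrapper (not from the tutorial): same parsing as
# remove_HTML_markup, but if the parser raises an error the input
# string is returned as-is instead of stopping.
remove_HTML_markup_safe <- function(s) {
  tryCatch({
    doc <- htmlTreeParse(s, asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
  }, error = function(e) s)  # fall back to the original text
}

# Example calls: one string with markup, one without.
with_tags <- remove_HTML_markup_safe("<p>Hello <b>world</b></p>")
without_tags <- remove_HTML_markup_safe("plain text without tags")
```

This only hides the failure for tag-free strings; it does not explain why the parser rejects them.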