r - Text mining PDF files / word frequency issues

Question

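I am trying to mine the PDF of an article that has rich PDF encodings and graphs. When I mine some PDF documents, the highest-frequency words turn out to be phi, taeoe, toe, sigma, gamma, and so on. It works well with some PDF documents, but with others I get these random Greek letters. Is this a character encoding problem? (By the way, all the documents are in English.) Any suggestions?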

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
 pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                              language = "en",
                                              id = "id1")
 content(pdf)[1:4]
 }


docs<- Corpus(URISource(uri, mode = ""),
    readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]

score 0 · Accepted Answer

I think ghostscript is causing all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words you mentioned:

library(tm)
library(SnowballC)  # stemming backend used by stemDocument()

uri <- c("2012.pdf")
# Read the PDF with the xpdf tools (pdfinfo/pdftotext) instead of ghostscript
pdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))(elem = list(uri = uri),
                                               language = "en",
                                               id = "id1")
docs <- Corpus(VectorSource(content(pdf)))
docs <- tm_map(docs, removeNumbers)
# Wrap base functions in content_transformer() so the corpus structure is preserved
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
# Total count of each term across all documents
freq <- colSums(as.matrix(dtm))
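
To check whether the odd terms are really gone, it can help to sort the frequency vector and look at the top entries. The following is a minimal sketch on a toy corpus; the `toy_*` names and the sample sentences are made up for illustration and stand in for the text extracted from the PDF.

```r
library(tm)

# Toy documents standing in for the extracted PDF text (hypothetical data).
toy_text <- c("software testing improves software quality",
              "testing software requires test cases")
toy_docs <- VCorpus(VectorSource(toy_text))
toy_dtm  <- DocumentTermMatrix(toy_docs)

# Sum each term's count over all documents, then sort in decreasing order.
toy_freq <- sort(colSums(as.matrix(toy_dtm)), decreasing = TRUE)
head(toy_freq, 5)

# Look for the suspicious terms reported in the question.
suspicious <- intersect(names(toy_freq), c("phi", "sigma", "gamma"))
suspicious
```

Running the same two steps on the real `freq` vector should show quickly whether the Greek-letter names still dominate the ranking.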

You can visualize the most frequent words in your PDF file with a word cloud.

library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))

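Obviously, this result is not perfect, mostly because word stemming hardly ever gives a 100% reliable result (for example, we still have "issue" and "issues" as separate words, as well as "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.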
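
To see which surface forms the stemmer actually merges, one option is to stem a handful of words directly and group them by the resulting stem. This is only a sketch: the word list is taken from the examples above, and the variable names are chosen for illustration.

```r
library(SnowballC)

# Surface forms taken from the examples in the text.
words <- c("issue", "issues", "method", "methods")

# wordStem() applies the Snowball stemmer for the given language.
stems <- wordStem(words, language = "english")

# Group the original words by stem: forms that share a stem were merged.
split(words, stems)
```

If forms that share a stem here still show up separately in the word cloud, the cause may lie elsewhere in the pipeline rather than in the stemmer itself.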
