r - Twitter data analysis - Error in Term Document Matrix

Question

I have been trying to analyse Twitter data. I downloaded the tweets and created a corpus from the tweet text using the following:

# Creating a Corpus
wim_corpus = Corpus(VectorSource(wimbledon_text))

When I try to create a TermDocumentMatrix as shown below, I get an error and several warnings.

tdm = TermDocumentMatrix(wim_corpus, 
                       control = list(removePunctuation = TRUE, 
                                      stopwords =  TRUE, 
                                      removeNumbers = TRUE, tolower = TRUE)) 

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),    : 'i, j, v' different lengths


In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
 all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In TermDocumentMatrix.VCorpus(corpus) : invalid document identifiers
4: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
NAs introduced by coercion

Can anyone point out what this error indicates? Could it be related to the tm package?

The tm library has been loaded. I am using R version 3.0.1 and RStudio 0.97.

score 11 · Accepted Answer

I had the same problem, and it turned out to be a package compatibility issue. Try installing

install.packages("SnowballC")

and loading it

library(SnowballC)

before calling DocumentTermMatrix.

That solved my problem.
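A quick way to check whether the packages are present before building the matrix is sketched below. It only reports what is missing and does not install anything; the variable names `pkgs`, `installed` and `missing_pkgs` are made up for this illustration.

```r
# Packages this thread relies on; adjust the list to your own setup.
pkgs <- c("tm", "SnowballC")

# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

If anything is reported as missing, install it with install.packages() and load it with library() before calling TermDocumentMatrix.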

score 7 · Accepted Answer

I think the error is caused by some "exotic" characters in the tweet messages that the tm function cannot handle. I got the same error when using tweets as a corpus source. Maybe the following workaround helps:

# Reading some tweet messages (here from a text file) into a vector

rawTweets <- readLines(con = "target_7_sample.txt", ok = TRUE, warn = FALSE, encoding = "utf-8")

# Convert the tweet text explicitly into utf-8

convTweets <- iconv(rawTweets, to = "utf-8")

# The above conversion leaves you with vector entries "NA", i.e. those tweets that can't be handled. Remove the "NA" entries with the following command:

tweets <- (convTweets[!is.na(convTweets)])

If the deletion of some tweets is not an issue for your solution (e.g. build a word cloud) then this approach may work, and you can proceed by calling the Corpus function of the tm package.
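To see the effect without a tweet file, here is a minimal sketch in base R. The two sample strings are invented for illustration, and `from = "UTF-8"` is an assumption about the input encoding (the answer above leaves `from` at its default, the current locale).

```r
# Build one clean string and one containing a byte (0xff) that is never valid UTF-8.
good <- "good tweet"
bad  <- rawToChar(c(charToRaw("bad"), as.raw(0xff), charToRaw("tweet")))
rawTweets <- c(good, bad)

# Entries that cannot be converted come back as NA.
convTweets <- iconv(rawTweets, from = "UTF-8", to = "UTF-8")

# Keep only the tweets that converted cleanly.
tweets <- convTweets[!is.na(convTweets)]

length(rawTweets)  # 2
length(tweets)     # 1
```

Comparing the two lengths shows how many tweets were dropped, which is worth checking before relying on the result.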

Regards--Albert

score 6 · Accepted Answer

I found a way to solve this problem in an article about tm.

An example that produces the error:

getwd()
require(tm)

# Importing files
files <- DirSource(directory = "texts/",encoding ="latin1" )

# loading files and creating a Corpus
corpus <- VCorpus(x=files)

# Summary

summary(corpus)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)

Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This error occurs because Term Document Matrix needs an object of class Vector Source, but the preceding transformations convert the corpus texts to plain character, changing the class to one the function does not accept.

However, if you add one more command before calling TermDocumentMatrix, you can carry on.

Here is the code with the new command included:

getwd()
require(tm)  

files <- DirSource(directory = "texts/",encoding ="latin1" )

# loading files and creating a Corpus
corpus <- VCorpus(x=files)

# Summary 
summary(corpus)
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)

# COMMAND TO CHANGE THE CLASS AND AVOID THIS ERROR
corpus <- Corpus(VectorSource(corpus))
matrix_terms <- DocumentTermMatrix(corpus)

With that, you should not run into any further problems.
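As a side note, and not something the article above describes: in tm 0.6 and later, a related class problem can appear when tm_map is given a plain function such as tolower, because the documents come back as bare character strings. Wrapping the function in content_transformer keeps the document class intact. The sketch below assumes a recent tm version and uses two invented sample sentences.

```r
library(tm)

# Two made-up documents standing in for tweet text.
docs <- c("Federer wins at Wimbledon!", "Murray is the 2013 champion")
corpus <- VCorpus(VectorSource(docs))

# content_transformer() wraps a plain function so tm_map keeps the document class.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

Transformations that ship with tm, such as removePunctuation and stripWhitespace, can be passed to tm_map directly.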

score 3 · Accepted Answer

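As Albert suggested, converting the text encoding to "utf-8" solved the problem for me. However, instead of dropping every tweet that contains a problematic character, you can use the sub option of iconv to remove only the "bad" characters within a tweet and keep the rest.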

tweets <- iconv(rawTweets, to = "utf-8", sub="")

This no longer produces any NAs, so no further filtering step is needed.
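A small base R sketch of the difference, using an invented string with one invalid byte. `from = "UTF-8"` is an assumption about the input encoding; the call above leaves it at the default.

```r
# One string containing a byte (0xff) that is never valid UTF-8.
rawTweets <- c("good tweet",
               rawToChar(c(charToRaw("bad"), as.raw(0xff), charToRaw("tweet"))))

# Without sub, the whole entry becomes NA.
dropped <- iconv(rawTweets, from = "UTF-8", to = "UTF-8")

# With sub = "", only the offending byte is removed and the tweet is kept.
cleaned <- iconv(rawTweets, from = "UTF-8", to = "UTF-8", sub = "")

anyNA(dropped)  # TRUE
anyNA(cleaned)  # FALSE
cleaned[2]      # "badtweet"
```

Note that sub = "" silently alters the text, so words can end up joined together where the removed bytes used to be.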

score 0 · Accepted Answer

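My data contained some German umlaut characters and a few special fonts that were causing the error. Even after converting to utf-8 I could not remove them in R (I am a new R user), so I used Excel to remove the German characters, and after that the error went away.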
