r - Implementing N-grams on a corpus: quanteda error

Question

I am trying to use quanteda on my corpus in R, but I get the following error:

Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE,  : 
  duplicate row.names: character(0)

I don't have much experience with this. The dataset can be downloaded here: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0

Here is my code:

tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)

quanteda.corpus <- corpus(corpus)

score 1 · Accepted Answer

The processing you are doing with tm prepares an object for tm, and quanteda does not know what to do with it. quanteda performs all of these steps itself, as you can see from the options listed in help("dfm").

If you try the following, you should be able to move forward:

dfm(tweets$Tweet, verbose = TRUE, toLower= TRUE, removeNumbers = TRUE, removePunct = TRUE,removeTwitter = TRUE, language = "english",ignoredFeatures=stopwords("english"), stem=TRUE)

Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 6,943 documents
   ... indexing features: 15,164 feature types
   ... removed 161 features, from 174 supplied (glob) feature types
   ... stemming features (English), trimmed 2175 feature variants
   ... created a 6943 x 12828 sparse dfm
   ... complete.
Elapsed time: 0.756 seconds.

HTH

score 1 · Accepted Answer

There is no need to start with the tm package, or even to use read.csv() at all. This is what the quanteda companion package readtext is for.

So to read in the data, you can pass the object created by readtext::readtext() straight to the corpus constructor:

library(quanteda)
library(readtext)
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
## 
## Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1    19     21         1         2               0.7579
## text2    18     20         2         2               0.8775
## text3    23     24         1        -1               0.6805
## text5    17     19         2         0               1.0000
## text4    18     19         1        -1               0.8820
## 
## Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
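
If readtext is not installed, the same idea can be sketched with a plain data frame, since corpus() also accepts a data.frame with a text_field argument. The two rows below are made-up stand-ins for the CSV (hypothetical data, not taken from the real file); only the column names Tweet and Sentiment follow the summary output above.

```r
library(quanteda)

# Hypothetical stand-in for read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
tweets <- data.frame(
  Tweet = c("Self-driving cars are driving the future!",
            "I would not trust a driverless car yet."),
  Sentiment = c(2, -1),
  stringsAsFactors = FALSE
)

# text_field names the column holding the text;
# the remaining columns are kept as document variables (docvars)
dfCorpus <- corpus(tweets, text_field = "Tweet")

ndoc(dfCorpus)      # number of documents
docvars(dfCorpus)   # data frame holding the Sentiment column
```

Either way, the point is that the raw text goes to quanteda directly, without passing through tm first.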

From there, you can perform all of the preprocessing steps directly in the dfm() call, including the choice of ngrams:

# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete. 
## Elapsed time: 0.662 seconds.

# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete. 
## Elapsed time: 1.419 seconds.
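
The calls above use the dfm() arguments that existed when this answer was written. In more recent quanteda releases (version 3 and later, as far as I am aware), stem, remove, and ngrams are no longer handled inside dfm(), and the same pipeline is written with the tokens_*() functions. A minimal sketch under that assumption, using two made-up texts in place of myCorpus:

```r
library(quanteda)

# Hypothetical stand-in texts; in practice pass myCorpus or tweets$Tweet
txt <- c(d1 = "Self-driving cars are driving the future!",
         d2 = "I would not trust a driverless car yet.")

# Tokenize, then lowercase, drop stopwords, and stem, one step at a time
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")

# just unigrams
dfm_uni <- dfm(toks)

# just bigrams (adjacent tokens are joined with "_" by default)
dfm_bi <- dfm(tokens_ngrams(toks, n = 2))

featnames(dfm_bi)
```

Note that stopwords are removed before the bigrams are formed here, so the bigrams span the removed words; reorder the steps if that is not what you want.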

