r - Include ID numbers in dfm() output

Question

I have a dataset containing an ID number column and a text column, and I am running a LIWC analysis on the text data using the quanteda package. Here is an example of my data setup:

mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

I was able to run the LIWC analysis using:

scores <- dfm(as.character(mydata$text), dictionary = liwc)

However, when I view the results with View(scores), the output does not carry over the original ID numbers (19, 101, 43, 12). Instead, there is a row.names column containing non-descriptive identifiers (e.g. "text1", "text2").

How can I include the ID numbers in the output of the dfm() function? Thank you!

score 1 · Accepted Answer

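It sounds like you want the row names of the dfm object to be the ID numbers from mydata$id. This happens automatically if you declare those IDs as the docnames for the texts. The easiest way to do this is to create a quanteda corpus object from your data.frame.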

The corpus() call below assigns the docnames from the id variable. Note: the "Text" column in the summary() output looks numeric, but it actually holds the document name of each text.

require(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
# Text Types Tokens Sentences
#   19    11     11         1
#  101    13     14         1
#   43    12     12         1
#   12    12     14         1
# 
# Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:
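
As a variation on the call above, the corpus can also be built from the data.frame itself rather than from its text column. This is a minimal sketch, assuming a recent quanteda release in which corpus() accepts a data.frame together with the docid_field and text_field arguments; the names docs and myCorpus2 are introduced here for illustration only. The IDs are converted to character first so that they are unambiguously treated as document names.

```r
library(quanteda)

mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# Hypothetical helper copy: store the IDs as character so they act as names
docs <- transform(mydata, id = as.character(id))

# docid_field names the column with document IDs, text_field the text column
myCorpus2 <- corpus(docs, docid_field = "id", text_field = "text")

# The document names are now the original IDs instead of "text1", "text2", ...
print(docnames(myCorpus2))
```

Either route should end with the same document names; which one to use is mostly a matter of whether other columns of the data.frame need to travel along as document variables.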

From there, the document names automatically become the row labels of the dfm. (You can add the dictionary = argument for your LIWC application.)

myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
#      features
# docs  no wonder then that ever gathering
#   19   1      1    1    1    1         1
#   101  0      0    0    2    0         0
#   43   0      0    0    0    0         0
#   12   0      0    0    0    0         0
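
The answer above dates from 2015. In later quanteda releases (version 3 onward, as far as I know), dfm() expects a tokens object rather than a corpus, and dictionaries are applied with dfm_lookup() instead of a dictionary = argument. The following is a sketch under those assumptions. Because the LIWC dictionary is proprietary, toy_dict is a made-up stand-in with two invented categories, not the real LIWC.

```r
library(quanteda)

ids <- c(19, 101, 43, 12)
txt <- c("No wonder, then, that ever gathering volume from the mere transit ",
         "So that in many cases such a panic did he finally strike, that few ",
         "But there were still other and more vital practical influences at work",
         "Not even at the present day has the original prestige of the Sperm Whale")

# Declare the IDs as document names when building the corpus
myCorpus <- corpus(txt, docnames = as.character(ids))

# Tokenize first; newer quanteda versions build a dfm from tokens
toks <- tokens(myCorpus, remove_punct = TRUE)

# Hypothetical stand-in for the LIWC dictionary (categories are invented)
toy_dict <- dictionary(list(
  negation = c("no", "not"),
  whale    = c("whale*", "sperm")
))

# Count dictionary matches per document; rows are labelled by the IDs
scores <- dfm_lookup(dfm(toks), dictionary = toy_dict)
print(scores)

# Optional: a plain data.frame with the document ID as a regular column
scores_df <- convert(scores, to = "data.frame")
print(scores_df)
```

Converting to a data.frame at the end is one way to get the IDs into a column that can be merged back onto mydata, which may be closer to what the question asks for than row names alone.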
