r - Cleaning up text with the tm package in R

Question

I am trying to clean up a text corpus with the tm package in R, but I keep getting this error:

no applicable method for 'removePunctuation' applied to an object of class "data.frame"

My data consists of chat logs read from a text file, and in R it looks like this:

     V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.

I use:

tdm <- TermDocumentMatrix(text,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

But I get this error:

Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

It seems I am not supposed to feed a data frame to the function, but what should I pass instead?

Thanks

score 1 · Accepted Answer

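As @Martin Bel pointed out, you can also do this with qdap version 1.1.0. I have added a bit of support to qdap for better compatibility with the tm package, including a tdm function that works nicely here.

First, read in your data (I added a colon):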

library(qdap)
dat <- read.transcript(text="ID    V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.", header=TRUE, sep="   ")

# To create the term-document matrix:

tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))

# To do the same with the tm package:

TermDocumentMatrix(Corpus(VectorSource(dat[, 1])),
    control = list(
        removePunctuation = TRUE,
        stopwords = TRUE
    )
)

score 1 · Accepted Answer

You are very close. The simplest approach is to use DataframeSource to create a corpus object and then build the term-document matrix from that. Using your example:

Let's enter the data...

Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.

df <- data.frame(V1 = Text, stringsAsFactors = FALSE)

Next, convert the data frame to a term-document matrix...

require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

Then inspect the output...

inspect(tdm)
   A term-document matrix (14 terms, 4 documents)

Non-/sparse entries: 15/41
Sparsity           : 73%
Maximal term length: 11 
Weighting          : term frequency (tf)

             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
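
Newer tm releases changed what DataframeSource accepts: as far as I know, from tm 0.7-2 onward it expects the first two columns to be named doc_id and text, so a single-column data frame like df may be rejected. A minimal sketch under that assumption (the doc_id values here are made up for illustration, and VCorpus is used in place of Corpus):

```r
library(tm)

# The same four chat lines as in the example above.
Text <- c("In the process",
          "Sorry I had to step away for a moment.",
          "I am getting an error page that says QB is currently unavailable.",
          "That link gives me the same error message.")

# Assumption: tm >= 0.7-2, where DataframeSource() wants a 'doc_id' column
# followed by a 'text' column. The ids are illustrative only.
df <- data.frame(doc_id = as.character(seq_along(Text)),
                 text = Text,
                 stringsAsFactors = FALSE)

# Build the corpus from the data frame, then the term-document matrix.
mycorpus <- VCorpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# One column per chat line.
nDocs(tdm)
```

If your installed tm is older, the one-column version shown earlier should still be the one to use; packageVersion("tm") tells you which case applies.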

score -1 · Accepted Answer

Just extract the text from the data frame with text[,1], like this:

tdm <- TermDocumentMatrix(text[,1],
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
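
Before passing text[,1] along, it may help to check what it actually is. The sketch below uses base R only and simulates the file read with the text argument of read.table (the tab separator and the empty quote setting are assumptions about the log file, not something stated in the question). In R versions before 4.0 the column can come back as a factor, so as.character is a cheap safeguard:

```r
# Simulate reading the chat log: one line per row.
# Assumption: the real file contains no tab characters.
text <- read.table(text = "In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.",
                   sep = "\t", quote = "", stringsAsFactors = FALSE)

# The whole object is a data frame, which is what triggered the error.
class(text)

# The first column is a plain vector; coerce it to character to be safe.
chat <- as.character(text[, 1])
class(chat)
length(chat)
```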
