r - R text mining package: allowing new documents to be incorporated into an existing corpus

Question

I was wondering whether R's text mining package offers something like the following:

myCorpus <- Corpus(DirSource(<directory-containing-textfiles>),control=...)
# add docs
myCorpus.addDocs(DirSource(<new-dir>),control=...)

Ideally, I would like to incorporate additional documents into an existing corpus.

Any help is appreciated.

score 11 · Accepted Answer

You should be able to use c(,) for this, as follows:

> library(tm)
> data("acq")
> data("crude")
> together <- c(acq,crude)
> acq
A corpus with 50 text documents
> crude
A corpus with 20 text documents
> together
A corpus with 70 text documents

For details, see tm_combine in the tm package documentation.
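Applied to the directory-based setup in the question, the same idea might look like the sketch below. The directories and file contents are hypothetical placeholders created only for illustration, and VCorpus is used on the assumption that c() combines corpora of the same type, as in the acq/crude example.

```r
library(tm)

# Hypothetical directories with a few throwaway text files (illustration only)
old_dir <- tempfile("old_docs")
new_dir <- tempfile("new_docs")
dir.create(old_dir)
dir.create(new_dir)
writeLines("first existing document", file.path(old_dir, "a.txt"))
writeLines("second existing document", file.path(old_dir, "b.txt"))
writeLines("a newly arrived document", file.path(new_dir, "c.txt"))

# Build the existing corpus from the first directory
myCorpus <- VCorpus(DirSource(old_dir))

# Read the new documents into their own corpus, then append them with c()
newDocs <- VCorpus(DirSource(new_dir))
myCorpus <- c(myCorpus, newDocs)

length(myCorpus)
```

Since c() returns a new corpus rather than modifying one in place, the result has to be assigned back to myCorpus.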

score 0 · Accepted Answer

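I ran into this problem as well, in the context of a big-data text mining set. It was not possible to load the entire data set at once.

For such large data sets there is another option: collect a vector of one-document corpora inside a loop. Once all documents have been processed this way, the vector can be converted into one huge corpus, for example to create a DTM from it.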

# Vector to collect the corpora:
webCorpusCollection <- c()

# Loop over raw data:
for(i in ...) {

  try({      

    # Convert one document into a corpus:
    webDocument <- Corpus(VectorSource(iconv(webDocuments[i,1], "latin1", "UTF-8")))

    #
    # Do other things e.g. preprocessing...
    #

    # Store this document into the corpus vector:
    webCorpusCollection <- rbind(webCorpusCollection, webDocument)

  })
}

# Collecting done. Create one huge corpus:
webCorpus <- Corpus(VectorSource(unlist(webCorpusCollection[,"content"])))
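
As a possible variant of the loop above, the one-document corpora could be kept in a list and merged with c(), followed by the DTM step mentioned earlier. This is only a sketch: the rawDocuments vector is a hypothetical stand-in for the real data, and VCorpus is assumed so that c() applies as in the accepted answer.

```r
library(tm)

# Hypothetical raw data standing in for the real document source
rawDocuments <- c("oil prices rise again",
                  "company acquires rival firm",
                  "crude exports fall sharply")

# List to collect one-document corpora
corpusList <- vector("list", length(rawDocuments))

for (i in seq_along(rawDocuments)) {
  # Convert one document into a corpus; preprocessing could happen here
  corpusList[[i]] <- VCorpus(VectorSource(rawDocuments[i]))
}

# Merge all one-document corpora into a single corpus
webCorpus <- do.call(c, corpusList)

# Build the document-term matrix from the merged corpus
dtm <- DocumentTermMatrix(webCorpus)
```

Growing a list and merging once at the end avoids repeatedly copying an ever larger object inside the loop.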
