scala - Convert a Scala string to RDD[Seq[String]]

Question

 // 4 workers
  val sc = new SparkContext("local[4]", "naivebayes")

  // Load documents (one per line).
  val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)

  documents.zipWithIndex.foreach{
  case (e, i) =>
  val collectedResult = Tokenizer.tokenize(e.mkString)
  }

  val hashingTF = new HashingTF()
  //pass collectedResult instead of document
  val tf: RDD[Vector] = hashingTF.transform(documents)

  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)

In the code snippet above, I want to extract collectedResult and reuse it as the input to hashingTF.transform. How can I achieve this, given that the tokenize function has the following signature?

 def tokenize(content: String): Seq[String] = {
...
}

score 1 · Accepted Answer

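It looks like you want map rather than foreach. I don't understand what you are using zipWithIndex for, nor why you call split on your lines only to join them back together again with mkString.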

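val lines: RDD[String] = sc.textFile("/tmp/test.txt")
// map returns a new RDD; foreach returns Unit and discards its results
val tokenizedLines: RDD[Seq[String]] = lines.map(Tokenizer.tokenize)
// HashingTF is not a function, so call transform on the whole RDD
val hashes: RDD[Vector] = hashingTF.transform(tokenizedLines)
hashes.cache()
...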
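
For reference, here is a minimal end-to-end sketch of how the tokenized lines could feed the TF-IDF steps from the question. The body of Tokenizer.tokenize is elided in the question, so the whitespace-splitting version below is only a placeholder assumption, and the object name TfIdfSketch is made up for illustration. It assumes the spark-mllib RDD API (org.apache.spark.mllib) is on the classpath and that /tmp/test.txt exists.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Placeholder tokenizer: the real body is elided in the question,
// so this lowercase whitespace split is an assumption for illustration.
object Tokenizer {
  def tokenize(content: String): Seq[String] =
    content.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
}

// Hypothetical driver object, not part of the original post.
object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    // 4 local worker threads, as in the question
    val sc = new SparkContext("local[4]", "naivebayes")
    try {
      // One document per line; keep each line as a plain String
      val lines: RDD[String] = sc.textFile("/tmp/test.txt")

      // map (unlike foreach) returns a new RDD holding the tokenized documents
      val documents: RDD[Seq[String]] = lines.map(line => Tokenizer.tokenize(line))

      // Term frequencies via the hashing trick
      val hashingTF = new HashingTF()
      val tf: RDD[Vector] = hashingTF.transform(documents)

      // tf is used twice (fit, then transform), so cache it
      tf.cache()
      val idf = new IDF().fit(tf)
      val tfidf: RDD[Vector] = idf.transform(tf)

      // Print a few vectors to check the result
      tfidf.take(3).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The key difference from the question's code is that nothing is assigned inside a foreach block: a val declared there is local to the closure and the call itself returns Unit, whereas map produces the RDD[Seq[String]] that hashingTF.transform expects.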

