scala - Scala と Spark でのテキストの見出し語化の最も簡単な方法

Question

テキストファイルでレンマタイゼーションを使用したい:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

cables cables finally able hear gem long rumored music .
...

期待される出力は次のとおりです。

surprise heard thump open door small seed man clasp package wrap.

upgrade system found review spring 2008 issue mood audio back.

omg left gotta wrap review order asap . understand hand deliver dali lama

speak hand wear earplug live . listen maintain link long .

cable cable final able hear gem long rumor music .
...

誰でも私を助けることができますか？そして、Scala と Spark で実装されている見出し語化の最も簡単な方法を誰が知っていますか?

score 7 · Accepted Answer

本からの関数があります。Spark の高度な分析、レンマタイゼーションに関する章:

  val plainText =  sc.parallelize(List("Sentence to be precessed."))

  val stopWords = Set("stopWord")

  import edu.stanford.nlp.pipeline._
  import edu.stanford.nlp.ling.CoreAnnotations._
  import scala.collection.JavaConversions._

  def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))
  lemmatized.foreach(println)

これをマッパーのすべての行に使用します。

val lemmatized = plainText.map(plainTextToLemmas(_, stopWords))

編集：

コード行に追加しました

import scala.collection.JavaConversions._

そうしないと文は Java ではなく Scala List になるため、これが必要です。これで問題なくコンパイルできるはずです。

私は scala 2.10.4 と以下の stanford.nlp 依存関係を使用しました:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.2</version>
  <classifier>models</classifier>
</dependency>

stanford.nlp ページにも多くの例があります (Java で) http://nlp.stanford.edu/software/corenlp.shtmlを見ることができます。

編集：

マップパーティションのバージョン:

仕事が大幅にスピードアップするかどうかはわかりませんが。

  def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 2 && !stopWords.contains(lemma)) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }

  val lemmatized = plainText.mapPartitions(p => {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    p.map(q => plainTextToLemmas(q, stopWords, pipeline))
  })
  lemmatized.foreach(println)

scala - Scala と Spark でのテキストの見出し語化の最も簡単な方法

3 に答える 3

Related

Reference