java - How do I parse a language other than English with the Stanford Parser? In Java, not on the command line

Question

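I am trying to parse Chinese sentences with the Stanford Parser from a Java program. I am completely new to both Java and the Stanford Parser, so I practiced with 'ParserDemo.java'. The code works fine on English sentences and prints correct results. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, I ran into a problem.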

Here is my Java code:

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading and sentence-segment and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String sent[] = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}

This is basically the same as ParserDemo.java, but when I run it I get the following output:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT (IP (NP (PN 我)) (VP (VC 是) (NP (QP (CD 一名)) (NP (NN 学生))))))

Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)

The code on line 65 is:

 GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

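My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser handles Chinese correctly from the command line, there must be something wrong with my own code. I have searched and found a few similar cases that mention using the right model, but I really don't know how to change the code to use the "right model". I hope someone can help me. I am new to both Java and the Stanford Parser, so please be specific. Thank you!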

score 2 · Accepted Answer

The problem is that the GrammaticalStructureFactory is constructed from a PennTreebankLanguagePack, which is for the English Penn Treebank. You need to use (in two places)

TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();

and to import this appropriately

import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

We also generally recommend the factored parser for Chinese. Unlike for English, it works considerably better, although at the cost of more memory and time:

LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
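
Since the error comes from pairing a Chinese model with the English language pack, one way to guard against it is to keep the two choices in a single place. The following is only a minimal sketch under stated assumptions: `ParserSetup` and `forLanguage` are hypothetical names, not part of the Stanford Parser API, and the English model path `englishPCFG.ser.gz` is assumed as the English counterpart (it does not appear in the question). The sketch uses plain strings so that it compiles without the parser jar.

```java
import java.util.Locale;

// Hypothetical helper, NOT part of the Stanford Parser API: it only pairs
// each model path with the class name of the language pack that matches it.
final class ParserSetup {

    final String modelPath;          // value to pass to LexicalizedParser.loadModel(...)
    final String languagePackClass;  // fully qualified TreebankLanguagePack class name

    private ParserSetup(String modelPath, String languagePackClass) {
        this.modelPath = modelPath;
        this.languagePackClass = languagePackClass;
    }

    // Returns the model and language pack that belong together for a language.
    static ParserSetup forLanguage(String language) {
        switch (language.toLowerCase(Locale.ROOT)) {
            case "english":
                // Assumed English model path; not taken from the question.
                return new ParserSetup(
                    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                    "edu.stanford.nlp.trees.PennTreebankLanguagePack");
            case "chinese":
                // Factored model and language pack named in the answer above.
                return new ParserSetup(
                    "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",
                    "edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack");
            default:
                throw new IllegalArgumentException(
                    "No setup registered for language: " + language);
        }
    }

    public static void main(String[] args) {
        ParserSetup chinese = forLanguage("chinese");
        System.out.println(chinese.modelPath);
        System.out.println(chinese.languagePackClass);
    }
}
```

In the question's code, `modelPath` would go to `LexicalizedParser.loadModel(...)`, and both `demoDP` and `demoAPI` would construct the language pack named by `languagePackClass` (for Chinese, `new ChineseTreebankLanguagePack()`) instead of hard-coding `PennTreebankLanguagePack`.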
