stanford-nlp - Is it possible to get the set of specific named entity tokens that make up a phrase?

Question

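I'm running some text through the Stanford CoreNLP parser, and it contains date phrases such as "the second Monday in October" and "the past year". The library correctly tags each token as a DATE named entity, but is there a way to get the whole date phrase programmatically? It isn't just dates, either: ORGANIZATION named entities behave the same way (for example, "The International Olympic Committee" could be identified in a given text).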

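import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

String content = "Thanksgiving, or Thanksgiving Day (Canadian French: Jour de"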
        + " l'Action de grâce), occurring on the second Monday in October, is"
        + " an annual Canadian holiday which celebrates the harvest and other"
        + " blessings of the past year.";

Properties p = new Properties();
p.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(p);

Annotation document = new Annotation(content);
pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        String word = token.get(CoreAnnotations.TextAnnotation.class);
        String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

        // Constant-first comparison avoids a NullPointerException if a token has no NE tag.
        if ("DATE".equals(ne)) {
            System.out.println("DATE: " + word);
        }

    }
}

After the Stanford annotators and classifiers have loaded, this produces the following output:

DATE: Thanksgiving
DATE: Thanksgiving
DATE: the
DATE: second
DATE: Monday
DATE: in
DATE: October
DATE: the
DATE: past
DATE: year

I feel like the library must be recognizing the phrases and using them for the named entity tagging, so the question is: is that data kept and made available somehow through the API?

Thanks, Kevin

score 1 · Accepted Answer

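After discussing this on the mailing list, I found that the API does not support it. My solution was to keep track of the last NE tag seen and build up a string as needed. John B. from the nlp mailing list was helpful in answering my question.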
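
The approach described above might look roughly like the following. This is only a minimal sketch, not the code actually used: the class `EntityPhraseGrouper`, its `EntityPhrase` type, and the `group` method are hypothetical names made up for illustration. It takes plain lists of words and NE tags so that it runs without CoreNLP on the classpath; in the question's loop, those values would come from `TextAnnotation` and `NamedEntityTagAnnotation` on each token. It assumes that tokens outside any entity carry the tag "O".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntityPhraseGrouper {

    /** A contiguous run of tokens that share one named-entity tag. */
    public static final class EntityPhrase {
        public final String tag;
        public final String text;

        public EntityPhrase(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        @Override
        public String toString() {
            return tag + ": " + text;
        }
    }

    /** Merges consecutive tokens with the same non-"O" tag into one phrase. */
    public static List<EntityPhrase> group(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<EntityPhrase> phrases = new ArrayList<>();
        String lastTag = "O";                       // state carried between tokens
        StringBuilder current = new StringBuilder();

        for (int i = 0; i < words.size(); i++) {
            // Assumption: a missing tag is treated like "O" (not an entity).
            String tag = tags.get(i) == null ? "O" : tags.get(i);
            if (!tag.equals(lastTag)) {
                flush(lastTag, current, phrases);   // the tag changed, so close the open phrase
                lastTag = tag;
            }
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        flush(lastTag, current, phrases);           // close a phrase that ends the input
        return phrases;
    }

    private static void flush(String tag, StringBuilder current, List<EntityPhrase> out) {
        if (current.length() > 0) {
            out.add(new EntityPhrase(tag, current.toString()));
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("on", "the", "second", "Monday", "in", "October", ",", "is");
        List<String> tags = Arrays.asList("O", "DATE", "DATE", "DATE", "DATE", "DATE", "O", "O");
        for (EntityPhrase phrase : group(words, tags)) {
            System.out.println(phrase);             // prints "DATE: the second Monday in October"
        }
    }
}
```

One limitation of this kind of grouping: two distinct entities of the same type that sit directly next to each other, with no "O" token between them, are merged into a single phrase. Resetting the state at each sentence boundary keeps phrases from spanning sentences.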

score 0 · Accepted Answer

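The named entity tagger and the part-of-speech tagger are separate algorithms within the CoreNLP pipeline, and it seems the API consumer is left with the task of combining their results.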

Forgive my C#, but here is a simple class:

    public class NamedNounPhrase
    {
        public NamedNounPhrase()
        {
            Phrase = string.Empty;
            Tags = new List<string>();
        }

        public string Phrase { get; set; }

        public IList<string> Tags { get; set; }

    }

And the code to find all top-level noun phrases along with their associated named entity tags:

    private void _monkey()
    {

        ...

        var nounPhrases = new List<NamedNounPhrase>();

        foreach (CoreMap sentence in sentences.toArray())
        {
            var tree =
                (Tree)sentence.get(new TreeCoreAnnotations.TreeAnnotation().getClass());

            if (null != tree)
                _walk(tree, nounPhrases);
        }

        foreach (var nounPhrase in nounPhrases)
            Console.WriteLine(
                "{0} ({1})",
                nounPhrase.Phrase,
                string.Join(", ", nounPhrase.Tags)
                );
    }

    private void _walk(Tree tree, IList<NamedNounPhrase> nounPhrases)
    {
        if ("NP" == tree.value())
        {
            var nounPhrase = new NamedNounPhrase();

            foreach (Tree leaf in tree.getLeaves().toArray())
            {
                var label = (CoreLabel) leaf.label();
                nounPhrase.Phrase += (string) label.get(new CoreAnnotations.TextAnnotation().getClass()) + " ";
                nounPhrase.Tags.Add((string) label.get(new CoreAnnotations.NamedEntityTagAnnotation().getClass()));
            }

            nounPhrases.Add(nounPhrase);
        }
        else
        {
            foreach (var child in tree.children())
            {
                _walk(child, nounPhrases);
            }
        }
    }

Hope that helps!

score 0 · Accepted Answer

Thanks a lot, I was going to do the same. The Stanford NER API, however, supports classifyToCharacterOffsets (if I remember the name correctly) to get the whole phrase. I don't know, maybe it is just an implementation of your idea :D.
