nlp - Multi-term named entities in Stanford Named Entity Recognizer

Question

I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it is working fine. This is the code:

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());           
            }
        }
    }

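The problem I'm running into is with first names and last names. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I would like it to return "Joe Smith" as one term.

Can this be achieved through the recognizer itself, perhaps through configuration? So far I haven't found anything in the javadoc.

Thanks!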

score 20 · Accepted Answer

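This is because your inner for loop iterates over individual tokens (words) and adds them separately. You need to change it so that whole names are added at once.

One way is to replace the inner for loop with a regular indexed for loop containing a while loop that takes adjacent non-O tokens of the same class and adds them as a single entity.*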
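The loop described above can be sketched without any Stanford classes. This is only an illustration under stated assumptions: `Token` is a hypothetical stand-in for a tagged `CoreLabel` (word plus answer label), and `AdjacentLabelMerger` is an invented name, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjacentLabelMerger {

    /** Hypothetical stand-in for a tagged token: its text and its NER label. */
    public static final class Token {
        final String word;
        final String label;

        public Token(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    /** Joins each run of adjacent tokens sharing one non-"O" label into a "text/LABEL" string. */
    public static List<String> merge(List<Token> sentence) {
        List<String> entities = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            String label = sentence.get(i).label;
            if ("O".equals(label)) {
                continue; // background token, not part of any entity
            }
            StringBuilder text = new StringBuilder(sentence.get(i).word);
            // Consume the following tokens while they carry the same label.
            while (i + 1 < sentence.size() && label.equals(sentence.get(i + 1).label)) {
                i++;
                text.append(' ').append(sentence.get(i).word);
            }
            entities.add(text + "/" + label);
        }
        return entities;
    }
}
```

With the real classifier, the same loop would read the label from `AnswerAnnotation` and the text from `word()`, as in the question's code.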

Another way is to use this CRFClassifier method:

List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)

This gives you whole entities; you can extract their String form by calling substring on the original input with the returned offsets.

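*The models we distribute use a simple raw IO label scheme, in which tokens are labeled PERSON or LOCATION, so the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they are not used in the models we currently distribute (as of 2012).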
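To make the difference concrete, here is a minimal sketch of decoding IOB labels, which is not needed for the distributed IO models. It assumes the common convention that continuation tokens are labeled `I-PERS` (the note above only mentions `B-PERS`), and `IobDecoder` is a hypothetical name. The point is that a `B-` label always opens a new entity, so two adjacent entities of the same type are not glued together, which plain IO labels cannot express.

```java
import java.util.ArrayList;
import java.util.List;

public class IobDecoder {

    /**
     * Decodes parallel lists of words and IOB labels ("B-PERS", "I-PERS", "O")
     * into "text/TYPE" strings.
     */
    public static List<String> decode(List<String> words, List<String> labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder text = null; // text of the entity being built, or null if none is open
        String type = null;        // type of the open entity, e.g. "PERS"
        for (int i = 0; i < words.size(); i++) {
            String label = labels.get(i);
            if (text != null && label.equals("I-" + type)) {
                text.append(' ').append(words.get(i)); // continuation of the open entity
                continue;
            }
            // Any other label closes the open entity, if there is one.
            if (text != null) {
                entities.add(text + "/" + type);
                text = null;
                type = null;
            }
            // "B-X" opens an entity; a stray "I-X" is treated as an opening as well.
            if (label.startsWith("B-") || label.startsWith("I-")) {
                type = label.substring(2);
                text = new StringBuilder(words.get(i));
            }
        }
        if (text != null) {
            entities.add(text + "/" + type); // flush an entity that ends the sentence
        }
        return entities;
    }
}
```

For the words `Joe Smith Ann Lee left` labeled `B-PERS I-PERS B-PERS I-PERS O`, this yields two entities, whereas coalescing equal IO labels would produce the single entity "Joe Smith Ann Lee".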

score 5 · Accepted Answer

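The drawback of the classifyToCharacterOffsets method is that (AFAIK) you can't access the label of the entities.

As proposed by Christopher, here is an example of a loop that assembles "adjacent non-O things". This example also counts the number of occurrences.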

import java.util.HashMap;
import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

// Returns label -> (entity text -> number of occurrences).
// Assumes `classifier` is a field holding an already loaded classifier.
public HashMap<String, HashMap<String, Integer>> extractEntities(String text) {

    HashMap<String, HashMap<String, Integer>> entities =
            new HashMap<String, HashMap<String, Integer>>();

    for (List<CoreLabel> lcl : classifier.classify(text)) {

        int i = 0;
        while (i < lcl.size()) {
            String answer =
                    lcl.get(i).getString(CoreAnnotations.AnswerAnnotation.class);

            // Skip tokens that are not part of any entity.
            if (answer.equals("O")) {
                i++;
                continue;
            }

            // Start a new entity, then absorb the adjacent tokens with the same label.
            String value = lcl.get(i).getString(CoreAnnotations.ValueAnnotation.class);
            i++;
            while (i < lcl.size() && answer.equals(
                    lcl.get(i).getString(CoreAnnotations.AnswerAnnotation.class))) {
                value = value + " " +
                        lcl.get(i).getString(CoreAnnotations.ValueAnnotation.class);
                i++;
            }

            // Count the entity here, so that one ending the sentence is not lost.
            if (!entities.containsKey(answer))
                entities.put(answer, new HashMap<String, Integer>());
            Integer count = entities.get(answer).get(value);
            entities.get(answer).put(value, count == null ? 1 : count + 1);
        }
    }

    return entities;
}

score 3 · Accepted Answer

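I had the same problem, so I looked into it as well. The method proposed by Christopher Manning is efficient, but the delicate point is deciding which kinds of separator are appropriate. One could say that only a space should be allowed, e.g. "John Zorn" >> one entity. However, the form "J.Zorn" may also occur, so certain punctuation marks should be allowed too. But what about "Jack, James, Joe"? I might get 2 entities instead of 3 ("Jack James" and "Joe").

By digging a bit into the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities as single String objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersTokenizedInlineXML, we have: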

 private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
    final String background = flags.backgroundSymbol;
    String prevTag = background;
    for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
      IN wi = wordIter.next();
      String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));

      String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));

      String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
      if (!tag.equals(prevTag)) {
        if (!prevTag.equals(background) && !tag.equals(background)) {
          out.print("</");
          out.print(prevTag);
          out.print('>');
          out.print(before);
          out.print('<');
          out.print(tag);
          out.print('>');
        } else if (!prevTag.equals(background)) {
          out.print("</");
          out.print(prevTag);
          out.print('>');
          out.print(before);
        } else if (!tag.equals(background)) {
          out.print(before);
          out.print('<');
          out.print(tag);
          out.print('>');
        }
      } else {
        out.print(before);
      }
      out.print(current);
      String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));

      if (!tag.equals(background) && !wordIter.hasNext()) {
        out.print("</");
        out.print(tag);
        out.print('>');
        prevTag = background;
      } else {
        prevTag = tag;
      }
      out.print(afterWS);
    }
  }

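As explained before, they iterate over each word and check whether it has the same class (answer) as the previous one. For this, they take advantage of the fact that expressions not considered entities are flagged with the so-called backgroundSymbol (class "O"). They also use the BeforeAnnotation property, which holds the string separating the current word from the previous one. This last point solves the problem I raised at the start, namely choosing an appropriate separator.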
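The same idea can be sketched outside the library: instead of hard-coding a space, rejoin the tokens of an entity with whatever text originally separated them. This is a simplified illustration, not the Stanford implementation. `Token` is a hypothetical class whose `before` field plays the role of `BeforeAnnotation`, and "O" is assumed to be the background symbol.

```java
import java.util.ArrayList;
import java.util.List;

public class SeparatorAwareMerger {

    /** Hypothetical token: the original text preceding it, its own text, and its label. */
    public static final class Token {
        final String before;
        final String word;
        final String label;

        public Token(String before, String word, String label) {
            this.before = before;
            this.word = word;
            this.label = label;
        }
    }

    /** Rebuilds each entity using the original separators between its tokens. */
    public static List<String> merge(List<Token> sentence) {
        List<String> entities = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        String prevLabel = "O";
        for (Token t : sentence) {
            if (!t.label.equals(prevLabel)) {
                // The label changed: close the previous entity, if any, and start afresh.
                if (!"O".equals(prevLabel)) {
                    entities.add(text.toString());
                }
                text.setLength(0);
                if (!"O".equals(t.label)) {
                    text.append(t.word);
                }
            } else if (!"O".equals(t.label)) {
                // Same entity continues: keep the separator found in the input.
                text.append(t.before).append(t.word);
            }
            prevLabel = t.label;
        }
        if (!"O".equals(prevLabel)) {
            entities.add(text.toString()); // entity that ends the sentence
        }
        return entities;
    }
}
```

If the tokenizer splits "J.Zorn" into "J." and "Zorn" with nothing in between, the entity comes back as "J.Zorn", while "John" and "Zorn" separated by a space come back as "John Zorn".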

score 2 · Accepted Answer

Code for the above:

List<Triple<String, Integer, Integer>> result = classifier.classifyToCharacterOffsets(text);

for (Triple<String, Integer, Integer> triple : result)
{
    System.out.println(triple.first + " : " + text.substring(triple.second, triple.third));
}

score 2 · Accepted Answer

List<List<CoreLabel>> out = classifier.classify(text);
for (List<CoreLabel> sentence : out) {
    String s = "";
    String prevLabel = null;
    for (CoreLabel word : sentence) {
      String label = word.get(CoreAnnotations.AnswerAnnotation.class);
      if (prevLabel == null || prevLabel.equals(label)) {
         // Same label as the previous token: extend the current group.
         s = s + " " + word.word();
      }
      else {
        // Label changed: print the finished group unless it was background ("O").
        if (!prevLabel.equals("O"))
           System.out.println(s.trim() + '/' + prevLabel);
        s = word.word();
      }
      prevLabel = label;
    }
    // Print the last group; prevLabel is null only for an empty sentence.
    if (prevLabel != null && !prevLabel.equals("O"))
        System.out.println(s.trim() + '/' + prevLabel);
}

I just wrote a small piece of logic and it works fine. What I did is group words that have the same label when they are adjacent.

score 1 · Accepted Answer

Make use of the classifiers already provided to you. I believe this is what you are looking for:

private static String combineNERSequence(String text) {

    String serializedClassifier = "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier;
    try {
        classifier = CRFClassifier
                .getClassifier(serializedClassifier);
    } catch (ClassCastException | ClassNotFoundException | IOException e) {
        // Without a model there is nothing to classify with, so fail loudly
        // instead of hitting a NullPointerException further down.
        throw new IllegalStateException("Could not load " + serializedClassifier, e);
    }

    // Classify once and reuse the result for both printing and returning.
    String result = classifier.classifyWithInlineXML(text);
    System.out.println(result);

    //  FOR TSV FORMAT  //
    //System.out.print(classifier.classifyToString(text, "tsv", false));

    return result;
}
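Since this method returns the annotated text rather than a list of names, a further step is needed to pull the entities out. A rough sketch with a regular expression follows. It assumes the output wraps each entity as `<LABEL>text</LABEL>` with labels made of uppercase letters and underscores (such as PERSON or LOCATION), and that the input text itself contains no such tags. `InlineXmlEntities` is a hypothetical helper, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineXmlEntities {

    // Matches <LABEL>text</LABEL>; the backreference \1 requires the closing
    // tag to carry the same label as the opening tag.
    private static final Pattern ENTITY =
            Pattern.compile("<([A-Z_]+)>(.*?)</\\1>", Pattern.DOTALL);

    /** Returns one {label, text} pair per tagged entity, in order of appearance. */
    public static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }
}
```

Unlike classifyToCharacterOffsets, this route loses the character offsets, so it fits best when only the entity strings and their labels are needed.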

score 0 · Accepted Answer

Here is my full code. I use Stanford CoreNLP and wrote an algorithm to concatenate multi-term names.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.log4j.Logger;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/**
 * Created by Chanuka on 8/28/14 AD.
 */
public class FindNameEntityTypeExecutor {

private static Logger logger = Logger.getLogger(FindNameEntityTypeExecutor.class);

private StanfordCoreNLP pipeline;

public FindNameEntityTypeExecutor() {
    logger.info("Initializing Annotator pipeline ...");

    Properties props = new Properties();

    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");

    pipeline = new StanfordCoreNLP(props);

    logger.info("Annotator pipeline initialized");
}

List<String> findNameEntityType(String text, String entity) {
    logger.info("Finding entity type matches in the " + text + " for entity type, " + entity);

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    List<String> matches = new ArrayList<String>();

    for (CoreMap sentence : sentences) {

        int previousCount = 0;
        int count = 0;
        // traversing the words in the current sentence
        // a CoreLabel is a CoreMap with additional token-specific methods

        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);

            int previousWordIndex;
            if (entity.equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                count++;
                if (previousCount != 0 && (previousCount + 1) == count) {
                    previousWordIndex = matches.size() - 1;
                    String previousWord = matches.get(previousWordIndex);
                    matches.remove(previousWordIndex);
                    previousWord = previousWord.concat(" " + word);
                    matches.add(previousWordIndex, previousWord);

                } else {
                    matches.add(word);
                }
                previousCount = count;
            }
            else
            {
                count=0;
                previousCount=0;
            }


        }

    }
    return matches;
}
}

score 0 · Accepted Answer

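Another approach to handling multi-word entities. This code combines multiple tokens when they have the same annotation and appear consecutively.

Limitation:
If the same token has two different annotations, the last one is saved.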

private Document getEntities(String fullText) {

    Document entitiesList = new Document();
    NERClassifierCombiner nerCombClassifier = loadNERClassifiers();

    if (nerCombClassifier != null) {

        List<List<CoreLabel>> results = nerCombClassifier.classify(fullText);

        for (List<CoreLabel> coreLabels : results) {

            String prevLabel = null;
            String prevToken = null;

            for (CoreLabel coreLabel : coreLabels) {

                String word = coreLabel.word();
                String annotation = coreLabel.get(CoreAnnotations.AnswerAnnotation.class);

                if (!"O".equals(annotation)) {

                    if (prevLabel == null) {
                        prevLabel = annotation;
                        prevToken = word;
                    } else {

                        if (prevLabel.equals(annotation)) {
                            prevToken += " " + word;
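                        } else {
                            // A different entity type starts right away: save the finished one first.
                            entitiesList.put(prevToken, prevLabel);
                            prevLabel = annotation;
                            prevToken = word;
                        }
                    }
                } else {

                    if (prevLabel != null) {
                        entitiesList.put(prevToken, prevLabel);
                        prevLabel = null;
                    }
                }
            }

            // Save an entity that runs to the end of the sentence.
            if (prevLabel != null) {
                entitiesList.put(prevToken, prevLabel);
            }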
        }
    }

    return entitiesList;
}

Imports:

Document: org.bson.Document;
NERClassifierCombiner: edu.stanford.nlp.ie.NERClassifierCombiner;
