java - How can I split a text into sentences using the Stanford parser?

Question

How can I split a text or paragraph into sentences using the Stanford parser?

Is there any method that can extract sentences, such as the getSentencesFromString() that is provided for Ruby?

score 31 · Accepted Answer

You can check the DocumentPreprocessor class. Below is a short snippet. I think there may be other ways to do what you want.

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
   // Use SentenceUtils here, not the older Sentence class
   String sentenceString = SentenceUtils.listToString(sentence);
   sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
   System.out.println(sentence);
}

score 24 · Answer

I know there is already an accepted answer, but typically you would just grab the SentenceAnnotations from an annotated document.

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);       
  }

}

Source - http://nlp.stanford.edu/software/corenlp.shtml (halfway down)

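And if you are only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization. That will save you some load and processing time. Rock and roll. ~K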
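
A minimal sketch of that trimmed-down configuration, using only java.util.Properties so it runs without CoreNLP on the classpath. The class name SentenceOnlyConfig and the method sentenceOnlyProps() are illustrative names, not part of CoreNLP. The annotator list keeps "tokenize" because the sentence splitter works on tokens.

```java
import java.util.Properties;

class SentenceOnlyConfig {
    /** Builds pipeline properties that stop after sentence splitting. */
    static Properties sentenceOnlyProps() {
        Properties props = new Properties();
        // Only the tokenizer and the sentence splitter: no pos, lemma, ner, parse, or dcoref.
        props.setProperty("annotators", "tokenize, ssplit");
        return props;
    }

    public static void main(String[] args) {
        Properties props = sentenceOnlyProps();
        // With CoreNLP available, these properties would be handed to the pipeline
        // exactly as in the snippet above:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("annotators"));
    }
}
```

The rest of the snippet above (creating the Annotation, calling annotate, reading SentencesAnnotation) should stay the same; only the annotator list changes.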

score 17 · Answer

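The accepted answer has a few issues. First, the tokenizer transforms some characters, such as the character “ into the two characters ``. Second, joining the tokenized text back together with whitespace does not return the same text as before. As a result, the example in the accepted answer changes the input text in non-trivial ways.

However, the CoreLabel class that the tokenizer uses keeps track of the source characters each token maps to, so it is trivial to rebuild the proper string if you still have the original.

Approach 1 below shows the accepted answer's approach. Approach 2 shows my approach, which overcomes these issues.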

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}
//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);
//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence: sentences) {
    end = sentence.get(sentence.size()-1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));

This outputs:

My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
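
The difference between the two output lines can be reproduced without CoreNLP. The sketch below is only an illustration: the token list and the sentence end offsets are written by hand (they stand in for what the tokenizer and CoreLabel.endPosition() would supply), the sample text is a plain-ASCII variant of the paragraph above, and the names OffsetRejoinSketch, joinTokens and sliceByEndOffsets are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OffsetRejoinSketch {
    /** Approach 1 in miniature: glue tokens back together with single spaces. */
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    /** Approach 2 in miniature: cut the original text at each sentence's end offset. */
    static List<String> sliceByEndOffsets(String text, int[] sentenceEnds) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEnds) {
            // trim() drops the whitespace that separated this sentence from the previous one
            sentences.add(text.substring(start, end).trim());
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "My 1st sentence. Does it work? My third sentence.";

        // Hand-written tokens of the first sentence (assumption, not tokenizer output)
        List<String> firstSentenceTokens = Arrays.asList("My", "1st", "sentence", ".");
        System.out.println(joinTokens(firstSentenceTokens)); // My 1st sentence .

        // Hand-counted end offsets (exclusive) of the three sentences in paragraph
        int[] sentenceEnds = {16, 30, 49};
        System.out.println(String.join(" _ ", sliceByEndOffsets(paragraph, sentenceEnds)));
        // My 1st sentence. _ Does it work? _ My third sentence.
    }
}
```

Joining tokens inserts a space before the final period, while slicing the original string by offsets returns the text exactly as it was written.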

score 9 · Answer

Using the .NET C# package: this splits the sentences, gets the parentheses right, and preserves the original spaces and punctuation.

public class NlpDemo
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {            
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }            
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}

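Input: - The characters of this sentence possess a certain charm, often found in punctuation and prose. Is this a second sentence? It is indeed.

Output: 3 sentences ("?" is treated as an end-of-sentence delimiter)

Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects." the tokenizer correctly recognizes that the period at the end of "Mrs." is not an EOS, but it incorrectly marks the "!" inside the parentheses as an EOS and splits off "in all aspects." as a second sentence.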

score 1 · Answer

You can use the DocumentPreprocessor. It is very easy: just give it a file name.

    for (List<HasWord> sentence : new DocumentPreprocessor("pathto/filename.txt")) {
         // sentence is the list of words in one sentence
    }

score 0 · Answer

Here is a variation on @Kevin's answer that solves the question:

for (CoreMap sentence : sentences) {
    String sentenceText = sentence.get(TextAnnotation.class);
}

This gives you the sentence text without bothering with the other annotators.

score -4 · Answer

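public class k {

public static void main(String a[]){

    // Splits on single spaces; this yields words, not sentences
    String str = "This program splits a string based on space";
    String[] words = str.split(" ");
    for(String s:words){
        System.out.println(s);
    }
    // Splits on runs of whitespace, so repeated spaces do not produce empty strings
    str = "This     program  splits a string based on space";
    words = str.split("\\s+");
    for(String s:words){
        System.out.println(s);
    }
}
}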
