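Is there an algorithm I can use to extract simple sentences from a paragraph?

My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment.

I have researched this in sources such as Chae-Deug Park, but none of them discuss preparing simple sentences as training data.

Thanks in advance.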

2 Answers

2

Take a look at Apache OpenNLP. It has a SentenceDetector module, and its documentation includes examples of how to use it from both the command line and the API.

answered 2012-04-17T15:16:38.580
1

I used OpenNLP for the same purpose.

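import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

// FileNotFoundException and OpenNLP's InvalidFormatException are both IOExceptions.
public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws IOException {
    // Load the pre-trained English sentence model; try-with-resources closes the stream even on failure.
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}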

    //Failed at Hi.
    String paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone

    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

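It failed only where the input itself contained a human error. For example, the abbreviation "Dr." should be written with a capital D, and there should be at least one space between two sentences.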
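
Because these failures come from the input rather than from the model, one option is to clean the text before detection. The following is a minimal sketch and not part of OpenNLP: the class `SentenceInputNormalizer`, its `normalize` method, and the list of titles are hypothetical names chosen for illustration. It capitalizes a few known lowercase titles and inserts a space where a period is glued between a lowercase and an uppercase letter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SentenceInputNormalizer {

    // Hypothetical list of titles to capitalize; extend it for your own data.
    private static final Pattern LOWERCASE_TITLE =
            Pattern.compile("\\b(dr|mr|mrs|ms)\\.");

    // A period glued between a lowercase and an uppercase letter, e.g. "Door.Noone".
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=[a-z])\\.(?=[A-Z])");

    static String normalize(String paragraph) {
        // Capitalize the first letter of each known title: "dr." becomes "Dr."
        Matcher m = LOWERCASE_TITLE.matcher(paragraph);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String title = m.group(1);
            m.appendReplacement(sb,
                    Character.toUpperCase(title.charAt(0)) + title.substring(1) + ".");
        }
        m.appendTail(sb);
        // Insert the missing space after a sentence-ending period.
        return MISSING_SPACE.matcher(sb).replaceAll(". ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Close the Door.Noone is out there"));
        System.out.println(normalize("We went to meet dr. Kashyap and mrs. watson."));
    }
}
```

The normalized string can then be passed to breakIntoSentencesOpenNlp. The missing-space rule is deliberately narrow (lowercase letter, period, uppercase letter), so tokens such as "2.2", "U.S." and "www.thinkzarahatke.com" are left alone, but it can still misfire on mixed-case tokens, so treat it as a starting point only.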

You can also achieve this with a regular expression (RE), as follows.

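import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph) {
    List<String> sentences = new ArrayList<>();
    // A sentence starts at a character that is neither whitespace nor a terminator (. ! ?) and runs
    // to a terminator followed by whitespace or end of input; terminators inside a token are skipped.
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}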

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr., mrs.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at U.S.
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

However, its error rate is comparatively high. Another way is to use BreakIterator.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public static List<String> breakIntoSentencesBreakIterator(String paragraph) {
    List<String> sentences = new ArrayList<>();
    // Use the English-locale instance directly; getSentenceInstance is a static factory.
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceIterator.setText(paragraph);

    // Walk the boundaries front to back so the sentences come out in document order.
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next();
         end != BreakIterator.DONE;
         start = end, end = sentenceIterator.next()) {
        sentences.add(paragraph.substring(start, end));
    }

    return sentences;
}

Example:

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));


    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
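
The breaks after titles such as "Mr." noted in the comments above can be repaired after splitting. This is a minimal sketch, assuming the segments keep their trailing whitespace (as the `substring` calls in breakIntoSentencesBreakIterator do); `AbbreviationMerger`, `merge`, and the abbreviation list are hypothetical names and not part of the JDK. A segment whose last word is a known abbreviation is glued to the segment that follows it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AbbreviationMerger {

    // Hypothetical abbreviation list, compared case-insensitively.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "ms.", "dr."));

    static List<String> merge(List<String> segments) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String segment : segments) {
            pending.append(segment);
            // Only emit a sentence when the segment does not end in an abbreviation.
            if (!endsWithAbbreviation(segment)) {
                merged.add(pending.toString().trim());
                pending.setLength(0);
            }
        }
        // Flush whatever is left if the input ended on an abbreviation.
        if (pending.length() > 0) {
            merged.add(pending.toString().trim());
        }
        return merged;
    }

    private static boolean endsWithAbbreviation(String segment) {
        String trimmed = segment.trim();
        String lastWord = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hand-built segments standing in for a splitter's output.
        List<String> segments = Arrays.asList("I cant believe. ", "Mr. ", "Wilson can come any moment.");
        merge(segments).forEach(System.out::println);
    }
}
```

A fixed list cannot tell an abbreviation in mid-sentence from one that really ends a sentence, so this trades one kind of error for another and should be checked against your own text.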

Benchmark

  • Custom RE: 7 ms
  • BreakIterator: 143 ms
  • OpenNLP: 255 ms
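
The measurement setup behind these figures is not given. A minimal sketch of how such a comparison could be timed follows; `SplitterTiming`, `timeMillis`, and the run counts are hypothetical choices for illustration. It makes some untimed warm-up calls first so that JIT compilation does not dominate the result, then times a fixed number of calls with `System.nanoTime`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class SplitterTiming {

    // Forward-iterating BreakIterator splitter, used here only as something to time.
    static List<String> splitWithBreakIterator(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(paragraph.substring(start, end));
        }
        return sentences;
    }

    // Returns the wall-clock time in milliseconds spent on the timed calls only.
    static long timeMillis(Function<String, List<String>> splitter, String text,
                           int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            splitter.apply(text);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            splitter.apply(text);
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        String text = "Hi. How are you? This is Mike.";
        long ms = timeMillis(SplitterTiming::splitWithBreakIterator, text, 1_000, 10_000);
        System.out.println("BreakIterator: " + ms + " ms for 10000 calls");
    }
}
```

The other splitters can be passed to `timeMillis` in the same way. For OpenNLP, loading the model once outside the timed loop keeps file I/O out of the measurement. A dedicated harness such as JMH gives more dependable numbers than a hand-rolled loop.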
answered 2015-07-26T11:13:18.137