java - Scoring each sentence in a line based on a tag and summarizing the text (Java)

Question

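I am trying to create a summarizer in Java. I use the Stanford Log-linear Part-Of-Speech Tagger to tag the words, then score each sentence for certain tags, and finally print the sentences with high scores as the summary. Here is the code: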

    MaxentTagger tagger = new MaxentTagger("taggers/bidirectional-distsim-wsj-0-18.tagger");

    BufferedReader reader = new BufferedReader( new FileReader ("C:\\Summarizer\\src\\summarizer\\testing\\testingtext.txt"));
    String line  = null;
    int score = 0;
    StringBuilder stringBuilder = new StringBuilder();
    File tempFile = new File("C:\\Summarizer\\src\\summarizer\\testing\\tempFile.txt");
    Writer writerForTempFile = new BufferedWriter(new FileWriter(tempFile));


    String ls = System.getProperty("line.separator");
    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        Pattern pattern = Pattern.compile("[.?!]"); //Find new line
        Matcher matcher = pattern.matcher(tagged);
        while(matcher.find())
        {
            Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
            Matcher tagMatcher = tagFinder.matcher(matcher.group());
            while(tagMatcher.find())
            {
                score++; // increase score of sentence for every occurrence of adjective tag
            }
            if(score > 1)
                writerForTempFile.write(stringBuilder.toString());
            score = 0;
            stringBuilder.setLength(0);
        }

    }

    reader.close();
    writerForTempFile.close();

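The code above does not work. However, if I cut the work short and generate a score for every line (not every sentence), it works. But summaries are not generated that way, are they? Here is the code for that (all declarations are the same as above):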

    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
        Matcher tagMatcher = tagFinder.matcher(tagged);
        while(tagMatcher.find())
        {
            score++;  // increase score of line for every occurrence of adjective tag
        }
        if(score > 1)
            writerForTempFile.write(stringBuilder.toString());
        score = 0;
        stringBuilder.setLength(0);
    }

Edit 1:

Some information on what the MaxentTagger does. Sample code showing it at work:

import java.io.IOException;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagText {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException {

        // Initialize the tagger
        MaxentTagger tagger = new MaxentTagger(
                "taggers/bidirectional-distsim-wsj-0-18.tagger");

        // The sample string
        String sample = "This is a sample sentence";

        // The tagged string
        String tagged = tagger.tagString(sample);

        // Output the result
        System.out.println(tagged);
    }
}

Output:

This/DT is/VBZ a/DT sample/NN sentence/NN
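
With output in this `word/TAG` form, scoring comes down to counting tokens that carry a given tag. A minimal sketch, assuming the `/` separator shown above and Penn Treebank tag names; `TagCounter` and `countTag` are illustrative names, not part of the Stanford tagger API. Note that the plain pattern `/JJ` used in the question also matches `/JJR` and `/JJS` (comparative and superlative adjectives), which may or may not be intended. The lookahead below restricts the count to the exact tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Stanford tagger API.
class TagCounter {

    // Counts tokens of the form word/TAG whose tag equals `tag` exactly.
    // The lookahead requires whitespace or end of input after the tag,
    // so "JJ" does not also count "JJR" or "JJS".
    static int countTag(String tagged, String tag) {
        Pattern p = Pattern.compile("/" + Pattern.quote(tag) + "(?=\\s|$)");
        Matcher m = p.matcher(tagged);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(countTag(tagged, "NN")); // prints 2
        System.out.println(countTag(tagged, "JJ")); // prints 0
    }
}
```

If comparatives and superlatives should count as well, dropping the lookahead gives the same behaviour as the question's `/JJ` pattern.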

Edit 2:

I changed the code to use BreakIterator to find the sentence breaks. The problem still persists.

    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(tagged);
        int end, start = bi.first();
        while ((end = bi.next()) != BreakIterator.DONE)
        {
            String sentence = tagged.substring(start, end);
            Pattern tagFinder = Pattern.compile("/JJ");
            Matcher tagMatcher = tagFinder.matcher(sentence);
            while(tagMatcher.find())
            {
                score++;
            }
            scoreTracker.add(score);
            if(score > 1)
                writerForTempFile.write(stringBuilder.toString());
            score = 0;
            stringBuilder.setLength(0);
            start = end;
        }
    }

score 3 · Accepted Answer

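Finding sentence breaks can be a bit more involved than just looking for [.?!]. Consider using BreakIterator.getSentenceInstance().

Its performance is actually quite similar to LingPipe's (more complex) implementation, and better than the one in OpenNLP (at least in my own tests).

Sample code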

BreakIterator bi = BreakIterator.getSentenceInstance();
bi.setText(text);
int end, start = bi.first();
while ((end = bi.next()) != BreakIterator.DONE) {
    String sentence = text.substring(start, end);
    start = end;
}
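
For reference, the same loop can be wrapped into a self-contained class that collects the sentences into a list. This is only a sketch: the `SentenceSplitter` class, the `split` method, and the choice of `Locale.US` are assumptions made for illustration, and the segments are trimmed because `BreakIterator` includes trailing whitespace in each one.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative wrapper around the JDK's sentence BreakIterator.
class SentenceSplitter {

    // Splits text into trimmed, non-empty sentences.
    static List<String> split(String text) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints each of the three sentences on its own line.
        for (String s : split("It was a cold day. Was it raining? Nobody knew!")) {
            System.out.println(s);
        }
    }
}
```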

Edit

I think this is what you are looking for:

    Pattern tagFinder = Pattern.compile("/JJ");
    BufferedReader reader = getMyReader();
    String line = null;
    while ((line = reader.readLine()) != null) {
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(line);
        int end, start = bi.first();
        while ((end = bi.next()) != BreakIterator.DONE) {
            String sentence = line.substring(start, end);
            String tagged = tagger.tagString(sentence);
            int score = 0;
            Matcher tag = tagFinder.matcher(tagged);
            while (tag.find())
                score++;
            if (score > 1)
                writerForTempFile.write(sentence + ls); // a Writer has no println(); ls is the line separator from the question
            start = end;
        }
    }
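
The important design choice in the snippet above is that the text is split into sentences before tagging, so the breaks are found in the raw text and every score belongs to exactly one sentence. A self-contained sketch of the same flow follows. To keep it runnable without the Stanford models, the tagger is passed in as a `Function<String, String>` (with the real library this could presumably be `tagger::tagString`); the `SentenceScorer` class, the `selectSentences` method, the `minScore` parameter, and the stand-in tagger in `main` are all illustrative assumptions, not from the original post.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: split first, then tag and score each sentence on its own.
class SentenceScorer {

    private static final Pattern ADJECTIVE_TAG = Pattern.compile("/JJ");

    // Returns the sentences whose tagged form contains more than minScore adjective tags.
    static List<String> selectSentences(String text, Function<String, String> tagger, int minScore) {
        List<String> selected = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String sentence = text.substring(start, end).trim();
            int score = 0; // declared inside the loop, so it resets for every sentence
            Matcher m = ADJECTIVE_TAG.matcher(tagger.apply(sentence));
            while (m.find()) {
                score++;
            }
            if (score > minScore) {
                selected.add(sentence);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Stand-in tagger for demonstration only: marks two fixed words as adjectives.
        Function<String, String> fakeTagger =
                s -> s.replaceAll("\\b(quick|lazy)\\b", "$1/JJ");
        String text = "The quick fox jumps over the lazy dog. The dog sleeps.";
        // Only the first sentence has more than one adjective tag.
        System.out.println(selectSentences(text, fakeTagger, 1));
    }
}
```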

score 2 · Accepted Answer

Without having understood everything, I think your code should look like this:

    int lastMatch = 0;// Added

    Pattern pattern = Pattern.compile("[.?!]"); //Find new line
    Matcher matcher = pattern.matcher(tagged);
    while(matcher.find())
    {
        Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag

        // HERE START OF MY CHANGE
        String sentence = tagged.substring(lastMatch, matcher.end());
        lastMatch = matcher.end();
        Matcher tagMatcher = tagFinder.matcher(sentence);
        // HERE END OF MY CHANGE

        while(tagMatcher.find())
        {
            score++; // increase score of sentence for every occurrence of adjective tag
        }
        if(score > 1)
            writerForTempFile.write(sentence);
        score = 0;
    }
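
The reason the original loop never scored anything is that `matcher.group()` returns only the text matched by `[.?!]`, that is, a single punctuation character, so searching it for `/JJ` can never succeed. The change above searches the substring between the previous match and the current one instead. A small sketch contrasting the two approaches, using a hand-written tagged string rather than real tagger output (`GroupVsSubstring` and its method names are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative comparison of the original and the corrected matching logic.
class GroupVsSubstring {

    private static final Pattern SENTENCE_END = Pattern.compile("[.?!]");
    private static final Pattern ADJECTIVE_TAG = Pattern.compile("/JJ");

    // Counts how often pattern p occurs in s.
    private static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Original approach: search only the text matched by [.?!].
    static int adjectivesViaGroup(String tagged) {
        Matcher end = SENTENCE_END.matcher(tagged);
        int total = 0;
        while (end.find()) {
            total += count(ADJECTIVE_TAG, end.group()); // group() is just ".", "?" or "!"
        }
        return total;
    }

    // Corrected approach: search the text between the previous match and this one.
    static int adjectivesViaSubstring(String tagged) {
        Matcher end = SENTENCE_END.matcher(tagged);
        int total = 0;
        int lastMatch = 0;
        while (end.find()) {
            total += count(ADJECTIVE_TAG, tagged.substring(lastMatch, end.end()));
            lastMatch = end.end();
        }
        return total;
    }

    public static void main(String[] args) {
        String tagged = "A/DT nice/JJ warm/JJ day/NN ./.";
        System.out.println(adjectivesViaGroup(tagged));     // prints 0
        System.out.println(adjectivesViaSubstring(tagged)); // prints 2
    }
}
```

Because a tagged period appears as `./.`, the pattern `[.?!]` matches twice per sentence end here. That does not change the adjective count, but it does produce an extra `/.` fragment, which is one more argument for splitting the raw text before tagging.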
