java - Extract noun words and the original sentence from POS tags

Question

I want to extract the nouns from a sentence and recover the original sentence from its POS tags.

 // Extract the words before _NNP and _NN below, and recover the original sentence from the POS tags.
 Original sentence: Hi. How are you? This is Mike.
 POS tags: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN

I tried something like this:

    String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";


    String re1 = "((?:[a-z][a-z0-9_]*))";   // Variable Name 1
    String re2 = ".*?"; // Non-greedy match on filler
    String re3 = "(_)"; // Any Single Character 1
    String re4 = "(NNP)";   // Word 1

    Pattern p = Pattern.compile(re1 + re2 + re3 + re4, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find()) {
        String var1 = m.group(1);
        System.out.print(  var1.toString()  );
    }
}

The output is Hi, but I need a list of all the nouns in the sentence.

score 5 · Accepted Answer

To extract the nouns, do the following:

public static String[] extractNouns(String sentenceWithTags) {
    // Split String into array of Strings whenever there is a tag that starts with "_NN"
    // followed by zero, one or two more letters (like "_NNP", "_NNPS", or "_NNS")
    String[] nouns = sentenceWithTags.split("_NN\\w?\\w?\\b");
    // remove all but last word (which is the noun) in every String in the array
    for(int index = 0; index < nouns.length; index++) {
        nouns[index] = nouns[index].substring(nouns[index].lastIndexOf(" ") + 1)
        // Remove all non-word characters from extracted Nouns
        .replaceAll("[^\\p{L}\\p{Nd}]", "");
    }
    return nouns;
}

To extract the original sentence, do the following:

public static String extractOriginal(String sentenceWithTags) {
    return sentenceWithTags.replaceAll("_([A-Z]*)\\b", "");
}

Proof that it works:

public static void main(String[] args) {
    String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
    System.out.println(java.util.Arrays.toString(extractNouns(sentence)));
    System.out.println(extractOriginal(sentence));
}

Output:

[Hi, Mike]
Hi. How are you? This is Mike.

Note: for the regex that removes all non-word characters (such as punctuation) from the extracted nouns, I used this Stack Overflow question/answer.
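One caveat with the split-based approach: if the tagged sentence does not end with a noun tag, the text after the last noun tag stays in the array as a trailing element and ends up in the result (for example, "Hi_NNP there_RB" yields "thereRB" as a second entry). A minimal sketch of an alternative that uses Pattern and Matcher and collects only the tokens that actually carry an _NN* tag follows. The class name TaggedNouns and the assumption that tokens are separated by whitespace are illustrative choices, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedNouns {
    // A run of non-whitespace characters (the token), followed by a noun tag:
    // "_NN" plus up to two more uppercase letters (NN, NNS, NNP, NNPS).
    // The negative lookahead requires the tag to end the token.
    private static final Pattern NOUN_TOKEN =
            Pattern.compile("(\\S+?)_NN[A-Z]{0,2}(?!\\S)");

    static List<String> extractNouns(String sentenceWithTags) {
        List<String> nouns = new ArrayList<>();
        Matcher m = NOUN_TOKEN.matcher(sentenceWithTags);
        while (m.find()) {
            // Strip punctuation such as the trailing "." in "Mike."
            String noun = m.group(1).replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (!noun.isEmpty()) {
                nouns.add(noun);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(extractNouns(sentence)); // prints [Hi, Mike]
    }
}
```

Because only matched tokens are collected, text that follows the last noun is simply ignored rather than returned as a spurious entry.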

score 1 · Accepted Answer

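Use while (m.find()) instead of if (m.find()) to iterate over all matches.

In addition, the regex can be simplified considerably:

If you don't need to capture data, don't add parentheses (usually).
Your use of ((?:...)) is very odd. A non-capturing group nested directly inside a capturing group serves no purpose.
I'm not sure the .*? part does what you expect. If you want to match a dot, use [.] instead.

So, try ([a-z][a-z0-9_]*)[.]_NNP instead.

Alternatively, you can use a positive lookahead: [a-z][a-z0-9_]*(?=[.]_NNP). In that case, use m.group() to access the matched text.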
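A minimal sketch of the loop described above, applied to the input from the question. The pattern is a slight generalization of the lookahead suggested in this answer: it assumes the dot before the tag is optional and that any tag starting with _NN counts as a noun, so that "Mike._NN" is found as well. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllNouns {
    static List<String> findNouns(String txt) {
        // A word, followed (without consuming it) by an optional dot and a tag starting with _NN.
        // CASE_INSENSITIVE is kept from the question so that "Hi" and "Mike" match [a-z].
        Pattern p = Pattern.compile("[a-z][a-z0-9]*(?=[.]?_NN)", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(txt);
        List<String> nouns = new ArrayList<>();
        // while, not if: keep calling find() until no further match exists
        while (m.find()) {
            nouns.add(m.group()); // the lookahead is not part of the match
        }
        return nouns;
    }

    public static void main(String[] args) {
        String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(findNouns(txt)); // prints [Hi, Mike]
    }
}
```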

score 1 · Accepted Answer

This should work:

import java.util.ArrayList;

public class Test {

    public static final String NOUN_REGEX = "[a-zA-Z]*_NN\\w?\\w?\\b";

    public static ArrayList<String> extractNounsByRegex(String sentenceWithTags) {
        ArrayList<String> nouns = new ArrayList<String>();
        String[] words = sentenceWithTags.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].matches(NOUN_REGEX)) {
                System.out.println(" Matched ");
                // Remove the suffix _NN* and retain [a-zA-Z]*
                nouns.add(words[i].replaceAll("_NN\\w?\\w?\\b", ""));
            }
        }
        return nouns;
    }

    public static String extractOriginal(String word) {
        return word.replaceAll("_NN\\w?\\w?\\b", "");
    }

    public static void main(String[] args) {
        // String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        String sentence = "Eiffel_NNP tower_NN is_VBZ in_IN paris_NN Hi_NNP How_WRB are_VBP you_PRP This_DT is_VBZ Mike_NNP Barrack_NNP Obama_NNP is_VBZ a_DT president_NN this_VBZ";
        System.out.println(extractNounsByRegex(sentence).toString());
        System.out.println(sentence);
    }
}
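Note that NOUN_REGEX above only matches tokens consisting solely of letters before the tag, so a token such as "Mike._NN" from the question is skipped, and extractOriginal strips noun tags only. A sketch of a variant that keeps the same split-on-whitespace idea but separates each token at its last underscore might look like the following. It assumes every token has the form word_TAG; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TagSplitter {
    // Collect the words whose tag starts with "NN" (NN, NNS, NNP, NNPS).
    static List<String> nouns(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) {
                continue; // token carries no tag
            }
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN")) {
                // Drop punctuation attached to the word, e.g. "Mike." becomes "Mike"
                String word = token.substring(0, sep).replaceAll("[^\\p{L}\\p{Nd}]", "");
                if (!word.isEmpty()) {
                    result.add(word);
                }
            }
        }
        return result;
    }

    // Rebuild the sentence by dropping everything from the last underscore of each token.
    static String original(String tagged) {
        StringBuilder sb = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');
            String word = sep < 0 ? token : token.substring(0, sep);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(nouns(sentence));    // prints [Hi, Mike]
        System.out.println(original(sentence)); // prints Hi. How are you? This is Mike.
    }
}
```

Splitting at the last underscore rather than matching a fixed tag pattern means tags that are not purely alphabetic are removed too, at the cost of collapsing runs of whitespace into single spaces.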
