java - Using Apache POI with RegEx to extract Uppercase Words

Question

I am working on a project to extract uppercase words from a .doc file in Java. I am using a regex, but the regex below was used by someone else in an old .vba script. I need to find all uppercase words that are surrounded by parentheses, for example (WORD). I know the regex below gives me a dangling meta character error, so what should the regex be?

private static final String REGEX = "(*[A-Z]*[A-Z]*)";
private void parseWordText(File file) throws IOException { 
    FileInputStream fs = new FileInputStream(file); 
    HWPFDocument doc = new HWPFDocument(fs); 
    WordExtractor we = new WordExtractor(doc); 
    if (we.getParagraphText() != null) { 
        String[] dataArray = we.getParagraphText(); 
        for (int i = 0; i < dataArray.length; i++) { 
            String data = dataArray[i].toString(); 
            Pattern p = Pattern.compile(REGEX); 
            Matcher m = p.matcher(data); 
            List<String> sequences = new Vector<String>(); 
            while (m.find()) { 
                sequences.add(data.substring(m.start(), m.end())); 
                System.out.println(data.substring(m.start(), m.end())); 
            } 
        } 
    } 
}

With the code above and the regex, I am getting two uppercase letters, not just the all-uppercase words with the parentheses.

score 1 · Accepted Answer

Parentheses are reserved characters in regular expressions, so the first * has nothing to modify. At a minimum, you need to escape them:

\(*[A-Z]*[A-Z]*\)

But don't stop reading yet! Note that the regex above is the same as:

\(*[A-Z]*\)
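A minimal sketch for checking both claims in Java, assuming the patterns are written as string literals (so each backslash is doubled). The class and method names are illustrative and not from the original post.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeCheck {
    // The two escaped forms, written as Java string literals.
    static final String LONG_FORM = "\\(*[A-Z]*[A-Z]*\\)";
    static final String SHORT_FORM = "\\(*[A-Z]*\\)";

    // Returns true if the regex compiles, false if Java rejects its syntax.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns true if both escaped forms give the same whole-string verdict.
    static boolean agree(String input) {
        return Pattern.matches(LONG_FORM, input) == Pattern.matches(SHORT_FORM, input);
    }

    public static void main(String[] args) {
        // Unescaped: "(" opens a group, so the "*" after it has nothing to repeat.
        System.out.println(compiles("(*[A-Z]*[A-Z]*)")); // false
        // Escaped: the parentheses are now literal characters.
        System.out.println(compiles(LONG_FORM)); // true
        // Spot-check the equivalence on a few sample inputs.
        for (String s : new String[] {"(WORD)", "((AB)", ")", "(word)"}) {
            System.out.println(s + " -> " + agree(s)); // true for each sample
        }
    }
}
```

The spot check is not a proof, but it reflects why the two forms coincide: two adjacent `[A-Z]*` runs accept exactly the same strings as one.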

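Most importantly, though, I don't think that is the regex you want. I think you are trying to capture a non-zero number of consecutive uppercase letters surrounded by parentheses, that is: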

\([A-Z]+\)
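As a rough sketch of how this pattern might be used from Java, the snippet below collects every match from a string. `ParenWordFinder` and `find` are hypothetical names, and the sample sentence is made up; in the question's code the input would be each paragraph returned by `WordExtractor`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenWordFinder {
    // The regex \([A-Z]+\) as a Java string literal (backslashes doubled).
    // Compiled once here instead of once per paragraph.
    private static final Pattern PAREN_WORD = Pattern.compile("\\([A-Z]+\\)");

    // Returns every parenthesized all-uppercase word in the text,
    // with the surrounding parentheses included.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PAREN_WORD.matcher(text);
        while (m.find()) {
            found.add(m.group()); // same as text.substring(m.start(), m.end())
        }
        return found;
    }

    public static void main(String[] args) {
        // "(Esa)" is skipped because it contains lowercase letters.
        System.out.println(find("The (NASA) report cites (Esa) and (ISO)."));
        // prints [(NASA), (ISO)]
    }
}
```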

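You'll notice that '+' means one or more matches, and that I stopped repeating the left paren. For bonus points, you may want to handle whitespace at the beginning or end of the parentheses: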

\(\s*[A-Z]+\s*\)

Note, however, that this will match across line breaks, because \s also matches newline characters. Hope that helps!
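To illustrate the newline caveat, here is a small sketch comparing the whitespace-tolerant pattern with a variant that allows only spaces and tabs. The `[ \t]*` variant is an assumption about what you might want if matches should stay on one line; it is not part of the answer above, and the names are hypothetical.

```java
import java.util.regex.Pattern;

class ParenWordWhitespace {
    // The whitespace-tolerant pattern: \s also matches \n and \r.
    static final Pattern ANY_WS = Pattern.compile("\\(\\s*[A-Z]+\\s*\\)");

    // Assumed alternative: tolerate only spaces and tabs inside the parentheses.
    static final Pattern SAME_LINE = Pattern.compile("\\([ \\t]*[A-Z]+[ \\t]*\\)");

    // Returns true if the pattern matches anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String padded = "see ( WORD ) here";
        String split = "see (WORD\n) here";
        System.out.println(found(ANY_WS, padded));    // true
        System.out.println(found(SAME_LINE, padded)); // true
        System.out.println(found(ANY_WS, split));     // true: \s* consumes the newline
        System.out.println(found(SAME_LINE, split));  // false: newline is not allowed
    }
}
```

Which variant fits depends on whether a parenthesized word can legitimately wrap across lines in the extracted text.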
