java - Java regex does not match outside the ASCII range and behaves differently from Python regex

Translated from: https://stackoverflow.com/questions/49409074 2018-03-21T14:32:23.687


I want to filter strings from a document the same way sklearn's CountVectorizer does. It uses the following regular expression: (?u)\b\w\w+\b. This Java code should behave the same way:

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");
Matcher matcher = regex.matcher("this is the document.!? äöa m²");

while(matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}

However, it does not produce the desired output, which is what Python gives:

this
is
the
document
äöa
m²

Instead, it prints:

this
is
the
document

How can I include non-ASCII characters, as Python's regex does?
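A likely explanation, sketched here rather than taken from the original post: in Java the inline flag (?u) enables Pattern.UNICODE_CASE, which only affects case-insensitive matching. Unicode-aware \w and \b are enabled by (?U), that is, Pattern.UNICODE_CHARACTER_CLASS. Even then, Java's \w does not include ² (U+00B2, general category No), so the third pattern below widens the character class with \p{N} and replaces \b with lookarounds. The class name UnicodeTokens, the helper tokens, and that third pattern are illustrative assumptions, and the result only approximates Python's \w.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokens {
    // Hypothetical helper: collects every match of regex found in text.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The question's sample text, written with Unicode escapes so the
        // source compiles regardless of the file encoding.
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // (?u) is UNICODE_CASE in Java, so \w stays ASCII-only here.
        System.out.println(tokens("(?u)\\b\\w\\w+\\b", doc));

        // (?U) is UNICODE_CHARACTER_CLASS: \w and \b follow Unicode properties.
        System.out.println(tokens("(?U)\\b\\w\\w+\\b", doc));

        // Assumption: add \p{N} so "other number" characters count as token
        // characters, and use lookarounds because \b does not treat them as
        // word characters.
        String w = "[\\w\\p{N}]";
        System.out.println(tokens("(?U)(?<!" + w + ")" + w + "{2,}(?!" + w + ")", doc));
    }
}
```

With the sample text, the first call yields the same four tokens as in the question, the second additionally yields äöa, and the third also yields m². How the non-ASCII tokens are displayed depends on the console encoding. Note that Java's Unicode \w also matches combining marks, which Python's \w does not, so the two tokenizers can still differ on other inputs.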
