java - Performance of Java regular expressions using Pattern.CASE_INSENSITIVE

Question

I have a very simple regular expression that I'm using:

%%(products?)%%

Now I want it to match both Products? and products?. The obvious answer is to use the CASE_INSENSITIVE flag when compiling the pattern:

Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE)

However, the documentation says that "specifying this flag may impose a slight performance penalty." So I came up with an alternative that does not use the flag:

Pattern.compile("%%([Pp]roducts?)%%")

My question is: which of the two performs better?

score 3 · Accepted Answer

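The case-insensitive version is equivalent to

Pattern.compile("%%([Pp][Rr][Oo][Dd][Uu][Cc][Tt][Ss]?)%%")

so it is clear that it incurs some performance penalty.

In your case, therefore, your last version ("%%([Pp]roducts?)%%") is slightly more efficient (and also more restrictive, since only the first letter may vary in case).
However, in this case (and probably in most cases), I'd say the penalty is negligible. If your application is very performance-sensitive, you can always run a benchmark to see whether the speed-up is noticeable.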
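
To make the "more restrictive" remark concrete, here is a minimal sketch that checks which inputs each of the two patterns from the question accepts. The class name `CaseVariants`, the helper `matches`, and the sample inputs are illustrative assumptions, not part of the question.

```java
import java.util.regex.Pattern;

// Illustrative only: class name, helper, and sample inputs are assumptions.
class CaseVariants {
    // The two candidate patterns from the question.
    static final Pattern FLAGGED = Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE);
    static final Pattern BRACKETED = Pattern.compile("%%([Pp]roducts?)%%");

    // True if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"%%product%%", "%%Products%%", "%%PRODUCTS%%", "%%pRoDuCt%%"};
        for (String input : inputs) {
            System.out.printf("%-14s flagged=%-5b bracketed=%b%n",
                    input, matches(FLAGGED, input), matches(BRACKETED, input));
        }
    }
}
```

The flagged pattern accepts all four inputs, while the bracketed one accepts only the first two, so the two alternatives are interchangeable only if the input never varies beyond the case of the first letter.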

score 2 · Accepted Answer

Actually, there is a significant difference between the methods.

While Pattern.compile("%%(products?)%%", Pattern.CASE_INSENSITIVE) might seem less efficient than Pattern.compile("%%([Pp]roducts?)%%") at first glance, its internal functioning is not exactly that of comparing each character with both its lowercase and uppercase counterparts. As far as I understand it, the first method does a range check against Unicode's lowercase and uppercase blocks, while the second makes literal comparisons.
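
On the Unicode point, one documented detail is worth keeping in mind: according to the `Pattern` documentation, CASE_INSENSITIVE on its own only folds case for US-ASCII characters, and Unicode-aware folding additionally requires UNICODE_CASE. The following is a small sketch of that behaviour; the class name `AsciiVsUnicodeCase` and the sample word are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: class name and sample strings are assumptions.
class AsciiVsUnicodeCase {
    // Compiles the regex with the given flags and tests the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9t\u00e9"; // "ete" with acute accents on both e's
        String upper = "\u00c9T\u00c9"; // the same word in upper case

        // ASCII-only folding: 't' matches 'T', but the accented letters do not fold.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper));

        // Unicode-aware folding: the accented letters fold as well.
        System.out.println(matches(lower,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upper));
    }
}
```

For the purely ASCII pattern in the question the two flag combinations accept the same inputs, so this distinction only matters once non-ASCII letters are involved.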

I don't know much more than that, but the important part is this simple yet very interesting test (results on my machine are included at the end):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String base = "I have a product that is the product of my hard work." 
  + "Products are always nice, because I can win cash if I sell my products.\n" 
  + "The product of me making my product is cash, because cash is the product of selling my product.\n" 
  + "With the cash I win with my product, I can buy other people's products.";

int processRepeats = 1000000; //One million runs, enough to take time for each clocking.
int averageRepeats = 10;

long averager = 0;
int count = 0;

//Switch the commenting to test the opposing method.
Pattern p = Pattern.compile("products?", Pattern.CASE_INSENSITIVE);
//Pattern p = Pattern.compile("[Pp]roducts?");
Matcher m;
long clocking;
for (int i = 0; i < averageRepeats; i++) {
  clocking = System.nanoTime();
  for (int ii = 0; ii < processRepeats; ii++) {
    m = p.matcher(base); //Here because the "base" would change in a real environment.
    while (m.find()) {
      count++;
    }
  }
  clocking = System.nanoTime() - clocking;
  averager += clocking;
  //System.out.printf("This method found %9d matches in %15d nanos [%9.3f ms]\n", count, clocking, clocking / 1000000f);
}
System.out.printf("This method averages %15d nanos [%16.3f ms] for %d times executing %d runs.\n",
    averager / averageRepeats, (averager / (float) averageRepeats) / 1000000f, averageRepeats, processRepeats);

//RESULTS ON MY MACHINE:

//FIRST METHOD: [3 runs to demonstrate/guarantee consistency]
//This method averages      5024404693 nanos [        5024,404 ms] for 10 times executing 1000000 runs.
//This method averages      5021385539 nanos [        5021,386 ms] for 10 times executing 1000000 runs.
//This method averages      5017170143 nanos [        5017,170 ms] for 10 times executing 1000000 runs.

//SECOND METHOD: [same deal]
//This method averages      5806310774 nanos [        5806,311 ms] for 10 times executing 1000000 runs.
//This method averages      5809879747 nanos [        5809,880 ms] for 10 times executing 1000000 runs.
//This method averages      5804277386 nanos [        5804,277 ms] for 10 times executing 1000000 runs.

As you can see, not only is the first method faster (at least on the machine it was run on), but the difference of almost 800 ms (0.8 s) per million runs also adds up over a large number of runs, so the impact might not be as negligible as one might think!
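
Hand-rolled timings like the one above can be skewed by JIT warm-up and by the order in which the variants run, so it may be worth repeating the measurement with a warm-up phase before trusting the numbers. Below is a minimal, self-contained sketch of that idea using the same test string. The class name `RegexTiming`, the method names, and the iteration counts are assumptions, and a dedicated harness such as JMH would give more reliable results.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: names and iteration counts are assumptions, not the original test.
class RegexTiming {
    static final String BASE = "I have a product that is the product of my hard work."
            + "Products are always nice, because I can win cash if I sell my products.\n"
            + "The product of me making my product is cash, because cash is the product of selling my product.\n"
            + "With the cash I win with my product, I can buy other people's products.";

    // Counts every match of p in input.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Runs an untimed warm-up, then returns the average nanoseconds per timed run.
    static long averageNanos(Pattern p, int warmupPasses, int runs, int passesPerRun) {
        long sink = 0; // accumulated so the matching work cannot be optimised away
        for (int i = 0; i < warmupPasses; i++) {
            sink += countMatches(p, BASE);
        }
        long total = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < passesPerRun; i++) {
                sink += countMatches(p, BASE);
            }
            total += System.nanoTime() - start;
        }
        if (sink == 0) {
            throw new IllegalStateException("pattern never matched");
        }
        return total / runs;
    }

    public static void main(String[] args) {
        Pattern flagged = Pattern.compile("products?", Pattern.CASE_INSENSITIVE);
        Pattern bracketed = Pattern.compile("[Pp]roducts?");
        System.out.printf("CASE_INSENSITIVE: %d ns per run%n", averageNanos(flagged, 20_000, 5, 20_000));
        System.out.printf("[Pp]roducts?:     %d ns per run%n", averageNanos(bracketed, 20_000, 5, 20_000));
    }
}
```

Running the two variants in both orders, and on the JVM version you actually deploy, helps show whether the gap is stable or an artifact of the setup.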
