java - 先読みによる正規表現のパフォーマンス/速度の低下

Question

私は、いくつかの先読みを伴う式で非常に遅い実行時間を観察してきました。

これは基礎となるデータ構造によるものだと思いますが、かなり極端なようで、何か間違ったことをしたのか、それとも既知の回避策があるのか疑問に思っています.

問題は、一連の単語が任意の順序で文字列に存在するかどうかを判断することです。たとえば、「term1」と「term2」という 2 つの用語が文字列のどこかにあるかどうかを調べたいとします。私は式でこれを行います：

(?=.*\bterm1\b)(?=.*\bterm2\b)

しかし、私が観察したことは、これは最初にチェックするよりも桁違いに遅いということです

\bterm1\b

そしてちょうどその時

\bterm2\b

これは、先読みのある単一のパターンではなく、パターンの配列を使用する必要があることを示しているようです...これは正しいですか? それは間違っているようです...

テストコードの例と結果の時間は次のとおりです。

public static void speedLookAhead() {
    Matcher m, m1, m2;
    boolean find;
    int its = 1000000;

    // create long non-matching string
    char[] str = new char[2000];
    for (int i = 0; i < str.length; i++) {
      str[i] = 'x';
    }
    String test = str.toString();

    // First method: use one expression with lookaheads
    m = Pattern.compile("(?=.*\\bterm1\\b)(?=.*\\bterm2\\b)").matcher(test);
    long time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m.reset(test);
      find = m.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);

    // Second method: use two expressions and AND the results
    m1 = Pattern.compile("\\bterm1\\b").matcher(test);
    m2 = Pattern.compile("\\bterm2\\b").matcher(test);
    time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m1.reset(test);
      m2.reset(test);
      find = m1.find() && m2.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);
  }

これは私のコンピューターに出力されます：

1754
150

score 2 · Accepted Answer

これで時間が短縮できるかもしれません

よく深い
([AB]).*(?!\1)[AB]

貪欲でない
([AB]).*?(?!\1)[AB]

やり直す

私はこの問題について自分のベンチを作りました。1 つの正規表現で 2 つの用語を照合するのではなく、一度に 1 つの用語を照合すると、バックトラッキング/term/
が行われないため、常に時間がかかりません
。strncmp(term) と同じくらい簡単です。そして、2つの項を別々に行うと、
はるかに高速になります。

重複の可能性がないように用語を定義できる場合は、これが
道です。すなわち; /term1/ && /term2/.

バックトラッキングを呼び出さずに用語を単一の正規表現に結合する方法はありません。

つまり、オーバーラップを実際に気にするのであれば、
バックトラッキングを最小限に抑えるテクニックがあります。

/(?=.*A)(?=.*B)/ は /A/ && /B/ とまったく同じですが、どちらもオーバーラップを考慮していないため、かなり遅くなります。

したがって、実際にオーバーラップを気にする場合 (そして、私は強くお勧めします)、
最大限の効率を得るために組み合わせることができる 2 つの方法があります。

/(A|B) .* (?!\1)(?:A|B)/

また

/A/ && /B/ && /(A|B) .* (?!\1)(?:A|B)/

この最後のものは小さな (相対的な) オーバーヘッドを追加しますが、オーバーラップをチェックする前に、少なくとも A と B が存在する必要があるため、ロジックチェーン内のアクセスを禁止する可能性があります。

また、A と B が文字列のどこに存在するかによって、 /(A|B) .* (?!\1)(?:A|B)/
にも時間がかかる可能性がありますが、それでも最短の方法です。すべてが
平均化されます。

以下は、いくつかのサンプル (考えられるシナリオ) 文字列をベンチマークする Perl プログラムです。

幸運を！

use strict;
use warnings;

use Benchmark ':hireswallclock';
my ($t0,$t1);

my ($term1, $term2) = ('term','m2a');
my  @samples = (
   ' xaaaaaaa term2ater  ',
   ' xaaaaaaa term2aterm ',
   ' xaaaaaaa ter2ater  ',
   ' Aaa term2ater ' . 'x 'x100 . 'xaaaaaaa mta ',
   ' Baa term      ' . 'x 'x100 . 'xaaaaaaa mta ',
   ' Caa m2a       ' . 'x 'x100 . 'xaaaaaaa term ',
   ' Daa term2a       ' . 'x 'x100 . 'xaaaaaaa term ',
);

my $rxA  = qr/$term1/;
my $rxB  = qr/$term2/;
my $rxAB = qr/ ($term1|$term2) .* (?!\1)(?:$term1|$term2) /x;


for (@samples)
{
    printf "Checking string:  '%.40s'\n-------------\n", $_;

    if (/$term1/ && /$term2/ ) {
       print "  Found possible candidates (A && B)\n";
    }
    if (/ ($term1|$term2) .* ((?!\1)(?:$term1|$term2)) /x) {
       print "  Found non-overlaped terms: '$1'  '$2'\n";
    }
    else {
       print "  No (A|B) .* (?!\\1)(A|B) terms found!\n";
    }
    print "\n   Bench\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxA/  &&  /$rxB/;
    }
    $t1 = new Benchmark;
    print "    $rxA && $rxB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxAB/;
    }
    $t1 = new Benchmark;
    print "    $rxAB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

    $t0 = new Benchmark;
    for my $cnt (1 .. 500_000) {
       /$rxA/  &&  /$rxB/ && /$rxAB/;
    }
    $t1 = new Benchmark;
    print "    $rxA && $rxB &&\n    $rxAB\n    -took: ", timestr(timediff($t1, $t0)), "\n\n";

}

出力

Checking string:  ' xaaaaaaa term2ater  '
-------------
  Found possible candidates (A && B)
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.46875 wallclock secs ( 1.47 usr +  0.00 sys =  1.47 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.3748 wallclock secs ( 3.34 usr +  0.00 sys =  3.34 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 5.0623 wallclock secs ( 5.06 usr +  0.00 sys =  5.06 CPU)

Checking string:  ' xaaaaaaa term2aterm '
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.48403 wallclock secs ( 1.49 usr +  0.00 sys =  1.49 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.89044 wallclock secs ( 3.89 usr +  0.00 sys =  3.89 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 5.40607 wallclock secs ( 5.38 usr +  0.00 sys =  5.38 CPU)

Checking string:  ' xaaaaaaa ter2ater  '
-------------
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 0.765321 wallclock secs ( 0.77 usr +  0.00 sys =  0.77 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 1.29674 wallclock secs ( 1.30 usr +  0.00 sys =  1.30 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 0.874842 wallclock secs ( 0.88 usr +  0.00 sys =  0.88 CPU)

Checking string:  ' Aaa term2ater x x x x x x x x x x x x x'
-------------
  Found possible candidates (A && B)
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.46842 wallclock secs ( 1.47 usr +  0.00 sys =  1.47 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 28.078 wallclock secs (28.08 usr +  0.00 sys = 28.08 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 29.4531 wallclock secs (29.45 usr +  0.00 sys = 29.45 CPU)

Checking string:  ' Baa term      x x x x x x x x x x x x x'
-------------
  No (A|B) .* (?!\1)(A|B) terms found!

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.68716 wallclock secs ( 1.69 usr +  0.00 sys =  1.69 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 15.1563 wallclock secs (15.16 usr +  0.00 sys = 15.16 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 1.64033 wallclock secs ( 1.64 usr +  0.00 sys =  1.64 CPU)

Checking string:  ' Caa m2a       x x x x x x x x x x x x x'
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.62448 wallclock secs ( 1.63 usr +  0.00 sys =  1.63 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 3.0154 wallclock secs ( 3.02 usr +  0.00 sys =  3.02 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 4.56226 wallclock secs ( 4.56 usr +  0.00 sys =  4.56 CPU)

Checking string:  ' Daa term2a       x x x x x x x x x x x '
-------------
  Found possible candidates (A && B)
  Found non-overlaped terms: 'm2a'  'term'

   Bench
    (?-xism:term) && (?-xism:m2a)
    -took: 1.45252 wallclock secs ( 1.45 usr +  0.00 sys =  1.45 CPU)

    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 16.1404 wallclock secs (16.14 usr +  0.00 sys = 16.14 CPU)

    (?-xism:term) && (?-xism:m2a) &&
    (?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
    -took: 17.6719 wallclock secs (17.67 usr +  0.00 sys = 17.67 CPU)

score 1 · Accepted Answer

各ループを別々のメソッドに配置する必要があります。テストの順序を入れ替えると、異なる結果が得られます。

test.indexOf('A') >= 0 && test.indexOf('B') >= 0これははるかに高速になる可能性があると想像しているので、これと比較できますか?

score 1 · Accepted Answer

投稿した正規表現

(?=.\A\b)(?=.\B\b)

コード内のものと一致しません

.(?=.*B)(?=.*A)

実際、最初の正規表現は、おそらく一致しないようです。

一致する必要があるものと一致しないもののサンプル入力をいくつか教えてください。

これは、説明したコードの正規表現です。

Match any single character that is not a line break character «.»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*B)»
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “B” literally «B»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*A)»
   Match any single character that is not a line break character «.*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
   Match the character “A” literally «A»

java - 先読みによる正規表現のパフォーマンス/速度の低下

3 に答える 3

Related

Reference