regex - 辞書ハッシュからの正規表現の単純な置換ドキュメント (Perl)

Question

大きなドキュメントのハッシュからキーワードをできるだけ早く見つけて置き換える必要があります。私は以下の 2 つの方法にうんざりしています。

辞書ハッシュに存在するキーワードのみを置き換え、存在しないキーワードを保持して、辞書にないことを知りたいという考えです。

以下の両方の方法は、2回スキャンして、私が思うように見つけて置き換えます。ルックアヘッドまたはビハインドのような正規表現は、はるかに高速に最適化できると確信しています。

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(:all);

my %dictionary = (
            pollack => "pollard",
            polynya => "polyoma",
            pomaces => "pomaded",
            pomades => "pomatum",
            practic => "praetor",
            prairie => "praised",
            praiser => "praises",
            prajnas => "praline",
            quakily => "quaking",
            qualify => "quality",
            quamash => "quangos",
            quantal => "quanted", 
            quantic => "quantum",
    );

my $content =qq{
        Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
        can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
        Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
        one kind of problem {qualify} {doesNotExist} end.
    };

# just duplicate content many times
$content .= $content;

cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});

print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;

sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}

ベンチマーク結果は次のとおりです。

              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --

score 3 · Accepted Answer

事前に任意のハッシュキーに一致する正規表現を作成しておくと役立ちます。このような

my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;

sub replacer4 {
    my ($string) = @_;
    $string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
    $string;
}

これらの結果で

              Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2  4883/s         --       -80%       -84%       -85%
replacer_1 24877/s       409%         --       -18%       -22%
replacer_3 30385/s       522%        22%         --        -4%
replacer_4 31792/s       551%        28%         5%         --

また、毎回追加するのではなく、ハッシュに中かっこと大かっこを追加できれば改善されます。

score 3 · Accepted Answer

少し高速でコンパクトな方法を次に示します。

sub replacer3 {
    my ($content) = shift;
    $content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
    return $content;
}

Perl 5.8 では、辞書の値が "false" でない場合||に代わりに使用しても問題ありません。//

中括弧と大括弧が既に含まれている辞書を使用することで得られるものも少しあります。

sub replacer5 {
    my ($content) = shift;
    our %dict2;
    if (!%dict2) {
        %dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
    }
    $content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
    return $content;
}

ベンチマーク結果:

              Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2  2908/s         --       -79%       -83%       -84%
replacer_1 14059/s       383%         --       -20%       -25%
replacer_3 17513/s       502%        25%         --        -7%
replacer_5 18741/s       544%        33%         7%         --

score 2 · Accepted Answer

ベンチマークサブルーチンには意味のある名前を使用することをお勧めします。これにより、出力と意図がより明確になります。

以下は、ボロディンと暴徒が試したことを少し再現し、それらを組み合わせたものです。

#!/usr/bin/perl

use strict;
use warnings;
use feature 'state';

use Benchmark qw(:all);

# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};

# Quadruple Content
$content = $content x 4;

cmpthese(100_000, {
    original        => sub { my $text = original($content) },
    build_list      => sub { my $text = build_list($content) },
    xor_regex       => sub { my $text = xor_regex($content) },
    list_and_xor    => sub { my $text = list_and_xor($content) },
});

exit;

sub original {
    my $content = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub build_list {
    my $content = shift;
    state $list = join '|', map quotemeta, keys %dictionary;
    $content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
    return $content;
}

sub xor_regex {
    my $content = shift;

    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

sub list_and_xor {
    my $content = shift;

    state $list = join '|', map quotemeta, keys %dictionary;
    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted 
quantic quantum

Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.

出力:

                Rate     original    xor_regex   build_list list_and_xor
original     19120/s           --         -23%         -24%         -29%
xor_regex    24938/s          30%           --          -1%          -8%
build_list   25253/s          32%           1%           --          -7%
list_and_xor 27027/s          41%           8%           7%           --

私のソリューションでは、変数を多用して、state静的データ構造の再初期化を回避しています。ただし、クロージャまたはour $var; $var ||= VAL.

正規表現の LHS の強化に関する補遺

実際、明示的なリストを使用するように LHS を編集すると、正規表現が改善されます。そして、この変更により、速度が 30% 向上しました。

これに対する魔法の解決策はおそらくありません。置き換えたい値のリストがあります。この目標の言葉を単純化する不思議な方法があるわけではありません。

単語が辞書ハッシュに存在しない場合は、LHS でコードブロックを使用して失敗し、スキップすることができます。ただし、以下は、これが実際には元の方法よりも 36% 遅いことを示しています。

sub skip_fail {
    my $content = shift;

    $content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;

    return $content;
}

出力:

                Rate   skip_fail    original   xor_regex build_list list_and_xor
skip_fail     6769/s          --        -36%        -46%       -49%         -53%
original     10562/s         56%          --        -16%       -21%         -27%
xor_regex    12544/s         85%         19%          --        -6%         -14%
build_list   13355/s         97%         26%          6%         --          -8%
list_and_xor 14537/s        115%         38%         16%         9%           --

regex - 辞書ハッシュからの正規表現の単純な置換ドキュメント (Perl)

3 に答える 3

正規表現の LHS の強化に関する補遺

Related

Reference