algorithm - Is there a Boyer-Moore string search, a fast search-and-replace function, and a fast string count for Delphi 2010 String (UnicodeString)?

Question

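I need three functions that are fast on large strings: fast search, fast search and replace, and fast counting of substrings in a string.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore implementation I have found that is used for fast search and replace is part of FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email no longer work.

I have already ported FastStrings forward so that it works well for AnsiString in Delphi 2009/2010, where one byte equals one AnsiChar, but making it also work with String (UnicodeString) in Delphi 2010 appears to be non-trivial.

With a Boyer-Moore algorithm it should be possible to do case-insensitive searches, and case-insensitive search and replace, easily, without any temporary string (from StrUpper and the like) and without calling Pos(), which is slower than a Boyer-Moore search when the same text has to be searched repeatedly.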

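(Edit: I have a partial solution, written as an answer to this question. It is almost 100% complete and even has a fast string replace function. I believe it MUST have bugs. In particular, since it pretends to be Unicode capable, there are probably glitches caused by unfulfilled Unicode promises.)

(Edit2: An interesting and unexpected result. The large stack size of a Unicode code-point table on the stack (SkipTable in the code below) puts a serious damper on the amount of win-win optimization you can do in a Unicode string Boyer-Moore search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)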

score 12 · Accepted Answer

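This answer is now complete and works in case-sensitive mode, but it does not work in case-insensitive mode, and it probably has other bugs too, since it has not been well unit tested. It could probably be optimized further as well. For example, I repeated the local function __SameChar in each routine instead of using a comparison function callback, which would have been faster. In fact, letting the user pass in a comparison function for all of these would be great for Unicode users who want to supply some extra logic (for example, sets of Unicode glyphs that are equivalent in some languages).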
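One reason case-insensitive mode cannot work with the code below is that SkipTable is built from, and indexed by, the exact characters: a text character that differs from the pattern only in case gets the default skip and is never compared. A callback that folds a character to a canonical form can be applied both when building the table and when looking it up. Here is a minimal sketch of that idea, not a drop-in fix. It uses the simpler Horspool variant of Boyer-Moore rather than the routine below, the names TCharFold, FoldNone, FoldAsciiUpper and FindFolded are made up for illustration, and the folding shown is ASCII-only by assumption.

```pascal
program FoldedSearch;
{$APPTYPE CONSOLE}

type
  // Hypothetical callback: maps a character to its canonical form.
  TCharFold = function(C: Char): Char;

// Identity folding: gives an ordinary case-sensitive search.
function FoldNone(C: Char): Char;
begin
  Result := C;
end;

// ASSUMPTION: ASCII-only case folding; real Unicode folding needs more.
function FoldAsciiUpper(C: Char): Char;
begin
  if (C >= 'a') and (C <= 'z') then
    Result := Char(Ord(C) - 32)
  else
    Result := C;
end;

// Boyer-Moore-Horspool search; returns the 1-based position or 0.
function FindFolded(const S, Pattern: string; Fold: TCharFold;
  FromPos: Integer = 1): Integer;
var
  Skip: array of Integer; // dynamic array, so the table lives on the heap
  I, K, PatLen, StrLen, Last: Integer;
begin
  Result := 0;
  PatLen := Length(Pattern);
  StrLen := Length(S);
  if (PatLen = 0) or (FromPos < 1) then
    Exit;
  SetLength(Skip, Ord(High(Char)) + 1);
  for I := 0 to High(Skip) do
    Skip[I] := PatLen;
  // Index the table by the FOLDED pattern characters.
  for I := 1 to PatLen - 1 do
    Skip[Ord(Fold(Pattern[I]))] := PatLen - I;
  Last := FromPos + PatLen - 1; // text index aligned with the last pattern char
  while Last <= StrLen do
  begin
    K := 0;
    while (K < PatLen) and (Fold(S[Last - K]) = Fold(Pattern[PatLen - K])) do
      Inc(K);
    if K = PatLen then
    begin
      Result := Last - PatLen + 1;
      Exit;
    end;
    // Look the skip up by the FOLDED text character as well.
    Inc(Last, Skip[Ord(Fold(S[Last]))]);
  end;
end;

begin
  Writeln(FindFolded('Hello World', 'WORLD', FoldAsciiUpper)); // prints 7
  Writeln(FindFolded('Hello World', 'WORLD', FoldNone));       // prints 0
end.
```

A folding callback fits the skip table better than a pairwise "are these two equal" comparer, because the table needs a single index per equivalence class of characters.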

Based on Dorin Duminica's code, I built the following.

{ _FindStringBoyer:
  Boyer-Moore search algorithm using regular String instead of AnsiString, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
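  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  Result := 0;
  if LengthPattern = 0 then
    Exit; // guard: sPattern[LengthPattern] below would be out of range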
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  Result := 0;
  if aFindString = '' then
    Exit; // guard: aFindString[LengthPattern] below would be out of range
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  Index := 0; // a for-loop variable is undefined after the loop, so reset it
  while (foundPos>=1) and (fromPos <= Limit) and (Index <= Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found: count it and resume right after the end of this match.
        // (Index - kIndex + 1) is where the match starts; resuming from
        // Index + LengthPattern would skip adjacent matches such as 'abab'.
        fromPos := (Index - kIndex + 1) + LengthPattern;
        Index := Index + LengthPattern;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end;

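This is a really nice algorithm, because:

It is far faster to count instances of substring X in string Y this way.
Merely as a replacement for Pos(), _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by the FastCode project people, which is what Pos currently uses. And if you need case insensitivity, you can imagine the performance boost when you don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be that big. Still, efficient algorithms are a thing of beauty.)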
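To check the counting routine, or to time it against something, a straightforward baseline helps. The sketch below counts non-overlapping, case-sensitive occurrences with PosEx from StrUtils. The name CountWithPosEx is made up for illustration, and no timing figures are claimed here.

```pascal
program CountBaseline;
{$APPTYPE CONSOLE}

uses
  StrUtils;

// Hypothetical baseline: counts non-overlapping, case-sensitive
// occurrences of Sub in S by calling PosEx repeatedly.
function CountWithPosEx(const S, Sub: string): Integer;
var
  P: Integer;
begin
  Result := 0;
  if Sub = '' then
    Exit; // an empty pattern has no meaningful count
  P := PosEx(Sub, S, 1);
  while P > 0 do
  begin
    Inc(Result);
    // Resume right after the match that was just found.
    P := PosEx(Sub, S, P + Length(Sub));
  end;
end;

begin
  Writeln(CountWithPosEx('abab', 'ab')); // prints 2
end.
```

Running the same inputs through _StringCountBoyer and comparing the two results is a cheap way to catch off-by-one errors around adjacent matches and matches at the very end of the string.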

Okay, I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
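  Result := '';
  if aFindString = '' then begin
    // nothing to search for; also avoids indexing aFindString[0] below
    Result := aSourceString;
    Exit;
  end;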
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := not (rfIgnoreCase in Flags); // rfIgnoreCase means NOT case sensitive
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  Index := 0; // a for-loop variable is undefined after the loop, so reset it
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern <= Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;
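Since the replace routine is admittedly not well unit tested, a small comparison against SysUtils.StringReplace is one way to look for differences. The sketch below is only a starting point. TReplaceFunc, ReferenceReplace and CountMismatches are names made up for illustration, and the handful of inputs is an assumption about which edge cases matter (a one-character tail, adjacent matches, no match, an empty source).

```pascal
program ReplaceCheck;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Same parameter shape as SysUtils.StringReplace and _StringReplaceBoyer.
  TReplaceFunc = function(const S, OldPattern, NewPattern: string;
    Flags: TReplaceFlags): string;

// Reference implementation: a thin wrapper around the RTL routine.
function ReferenceReplace(const S, OldPattern, NewPattern: string;
  Flags: TReplaceFlags): string;
begin
  Result := StringReplace(S, OldPattern, NewPattern, Flags);
end;

// Returns how many checks make Candidate disagree with the reference.
function CountMismatches(Candidate: TReplaceFunc): Integer;
const
  Sources: array[0..4] of string =
    ('abX', 'abab', 'xxabxx', 'no match here', '');
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Sources) to High(Sources) do
  begin
    // Replace every occurrence.
    if Candidate(Sources[I], 'ab', '-', [rfReplaceAll]) <>
       ReferenceReplace(Sources[I], 'ab', '-', [rfReplaceAll]) then
      Inc(Result);
    // Replace only the first occurrence.
    if Candidate(Sources[I], 'ab', '-', []) <>
       ReferenceReplace(Sources[I], 'ab', '-', []) then
      Inc(Result);
  end;
end;

begin
  // Sanity check of the harness itself: the reference agrees with itself.
  Writeln(CountMismatches(ReferenceReplace)); // prints 0
end.
```

With the routine above in the same project, CountMismatches(_StringReplaceBoyer) reports how many of these checks differ from the RTL result. Adding rfIgnoreCase cases would exercise the mode this answer says is not working yet.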

Okay, the problem: the stack footprint of the functions above:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

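Goodbye CPU hell, hello stack hell. If I go for a dynamic array, then I have to size it at runtime. So this thing is basically fast, because the virtual memory system on your computer doesn't blink at 256K going on the stack, but that does not make it an optimal piece of code in every situation. My PC doesn't blink at big stack usage like this, but with that kind of footprint it is not going to become a Delphi standard library default or win any FastCode challenge in the future.

I think repeated searches are a case where the above code should be written as a class, with the skip table as a data field of that class. Then you can build the Boyer-Moore table once and, as long as the string it was built from does not change, use that object repeatedly for fast lookups.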
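The class idea could look roughly like this. It is a sketch under assumptions, not the code from this answer: it uses the simpler Horspool variant of Boyer-Moore (without the Large sentinel trick, so the table depends only on the pattern), it is case-sensitive only, and the name TBMHSearcher and its methods are made up for illustration. The skip table is a dynamic array field, so its 65536 entries live on the heap and are filled once per pattern.

```pascal
program SearcherDemo;
{$APPTYPE CONSOLE}

type
  // Hypothetical class: builds the skip table once per pattern.
  TBMHSearcher = class
  private
    FPattern: string;
    FSkip: array of Integer; // heap-allocated, one entry per Char value
  public
    constructor Create(const APattern: string);
    function Find(const S: string; FromPos: Integer = 1): Integer;
    function Count(const S: string): Integer;
  end;

constructor TBMHSearcher.Create(const APattern: string);
var
  I: Integer;
begin
  inherited Create;
  FPattern := APattern;
  SetLength(FSkip, Ord(High(Char)) + 1);
  for I := 0 to High(FSkip) do
    FSkip[I] := Length(FPattern);
  for I := 1 to Length(FPattern) - 1 do
    FSkip[Ord(FPattern[I])] := Length(FPattern) - I;
end;

// Returns the 1-based position of the first match at or after FromPos, or 0.
function TBMHSearcher.Find(const S: string; FromPos: Integer): Integer;
var
  PatLen, Last, K: Integer;
begin
  Result := 0;
  PatLen := Length(FPattern);
  if (PatLen = 0) or (FromPos < 1) then
    Exit;
  Last := FromPos + PatLen - 1; // text index aligned with the last pattern char
  while Last <= Length(S) do
  begin
    K := 0;
    while (K < PatLen) and (S[Last - K] = FPattern[PatLen - K]) do
      Inc(K);
    if K = PatLen then
    begin
      Result := Last - PatLen + 1;
      Exit;
    end;
    Inc(Last, FSkip[Ord(S[Last])]);
  end;
end;

// Counts non-overlapping matches, reusing the prebuilt table for each search.
function TBMHSearcher.Count(const S: string): Integer;
var
  P: Integer;
begin
  Result := 0;
  P := Find(S, 1);
  while P > 0 do
  begin
    Inc(Result);
    P := Find(S, P + Length(FPattern));
  end;
end;

var
  Searcher: TBMHSearcher;
begin
  Searcher := TBMHSearcher.Create('ab');
  try
    Writeln(Searcher.Count('abab')); // prints 2
  finally
    Searcher.Free;
  end;
end.
```

The cost of filling 65536 entries is paid once in the constructor instead of on every call, which matters most when the same pattern is searched for many times.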

score 2 · Accepted Answer

I was just looking for the same thing: the Jedi JCL has a Unicode-aware search engine that uses Boyer-Moore, in jclUnicode.pas. I have no idea yet how good or how fast it is.
