c++ - C/C++で大文字と小文字を区別しない部分文字列検索を行う最速の方法は?

Question

ノート

以下の質問は、2003 年の一部のコードについて 2008 年に尋ねられました。OP の更新が示すように、この投稿全体は 2008 年のビンテージアルゴリズムによって廃止されており、歴史的な好奇心としてのみここに残っています。

C/C++ で、大文字と小文字を区別しない部分文字列検索を高速に行う必要があります。私の要件は次のとおりです。

strstr() のように動作する必要があります (つまり、一致点へのポインターを返します)。
大文字と小文字を区別しない必要があります (doh)。
現在のロケールをサポートする必要があります。
Windows (MSVC++ 8.0) で利用できるか、Windows に簡単に移植できる (つまり、オープンソースライブラリから) 必要があります。

これが私が使用している現在の実装です（GNU Cライブラリから取得）：

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig (anders@wincue.org)
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
        register const unsigned char *rhaystack, *rneedle;

        do
        {
          a = *++haystack;
          if (a == '\0')
              goto ret0;
          if (my_tolower (a) == (int) b)
              break;
          a = *++haystack;
          if (a == '\0')
              goto ret0;
        shloop:
          ;
        }
      while (my_tolower (a) != (int) b);

jin:      
      a = *++haystack;
      if (a == '\0')
          goto ret0;

        if (my_tolower (a) != (int) c)
          goto shloop;

        rhaystack = haystack-- + 1;
        rneedle = needle;

        a = my_tolower (*rneedle);

        if (my_tolower (*rhaystack) == (int) a)
          do
          {
              if (a == '\0')
                goto foundneedle;

              ++rhaystack;
          a = my_tolower (*++needle);
              if (my_tolower (*rhaystack) != (int) a)
                break;

          if (a == '\0')
                goto foundneedle;

          ++rhaystack;
              a = my_tolower (*++needle);
          }
          while (my_tolower (*rhaystack) == (int) a);

        needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
          break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}

このコードを高速化できますか、またはより良い実装を知っていますか?

注: GNU C ライブラリにの新しい実装が追加されたstrstr()ことに気付きましたが、大文字と小文字を区別しないように簡単に変更できるかどうか、または実際に古いライブラリよりも高速であるかどうかはわかりません (私の場合)。また、古い実装がまだワイド文字列に使用されていることに気付いたので、理由を知っている人は共有してください。

アップデート

明確にするために (まだ作成していない場合に備えて)、この関数は作成していません。これは GNU C ライブラリの一部です。大文字と小文字を区別しないように変更しただけです。

strcasestr()また、他のソース (OpenBSD、FreeBSD など) からの他の実装についてのヒントとチェックアウトに感謝します。それは行く道のようです。上記のコードは 2003 年のものです。そのため、より良いバージョンが利用可能になることを期待してここに投稿しました。:)

score 13 · Accepted Answer

あなたが投稿したコードはの約半分の速さstrcasestrです。

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

main機能は次のとおりです。

int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }


        return 0;
}

両方の実装をテストするために適切に変更されました。これを入力していると、電話に出たままになっていることに気付きましたinit_stristrが、状況はあまり変わらないはずです。bench単純なシェルスクリプトです。

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")

score 3 · Accepted Answer

プラットフォームに依存しない使用の場合:

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
    char ch1, ch2;
    bool bSame;

    while (*cpws1 != L'\0')
    {
        bSame = true;
        if (*cpws1 != *s2)
        {
            ch1 = towlower(*cpws1);
            ch2 = towlower(*s2);

            if (ch1 == ch2)
                bSame = true;
        }

        if (true == bSame)
        {
            cpws1_ = cpws1;
            cpws2 = s2;
            while (*cpws1_ != L'\0')
            {
                ch1 = towlower(*cpws1_);
                ch2 = towlower(*cpws2);

                if (ch1 != ch2)
                    break;

                cpws2++;

                if (*cpws2 == L'\0')
                    return cpws1_-(cpws2 - s2 - 0x01);
                cpws1_++;
            }
        }
        cpws1++;
    }
    return NULL;
}

score 3 · Accepted Answer

ブースト文字列アルゴリズムを使用します。クロスプラットフォームで利用でき、ヘッダーファイルのみです (リンクするライブラリはありません)。とにかくブーストを使用する必要があることは言うまでもありません。

#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   iterator_range<char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();

   return NULL;
}

score 2 · Accepted Answer

なぜ _strlwr(string); を使用するのですか? init_stristr() で? 標準機能ではありません。おそらくロケールのサポートのためですが、標準ではないため、次を使用します。

char_table[i] = tolower(i);

score 1 · Accepted Answer

既に存在する一般的な strcasestr 実装のいくつかを使用することをお勧めします。たとえば、glib、glibc、OpenBSD、FreeBSD などです。google.com/codesearch でさらに検索できます。次に、いくつかのパフォーマンス測定を行い、異なる実装を比較できます。

score 1 · Accepted Answer

両方の入力文字列が既に小文字であると仮定します。

int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
    int iTextSize = strlen(p_cText);
    int iSearchTextSize = strlen(p_cSearchText);

    char* p_cFound = NULL;

    if(iTextSize >= iSearchTextSize)
    {
        int iCounter = 0;
        while((iCounter + iSearchTextSize) <= iTextSize)
        {
            if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
                return  iCounter;
            iCounter ++;
        }
    }

    return -1;
}

また、マスクを使用してみてください...たとえば、比較しようとしているほとんどの文字列に a から z までの文字しか含まれていない場合は、このようなことをする価値があるかもしれません。

long GetStringMask(const char* p_cText)
{
    long lMask=0;

    while(*p_cText != '\0')
    {       
        if (*p_cText>='a' && *p_cText<='z')
            lMask = lMask | (1 << (*p_cText - 'a') );
        else if(*p_cText != ' ')
        {
            lMask = 0;
            break;      
        }

        p_cText ++;
    }
    return lMask;
}

それで...

int main(int argc, char* argv[])
{

    char* p_cText = "this is a test";   
    char* p_cSearchText = "test";

    long lTextMask = GetStringMask(p_cText);
    long lSearchMask = GetStringMask(p_cSearchText);

    int iFoundAt = -1;
    // If Both masks are Valid
    if(lTextMask != 0 && lSearchMask != 0)
    {
        if((lTextMask & lSearchMask) == lSearchMask)
        {       
             iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
        }
    }
    else
    {
        iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
    }


    return 0;
}

score 0 · Accepted Answer

CPU サイクルを削減したい場合は、これを検討してください。Unicode ではなく ASCII を扱っていると仮定しましょう。

256 エントリの静的テーブルを作成します。テーブルの各エントリは 256 ビットです。

2 つの文字が等しいかどうかをテストするには、次のようにします。

if (BitLookup(table[char1], char2)) { /* match */ }

テーブルを作成するには、table[char1] のあらゆる場所にビットを設定し、char2 に一致すると見なします。したがって、テーブルを作成する際に、'a' 番目のエントリ (および 'A' 番目のエントリ) の 'a' と 'A' のインデックスにビットを設定します。

これは、ビットルックアップを実行するのが遅くなるため (ビットルックアップは、シフト、マスク、および加算である可能性が最も高い)、代わりにバイトテーブルを使用して、1 ビットを表すのに 8 ビットを使用することができます。これには 32K かかります - 万歳 - 時間と空間のトレードオフにぶつかりました! テーブルをより柔軟にしたいかもしれないので、代わりにこれを行うとしましょう - テーブルは代わりに合同を定義します。

2 つの文字は、同等であると定義する関数が存在する場合に限り、合同であると見なされます。したがって、'A' と 'a' は大文字と小文字を区別しないため合同です。「A」、「À」、「Á」、および「Â」は、分音記号を区別しないために合同です。

したがって、合同性に対応するビットフィールドを定義します

#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)

次に、テストは次のようになります。

inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
    return (_congruencyTable[c1][c2] & congruency) != 0;
}

#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)

この種の巨大なテーブルを少しいじるのは、ctype の核心です。

score 0 · Accepted Answer

常に小文字になるように針の文字列を制御できる場合は、変更されたバージョンの stristr() を記述して、その検索を回避し、コードを高速化できます。それほど一般的ではありませんが、より高速になる可能性があります - わずかに高速です。同様のコメントが干し草の山にも当てはまりますが、データが要件を満たしていると確信できないため、管理外のソースから干し草の山を読み取る可能性が高くなります。

パフォーマンスの向上がそれだけの価値があるかどうかは、まったく別の問題です。アプリケーションの 99% について、答えは「いいえ、価値がありません」です。あなたのアプリケーションは、それが重要な少数派の 1 つかもしれません。そうではない可能性が高いです。

c++ - C/C++で大文字と小文字を区別しない部分文字列検索を行う最速の方法は?

ノート

10 に答える 10

Related

Reference