c - Cで（いくつかの）文字列を照合する最も効率的な方法は？

Question

私たちのシステムは、端末からのユーザー入力を受け入れ、いくつかの既知のキーワード文字列（おそらく10）と照合する必要があります。

正規表現などを実行するためのスペース/コンピューターがありません。コードは小さくて高速である必要があります。

さて、これを行うための厄介な方法は次のとおりです。

   // str is null-terminated, assume we know it's safe/sane here
   if(!strncmp(str,"hello",5)
   {
      do_hello();
   }
   else if(!strncmp(str,"world",5)
   {
      do_world();
   }
   else
   {
      meh(); // Wasn't a match
   }

したがって、少しグーグルして読んだ後、さまざまな一致のハッシュをintとして事前に計算してから、caseステートメントを使用する方が良い方法であると確信しています。

// Assume hash() stops at NULL
switch(hash(str))
{
   case HASH_OF_HELLO:
      do_hello();
      break;

   case HASH_OF_WORLD:
      do_world();
      break;

   default:
      meh();
      break;
}

コンパイル時に*HASH_OF_match*を計算できます。これは、比較的小さなセットから文字列を選択するための、より高速でエレガントな方法のようです。

だから-これは合理的に見えますか？/これを行うことで明白な問題がありますか？/誰かがそれを行うためのよりエレガントな方法を手に入れましたか？

脚注として、これは私が今日の午後に見た中で最も見栄えの良いハッシュアルゴリズムです;）、dan bernsteinの功績によるもので、目前の仕事を尊敬しています。

unsigned int
get_hash(const char* s)
{
    unsigned int hash = 0;
    int c;

    while((c = *s++))
    {
        // hash = hash * 33 ^ c 
        hash = ((hash << 5) + hash) ^ c;
    }

    return hash;
}

score 5 · Accepted Answer

ハッシュの問題は、ユーザーが入力した任意の文字列が、一致するものの1つと同じハッシュを生成し、間違ったものを実行する可能性があることです。10程度の小さな検索セットの場合、私はこのif-elseアプローチに固執します。または、文字列配列と関数ポインタ配列（すべての関数が同じシグネチャを持っていると仮定）を使用して、実行する関数を選択します。

char const *matches[10] = {"first", "second", ..., "tenth"};
void (*fn[10])(void) = {&do_first, &do_second, ..., &do_tenth};

for( i = 0; i < 10; ++i ) {
  if( strcmp( str, matches[i] ) == 0 ) {
    (*fn[i])();
  }
}

score 3 · Accepted Answer

Boyer-Moore文字列検索アルゴリズムのように、最後の文字にネストされたswitchステートメントを使用するのはどうですか？

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

score 1 · Accepted Answer

1

gperfを使いたいようですね。

于 2012-09-04T18:39:26.090 に答える

score 1 · Accepted Answer

ハッシュテーブルとハッシュテーブルは、大量のデータに最適です。入力文字列の数は既知で制限されているため、おそらくこのアプローチを検討できます。

既知の文字列が

const char* STR_TABLE [STR_N] =
{
  "hello",
  "world",
  "this",
  "is",
  "a",
  "number",
  "of",
  "ten",
  "test",
  "strings"
};

次に、コンパイルする前に、アルファベット順に手動で並べ替えることができます。これは、並べ替えられたテーブルを使用すると、検索がはるかに高速になるためです。その後、二分探索を使用できます。

#include <stdio.h>
#include <stdlib.h>

#define STR_N 10


const char* STR_TABLE [STR_N] =
{
  "a",
  "hello",
  "is",
  "number",
  "of",
  "strings",
  "ten",
  "test",
  "this",
  "world"
};


int ptr_strcmp (const void* str1, const void* str2)
{
  return strcmp(str1, *(const char**)str2);
}

int main()
{
  const char* user_input = "world"; // worst case
  const char** result;

  result = bsearch (user_input,
                    STR_TABLE,
                    STR_N,
                    sizeof(const char*),
                    ptr_strcmp);

  if(result != NULL)
  {
    printf("%s\n", *result);
  }
  else
  {
    printf("meh\n");
  }

}

これは要約すると次のようになります。

「world」を「of」と比較し、1つの比較'w'！='o'。

「world」を「test」と比較し、1つの比較'w'！='t'。

「world」と「this」を比較します。1つの比較'w'！='t'。

「世界」と「世界」を比較します。5つの比較。

比較の総数は8です。

もちろん、これにはいくつかのオーバーヘッドコードが含まれ、「\0」とバイナリ検索呼び出しをチェックします。最適な方法を見つけるには、特定のプラットフォームで提案されているさまざまな方法を測定する必要があります。

score 0 · Accepted Answer

多分解決策はこれかもしれません：

struct keyword {
    unsigned int hash;
    const char *str;
    void (*job)();
};

//A table with our keywords with their corresponding hashes. If you could not
//compute the hash at compile time, a simple init() function at the beginning
//of your program could initialize each entry by using the value in 'str'
//You could also implement a dynamic version of this table (linked list of keywords)
//for extending your keyword table during runtime
struct keyword mykeywords[] = {
    {.hash = HASH_OF_HELLO, .str = "hello", .job = do_hello},
    {.hash = HASH_OF_WORLD, .str = "world", .job = do_world},
    ...
    {.str = 0} //signal end of list of keywords

};

void run(const char *cmd)
{
    unsigned int cmdhash = get_hash(cmd);
    struct keyword *kw = mykeywords;
    while(kw->str) {
        //If hash matches then compare the string, since we should consider hashing collisions too!
        //The order of conditions below is important
        if (kw->hash == cmdhash && !strcmp(cmd, kw->str)) { 
             kw->job();
             break;   
        }
        kw++;
    }
}

c - Cで（いくつかの）文字列を照合する最も効率的な方法は？

5 に答える 5

Related

Reference