c - この状況で最小限の完全なハッシュ関数を作成することは可能ですか?

Question

キーと値のペアを格納するために、ハッシュマップ (または提案があれば別の構造) を作成したいと考えています。キーはすべて、マップが作成されると同時に一度に挿入されますが、マップを作成する必要がある実行時まで、キーが何になるか (任意の長さの文字列) はわかりません。

このようなクエリ文字列を解析しています"x=100&name=bob&color=red&y=150"（ただし、文字列には無制限の数の変数を含めることができ、変数には任意の長さの名前を付けることができます）。

私はそれを一度解析してハッシュマップを作成したいと思います。できれば最小限で、線形ストレージ要件を満たすために完全なハッシュ関数を使用します。マップが作成されると、値は変更または削除されず、キーと値のペアもマップに追加されないため、マップ全体が事実上定数になります。文字列内で変数が 2 回出現しないことを前提としています (IE."x=1&x=2"は無効です)。

私はでコーディングしており、現在、 string を返すCように使用できる関数を持っていますが、毎回クエリ文字列を解析するため、時間がかかります。非常に大きなクエリ文字列であり、すべての値が数回読み取られるため、最初にロードされたときに一度解析したいと思います。を使用していますが、答えとしてコードを入力する必要はありません。疑似コード、またはすべての提案は素晴らしいでしょう!get("x")"100"O(n)CC

score 8 · Accepted Answer

GPL のgperf、またはC での Bob Jenkins のパブリックドメイン実装を試してください。

手順：

クエリ文字列を受け取り、キーのリストを列挙して完全なハッシュ関数のドメインを特定する
これらのキーとリストサイズ (範囲は 1..size になります) を、上記の参照実装から派生した完全ハッシュ生成関数に提供します。
生成された完全なハッシュ関数を使用して HashMap を作成します
同じ完全ハッシュ関数を使用getして、HashMap でリクエストを処理します

リファレンス実装は C ソースコードで完全なハッシュ関数を出力するため、代わりに VM のバイトコードのようなものを生成するように変更する必要があると、以下のコメントで Necrolis を編集します。組み込みの Scheme や Lua などの解釈言語を使用することもできます。

完全なハッシュ関数を作成するオーバーヘッドがルックアップで償却される場合、これが単純な (完全ではない) HashMap よりも努力する価値があるかどうかを知ることは興味深いでしょう。

別のオプションは、O(1) ルックアップも持つCuckoo ハッシュです。

score 2 · Accepted Answer

非常に優れたハッシュルーチンがいくつかあります。ただし、そのうちの1つがほぼ完全であることを証明するには、入力に関する多くの知識が必要です。あなたの入力は、そのような証明をほぼ不可能にするほど十分に制約されていないようです。

一般的に言えば、完全な（またはほぼ完全な）ルーチンは、入力の各ビット/バイトに敏感です。速度については、組み合わせ操作は通常XORです。このようなルーチンが2つの同一のバイトが互いに打ち消し合うのを防ぐ方法は、ビットをシフトまたはローテーションすることです。ただし、このようなシフトは、表現できる最大数に対する互いに素な数で行う必要があります。そうしないと、入力のパターンが前の入力によって部分的にキャンセルされる可能性があります。これにより、ソリューションのエントロピーが減少し、衝突の可能性が高くなります。

典型的な解決策は

Start with a number that is a prime (all primes are relative primes)
while (more bytes to be considered) {
  take the next byte of input and multiply it by a second prime
  determine the number of bits that might be lost in a left shift, capture them in a buffer
  shift the bits in the hash "buffer" to the left.
  restore the high order bit(s) in the low position
  take the next byte of input and multiply it by a second prime
  mask the multiplied result into the buffer 
}

そのようなルーチンの問題は知られています。基本的に入力の変動が少ないため、入力の分散は理想的ではありません。とは言うものの、この手法では、最初のプライム開始番号から離れるのに十分な入力があれば、出力のドメイン全体に入力ビットを適切に分散させることができます。残念ながら、ランダムな開始番号を選択することは解決策ではありません。そうすると、ハッシュを正確に再計算することが不可能になるためです。

いずれの場合も、乗算で使用される素数は乗算をオーバーフローさせてはなりません。同様に、最初の入力の分散効果が失われないようにする場合は、上位ビットのキャプチャを下位ビットに置き換える必要があります（結果は後のビット/バイトのみにグループ化されます）。素数の選択は分散に影響し、効果を上げるにはチューニングが必要になる場合があります。

これで、ほぼ完全なハッシュは、まともなほぼ完全ではないハッシュよりも計算時間が長くなることが簡単にわかります。ハッシュアルゴリズムは衝突を考慮して設計されており、ほとんどのJavaハッシュ構造は占有しきい値（通常は70％の範囲ですが、調整可能）でサイズ変更されます。サイズ変更が組み込まれているため、ひどいハッシュを記述しない限り、Javaデータ構造は衝突の可能性が少なくなるように再調整し続けます。

ハッシュを高速化できる最適化には、ビットのグループでの計算、ときどきバイトの削除、一般的に使用される乗算された数値のルックアップテーブルの事前計算（入力によってインデックス付け）などが含まれます。アーキテクチャによっては、最適化が高速であると思い込まないでください。マシンの詳細、および最適化の「経過時間」、場合によっては、最適化の仮定がもはや成り立たず、最適化を適用すると、実際にハッシュを計算する時間が長くなります。

score 1 · Accepted Answer

あなたが説明していることには、完全なハッシュのようなものはありません。完全なハッシュは元の入力になります。データが特定のもの (ラテン語ベースの ASCII や特定のキーのみなど) のみであることが保証されている場合は、適切にハッシュできますが、完璧ですか? いいえ、できません。リンクリストまたはベクトルハッシュミスメカニズムも作成する必要があります。システム内のバリアント (あなたの場合の入力数など) は、完全なハッシュの概念を無効にします。

あなたが望むものは、数学の法則に逆らいます。

ほぼ O(1) を達成できますが、ここには未回答の質問があります。質問は次のとおりです。

なぜ線形ストレージでなければならないのですか?
テーブルからの削除は一般的ですか (最初の作成後にキーと値のペアが追加されないように指定しただけです)?
ハッシュの範囲と比較して、テーブルはどのくらい大きくなる可能性がありますか?
繰り返しデータと比較して、挿入はどのくらいの頻度ですか?
メモリは重要な要素ですか？

完全なハッシュは不可能ですが、潜在的な一意のハッシュの平均値から少なくとも標準偏差の 2 倍のバケットサイズを持つ単純なリンクリストを作成できれば、完全に学術的なものになります。それは最小限のメモリ (もちろん相対的に言えば、潜在的な合計サイズに応じて) であり、削除しやすく、質問 3 が「はるかに小さい」のような答えである限り、ほぼO(1) ルックアップ時間になります。

以下はあなたが始めるためのものですが、どのハッシュアルゴリズムを使用するかについての決定はあなたに任せます...

#include <stdlib.h>
#include <string.h>
#include <stdint.h>


// Dummy value type to test compile. Replace throughout 
#define YOUR_VALUE_T int

// See below where the charmap is
//#define HTABLE_USE_CHARMAP
// Maintain a true linked list that's manageable and iterateable
#define HTABLE_MAINTAIN_LIST
// Count lookup misses and such
#define HTABLE_KEEP_STATS
// Fast deletion = faster deletion but more memory consumption
//#define HTABLE_FAST_DELETES

#ifdef HTABLE_USE_CHARMAP
// This is used to quickly collapse the input from full 8-bit to the minimal character set of truely expected data.
// The idea here is to boil down the data. This should only be done if you're confident enough to develop a custom
// hashing algorithm for this particular known range
const char hashing_charmap[256] = {
   // Each data point that is unused (such as control characters or high 8-bit characters)
   // should be 0, while each used character should be represented with a unique sequential value (1, 2, 3, etc)
   // I'm not going to build this for you because it's very custom to your needs.
   // A chunk might look look like...
   /*
   0, 0, 0, 0, 17, 18, 19, 0, 0, 20, 21,
   */
};
#endif

static inline uint32_t hash_str(register const char* s, const size_t len) {
   register uint32_t hash = 5381; // hash seed here. This could be different depending on the actual algorithm chosen
   register char symbol;
   // This could be unrolled because we known string length as well.
   for (register size_t i=0; i < len; i++) {
      #ifdef HTABLE_USE_CHARMAP
      if (!(symbol = hash_charmap[s[i]]))
         continue;
      #else
      // Actually s[i] could simply be used (which would be faster) if no mapping is needed.
      symbol = s[i];
      #endif
      // True hash algorithm per-symbol operation here
      /*
      Keep in mind that certain algorithms are optimized for certain things.
      An example:
      Stock DJBX33A is very fast but effectively only represents the end of a long input. It's really meant for short inputs (like variable names)
      A MurmurHash or tuned FNV variant are likely to be a good picks since we've reduced symbol range and we are dealing with potential long inputs.
      It's also important to understand that the entire hash will likely not be used. Only the lower-end bits will be used
      (you'll see why in the actual functionality). If you're hashing algorithm is good though, this shouldn't matter because
      the distribution should be normal.
      I'll just use Jenkins one-at-a-time hash here (because it's easy)
      */
      hash += symbol;
      hash += (hash << 10);
      hash ^= (hash >> 6);
   }
   // Finialize jenkins one-at-a-time
   hash += (hash << 3);
   hash ^= (hash >> 11);
   hash += (hash << 15);
   return hash;
};


typedef struct _hash_entry {
   char* key;
   size_t key_len;
   uint32_t hash;
   // Whatever your value type is (likely a pointer to your own record or something)
   YOUR_VALUE_T value;
   // Internal linking maintains order.
   // If you don't need proper order maintentence, you don't need these
   #ifdef HTABLE_MAINTAIN_LIST
   struct _hash_entry* prev;
   struct _hash_entry* next;
   #endif

   #ifdef HTABLE_FAST_DELETES
   struct _hash_entry* bucket_prev;
   #endif
   // This is required for the occassional hash miss
   struct _hash_entry* bucket_next;
} hash_entry_t;


typedef struct _hash_table {
   // Counts
   size_t entry_count;
   uint32_t bucket_count;
   unsigned int growth_num;
   unsigned int growth_den;
   #ifdef HTABLE_KEEP_STATS
   // How many times we missed during lookup
   size_t misses;
   // (entry_count - used_buckets) tells you how many collisions there are (the lower the better)
   uint32_t used_buckets;
   #endif
   // Internal linking. Same conditions as in hash_entry_t so feel free to remove as necessary.
   #ifdef HTABLE_MAINTAIN_LIST
   hash_entry_t* first;
   hash_entry_t* last;
   #endif
   // Buckets, the soul of the hash table
   uint32_t hash_mask;
   hash_entry_t** buckets;
} hash_table_t;



// Creates a hash table
// size_hint - Tells to table how many buckets it should initially allocate.
//    If you know (for example) that you'll have about 500 entries, set it
//    to 500
// growth_num and growth_den - This is the ratio of how many entries to how
//    many buckets that you want to guarantee.
//    It's in two integers to avoid floating point math for speed.
//    The logic after an insertion is...
//       if (entry_count == growth_num * (bucket_count / growth_den)) then
//          grow the bucket array
//    For example, when growth_num is 4 and growth_den is 5...
//       (entry_count == 4 * (bucket_count / 5))
//   ...would be true when entry count is 80% of the bucket count
//    This can result in a greater than 1.0 ratio (such as 5/4 or something
//    like that) if you prefer. This would mean that there are less buckets
//    than there are entries, so collisions are guaranteed at that point, but
//    you would save on both memory and often a bucket expansion occurs (which
//    is costly during an insert operation).
static hash_table_t* htable_create(const size_t size_hint, const unsigned int growth_num, const unsigned int growth_den);
// Frees a hash table
static void htable_free(hash_table_t* table);
// Mostly used internally. You probably want htable_get(), htable_value(), or htable_exists()
static hash_entry_t* htable_find_entry(hash_table_t* table, const char* key, size_t key_len, uint32_t* hash, size_t* true_len);
// Get the pointer to a value stored in the table (or NULL on non-existant)
static YOUR_VALUE_T* htable_value(const hash_table_t* table, const char* key, size_t key_len);
// Get the value of an entry, or the default value if the entry doesn't exist
static YOUR_VALUE_T htable_get(const hash_table_t* table, const char* key, size_t key_len, const YOUR_VALUE_T default_value);
// Test for the existance of a value
static int htable_exists(const hash_table_t* table, const char* key, size_t key_len);
// Add a new entry (but don't update if it already exists). Returns NULL if it already exists
static hash_entry_t* htable_add(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Update an entry OR add a a new entry it doesn't already exist
static hash_entry_t* htable_set(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Update an entry but don't add a a new entry it doesn't already exist. Returns NULL if doesn't exist
static hash_entry_t* htable_update(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value);
// Delete an entry. Returns 1 on success or 0 if the entry didn't exist
static int htable_delete(hash_table_t* table, const char* key, size_t key_len);
// Pack the table.
// This is here because...
// - If HTABLE_FAST_DELETES is set, and if you delete a bunch of entries, it's
//   possible that you can free up some memory by shrinking the bucket array.
//   You would have to call this manually to make that happen.
// - If HTABLE_FAST_DELETES is NOT set however, this get's called automatically
//   on each delete, so the buckets are guaranteed to be packed.
static void htable_pack(hash_table_t* table);



/*********************************\
Implementation...
\*********************************/
static hash_table_t* htable_create(const unsigned long size_hint, const unsigned int growth_num, const unsigned int growth_den) {
   hash_table_t* res = malloc(sizeof(hash_table_t));
   if (!res)
      return NULL;
   res->entry_count = 0;
   #ifdef HTABLE_MAINTAIN_LIST
   res->first = NULL;
   res->last = NULL;
   #endif

   #ifdef HTABLE_KEEP_STATS
   res->misses = 0; 
   res->used_buckets = 0;
   #endif
   if ((!growth_num) || (!growth_den)) {
      // Grow only when the entry count matches the bucket count
      res->growth_num = 1;
      res->growth_den = 1;
   } else {
      res->growth_num = growth_num;
      res->growth_den = growth_den;
   }
   /*
   For computational speed and simplicity we'll grow the bucket array exponentially.
   Not growing the buckets exponentially is possible but requires a different
   entry lookup mechanism (because hash & hash_mask would no longer work) and would 
   likely involve the modulas operator which is very slow. If memory is uber important
   however, this might be a good solution.
   */
   // We'll go ahead and assume it's a reasonably small table and only allocate 256 buckets.
   int bits = 8;
   if (size_hint) {
      unsigned long target = (size_hint * res->growth_den) / res->growth_num;
      // First check is to prevent overflow as it would be 0 when bits is 31 on a 32 bit system
      while ((1 << (bits + 1)) && ((1 << bits) < target))
         bits++;
   }
   res->bucket_count = 1 << bits;
   res->hash_mask = (1 << bits) - 1;
   if ((res->buckets = (hash_entry_t**)calloc(res->bucket_count, sizeof(hash_entry_t*))) == NULL) {
      free(res);
      return NULL;
   }
   memset(res->buckets, 0, sizeof(hash_entry_t*) * res->bucket_count);
   return res;
};

// Destroy a table
static void htable_free(hash_table_t* table) {
   hash_entry_t* entry;
   hash_entry_t* next;
   #ifdef HTABLE_MAINTAIN_LIST
      entry = table->first;
      while (entry) {
         next = entry->next;
         free(entry->key);
         free(entry);
         entry = next;
      }
   #else
      for (uint32_t i=0; i < table->bucket_count; i++) {
         entry = table->buckets[i];
         while (entry) {
            next = entry->bucket_next;
            free(entry->key);
            free(entry);
            entry = next;
         }
      }
   #endif
   free(table->buckets);
   free(table);
}

// Find an entry: (mostly used internally)
// returns NULL when the entry isn't found
static hash_entry_t* htable_find_entry(hash_table_t* table, const char* key, size_t key_len, uint32_t* hash, size_t* true_len) {
   if (!key_len)
      key_len = strlen(key);
   if (true_len != NULL)
      *true_len = key_len;
   uint32_t h = hash_str(key, key_len);
   if (hash != NULL)
      *hash = h;
   uint32_t bucket = h & table->hash_mask;
   // Best case is here is O(1) because table->buckets[bucket] would be the entry
   hash_entry_t* entry = table->buckets[bucket];
   // ... but if we miss, then the time increases to as much as O(n) where n is the number of entries in
   // the particular bucket (good hash + good ratio management means that n would usually be only 1)
   while ((entry) && ((entry->hash != h) || (entry->key_len != key_len) || (memcmp(entry->key, key, key_len)))) {
      #ifdef HTABLE_KEEP_STATS
      table->misses++;
      #endif
      entry = entry->bucket_next;
   }
   return entry;
}


// Insertion of entry into bucket. Used internally
static inline int _htable_bucket_insert(hash_entry_t** buckets, hash_entry_t* entry, const uint32_t hash_mask) {
   hash_entry_t* bentry;
   #ifdef HTABLE_FAST_DELETES
      entry->bucket_prev = NULL;
   #endif
   entry->bucket_next = NULL;
   uint32_t bidx = entry->hash & hash_mask;
   int res = 0;
   if ((bentry = buckets[bidx]) == NULL) {
      res = 1;
      buckets[bidx] = entry;
   } else {
      while (bentry->bucket_next)
         bentry = bentry->bucket_next;
      bentry->bucket_next = entry;
      #ifdef HTABLE_FAST_DELETES
         entry->bucket_prev = bentry;
      #endif
   }
   return res;
}

// Bucket array growing/shrinking. Used internally
static void _htable_adjust_as_needed(hash_table_t* table) {
   int change = (((table->bucket_count << 1) != 0) && (table->entry_count >= table->growth_num * (table->bucket_count / table->growth_den)));
   if (!change) {
      if ((table->bucket_count > (1 << 8)) && (table->entry_count < table->growth_num * ((table->bucket_count >> 1) / table->growth_den))) {
         change = -1;
      } else {
         return;
      }
   }
   uint32_t new_bucket_count = (change < 0) ? table->bucket_count >> 1 : table->bucket_count << 1;
   uint32_t new_hash_mask = new_bucket_count - 1;
   hash_entry_t** new_buckets = (hash_entry_t**)calloc(new_bucket_count, sizeof(hash_entry_t*));
   if (!new_buckets)
      return;
   memset(new_buckets, 0, new_bucket_count * sizeof(hash_entry_t*));
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets = 0;
   #endif
   hash_entry_t* entry;
   #ifdef HTABLE_MAINTAIN_LIST
      entry = table->first;
      while (entry) {
         int r = _htable_bucket_insert(new_buckets, entry, new_hash_mask);
         #ifdef HTABLE_KEEP_STATS
         table->used_buckets += r;
         #endif
         entry = entry->next;
      }
   #else
      hash_entry_t* next;
      for (uint32_t i=0; i < table->bucket_count; i++) {
         entry = table->buckets[i];
         while (entry) {
            next = entry->bucket_next;
            int r = _htable_bucket_insert(new_buckets, entry, new_hash_mask);
            #ifdef HTABLE_KEEP_STATS
            table->used_buckets += r;
            #endif
            entry = next;
         }
      }
   #endif
   free(table->buckets);
   table->buckets = new_buckets;
   table->bucket_count = new_bucket_count;
   table->hash_mask = new_hash_mask;
}


// Get the pointer to the value of the entry or NULL if not in table
static YOUR_VALUE_T* htable_value(const hash_table_t* table, const char* key, size_t key_len) {
   // un-const table so that find_entry can keep statistics
   hash_entry_t* entry = htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL);
   return (entry != NULL) ? &entry->value : NULL;
}

static YOUR_VALUE_T htable_get(const hash_table_t* table, const char* key, size_t key_len, const YOUR_VALUE_T default_value) {
   // un-const table so that find_entry can keep statistics
   hash_entry_t* entry = htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL);
   return (entry != NULL) ? entry->value : default_value;
}

static int htable_exists(const hash_table_t* table, const char* key, size_t key_len) {
   // un-const table so that find_entry can keep statistics
   return (htable_find_entry((hash_table_t*)table, key, key_len, NULL, NULL) != NULL);
}

// Add a new entry (but don't update if it already exists)
// Returns NULL if the entry already exists (use set() if you want add or update logic)
static hash_entry_t* htable_add(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   uint32_t hash;
   hash_entry_t* res = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (res != NULL)
      return NULL;
   if ((res = (hash_entry_t*)malloc(sizeof(hash_entry_t))) == NULL)
      return NULL;
   if ((res->key = (char*)malloc(key_len + 1)) == NULL) {
      free(res);
      return NULL;
   }
   memcpy(res->key, key, key_len + 1);
   res->key_len = key_len;
   res->hash = hash;
   res->value = value;
   #ifdef HTABLE_MAINTAIN_LIST
   res->prev = NULL;
   res->next = NULL;
   if (table->first == NULL) {
      table->first = res;
      table->last = res;
   } else {
      res->prev = table->last;
      table->last->next = res;
      table->last = res;
   }
   #endif
   int r = _htable_bucket_insert(table->buckets, res, table->hash_mask);
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets += r;
   #endif
   table->entry_count++;
   _htable_adjust_as_needed(table);
   return res;
}


static hash_entry_t* htable_set(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   uint32_t hash;
   hash_entry_t* res = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (res != NULL) {
      res->value = value;
      return res;
   }
   if ((res = (hash_entry_t*)malloc(sizeof(hash_entry_t))) == NULL)
      return NULL;
   if ((res->key = (char*)malloc(key_len + 1)) == NULL) {
      free(res);
      return NULL;
   }
   memcpy(res->key, key, key_len + 1);
   res->key_len = key_len;
   res->hash = hash;
   res->value = value;
   #ifdef HTABLE_MAINTAIN_LIST
   res->prev = NULL;
   res->next = NULL;
   if (table->first == NULL) {
      table->first = res;
      table->last = res;
   } else {
      res->prev = table->last;
      table->last->next = res;
      table->last = res;
   }
   #endif
   int r = _htable_bucket_insert(table->buckets, res, table->hash_mask);
   #ifdef HTABLE_KEEP_STATS
      table->used_buckets += r;
   #endif
   table->entry_count++;
   _htable_adjust_as_needed(table);
   return res;
}

// Update an entry but don't add a a new entry it doesn't already exist. Returns NULL if doesn't exist
static hash_entry_t* htable_update(hash_table_t* table, const char* key, size_t key_len, YOUR_VALUE_T value) {
   hash_entry_t* res = htable_find_entry(table, key, key_len, NULL, NULL);
   if (res == NULL)
      return NULL;
   res->value = value;
   return res;
}

// Delete an entry. Returns 1 on success or 0 if the entry didn't exist
static int htable_delete(hash_table_t* table, const char* key, size_t key_len) {
   uint32_t hash;
   hash_entry_t* entry = htable_find_entry(table, key, key_len, &hash, &key_len);
   if (entry == NULL)
      return 0;

   #ifdef HTABLE_MAINTAIN_LIST
      if (entry == table->first)
         table->first = entry->next;
      if (entry == table->last) {
         table->last = entry->prev;
      }
      if (entry->prev != NULL)
         entry->prev->next = entry->next;
      if (entry->next != NULL)
         entry->next->prev = entry->prev;
   #endif

   uint32_t bucket = hash & table->hash_mask;
   hash_entry_t* bhead = table->buckets[bucket];
   hash_entry_t* bprev = NULL;
   #ifdef HTABLE_FAST_DELETES
      bprev = entry->bucket_prev;
   #else
      if (bhead != entry) {
         bprev = bhead;
         while (bprev->bucket_next != entry)
            bprev = bprev->bucket_next;
      }
   #endif
   if (bprev != NULL)
      bprev->bucket_next = entry->bucket_next;

   #ifdef HTABLE_FAST_DELETES
      if (entry->bucket_next != NULL)
         entry->bucket_next->bucket_prev = entry->bucket_next;
   #endif

   if (bhead == entry) {
      table->buckets[bucket] = entry->bucket_next;
      #ifdef HTABLE_KEEP_STATS
         if (entry->bucket_next == NULL)
            table->used_buckets--;
      #endif
   }

   free(entry->key);
   free(entry);
   table->entry_count--;

   #ifndef HTABLE_FAST_DELETES
      htable_pack(table);
   #endif
   return 1;
}

static void htable_pack(hash_table_t* table) {
   _htable_adjust_as_needed(table);
}

使用例 (アサーションとして) と効率テスト。intデータ値の型として使用...

hash_table_t* ht = htable_create(0, 0, 0);
assert(ht != NULL);  // Table was created successfully

// Testing basic adding/updating/getting...
assert(htable_add(ht, "hello-world", 0, 234) != NULL); // hello-world set to 234
assert(htable_add(ht, "goodbye-world", 0, 567) != NULL); // goobye-world set to 567
assert(ht->entry_count == 2); // Two entries exist (hello-world and goodbye-world)
assert(htable_exists(ht, "hello-world", 0) == 1); // hello-world exists
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goodbye-world exists
assert(htable_exists(ht, "unknown-world", 0) == 0); // unknown-world doesn't exist
assert(*htable_value(ht, "hello-world", 0) == 234); // hello-world has a value of 234
assert(*htable_value(ht, "goodbye-world", 0) == 567); // goodbye-world has a value of 567
assert(htable_get(ht, "hello-world", 0, -1) == 234); // hello-world exists and has a value of 234
assert(htable_get(ht, "goodbye-world", 0, -1) == 567); // goobye-world exists and has a value of 567
assert(htable_get(ht, "unknown-world", 0, -1) == -1); // unknown-world does not exist so the default value of -1 is returned
*htable_value(ht, "hello-world", 0) = -1; // hello-world's value is directly set via reference to -1
*htable_value(ht, "goodbye-world", 0) = -2; // goodbye-world's value is directly set via reference to -2
assert(*htable_value(ht, "hello-world", 0) == -1); // hello-world has a value of -1
assert(*htable_value(ht, "goodbye-world", 0) == -2); // goodbye-world has a value of -2
assert(htable_update(ht, "hello-world", 0, 1000) != NULL); // hello-world set to 1000
assert(htable_update(ht, "goodbye-world", 0, 2000) != NULL); // goodbye-world set to 2000
assert(htable_update(ht, "unknown-world", 0, 3000) == NULL); // unknown-world not set (it doesn't exist);
assert(ht->entry_count == 2); // Two entries exist (hello-world and goodbye-world)
assert(htable_set(ht, "hello-world", 0, 1111) != NULL); // hello-world set to 1111
assert(htable_set(ht, "goodbye-world", 0, 2222) != NULL); // goodbye-world set to 2222
assert(htable_set(ht, "unknown-world", 0, 3333) != NULL); // unknown-world added with a value of 3333
assert(ht->entry_count == 3); // Three entries exist (hello-world, goodbye-world, and unknown-world)
printf("%s\n", "After all additions and changes:");
#ifdef HTABLE_MAINTAIN_LIST
// A foreach iteration
hash_entry_t* entry = ht->first;
while (entry != NULL) {
   printf("\"%s\" = %i\n", entry->key, entry->value);
   entry = entry->next;
}
#endif
#ifdef HTABLE_KEEP_STATS
assert(ht->entry_count - ht->used_buckets == 0); // Means that no hash collisions occured
assert(ht->misses == 0); // Means that each lookup was in O(1) time
#endif

// Testing basic deletion...
assert(htable_delete(ht, "not-a-world", 0) == 0); // not-a-world not deleted (doesn't exist)
assert(htable_delete(ht, "hello-world", 0) == 1); // hello-world deleted
assert(htable_delete(ht, "hello-world", 0) == 0); // hello-world not deleted (doesn't exist)
assert(htable_exists(ht, "hello-world", 0) == 0); // hello-world doesn't exit
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goobye-world still exists
assert(htable_exists(ht, "unknown-world", 0) == 1); // unknown-world still exists
assert(ht->entry_count == 2); // Two entries exists (goodbye-world and unknown-world)
assert(htable_delete(ht, "unknown-world", 0) == 1); // unknown-world deleted
assert(htable_exists(ht, "unknown-world", 0) == 0); // unknown-world doesn't exit
assert(htable_exists(ht, "goodbye-world", 0) == 1); // goodbye-world still exists
assert(ht->entry_count == 1); // One entry exists (goodbye-world)
#ifdef HTABLE_MAINTAIN_LIST
// A foreach iteration
printf("%s\n", "After deletion:");
entry = ht->first;
while (entry != NULL) {
   printf("\"%s\" = %i\n", entry->key, entry->value);
   entry = entry->next;
}
#endif

#ifdef HTABLE_KEEP_STATS
assert(ht->entry_count - ht->used_buckets == 0); // Means that no hash collisions occured
assert(ht->misses == 0); // Means that each lookup was in O(1) time
#endif

htable_free(ht);

さらに、ランダムに生成された 5 ～ 1000 文字の長さの 100,000 個の ASCII キーを使用していくつかのテストを行った結果、次の結果が得られました...

デフォルトのパラメータを使用してランダムに入力した後:
- エントリー数: 100000
- バケット: 131072
- 使用済みバケット: 69790
- 衝突: 30210
- ミス: 71394
- ハッシュ/バケット効率: 69.79%
1/2 の成長率を使用してランダムに入力した後:
- エントリー数: 100000
- バケット: 262144
- 使用済みバケット: 83181
- 衝突: 16819
- ミス: 35436
- ハッシュ/バケット効率: 83.18%
2/1 の成長率を使用してランダムに入力した後:
- エントリー数: 100000
- バケット: 65536
- 使用済みバケット: 51368
- 衝突: 48632
- ミス: 141607
- ハッシュ/バケット効率: 51.37%

ご覧のとおり、非常に優れたパフォーマンスを発揮する可能性があります。80% の効率は、ルックアップの約 80% が O(1)、ルックアップの約 16% が O(2)、ルックアップの約 3.2% が O(3)、およびルックアップの約 0.8% が O(3) であることを意味します。 O(4+)。これは、平均してルックアップにO(1.248)かかることを意味します。

同様に、50% の効率は、ルックアップの 50% が O(1)、25% が O(2)、12.5% が O(3)、12.5% が O(4+) であることを意味します。

既知の要素に適したハッシュアルゴリズムを選択 (または記述) し、特定のニーズに合わせて微調整するだけです。

ノート：

これらのアサーション/テストはうまくいきましたが、バグがないことは保証されていません。それはかなり安定しているようです。そこにはおそらくバグが1つまたは2つ浮かんでいます。
リスト管理が必要な場合はmove()、swap()、sort()、insert()、などを管理entry->prevして簡単に追加できます。entry->next
回答サイズの制限に達したように見えるため、テストコードを含めることができませんでした。
時間分析には、ハッシュ関数も最終的な文字列比較も含まれません。これは、入力に関するすべての統計を知らずに分析することは不可能です。ただし、両方の関数は非常に高速である必要があり、入力データについてより多くの情報がわかっている場合、文字列の比較は完全に除外できます。

score 0 · Accepted Answer

可能なすべての変数名のセットを知っている場合、名前を数値に完全にハッシュするために使用することが可能です

ただし、各ハッシュテーブルは同じ長さになります。たとえば、Xとyが名前の場合、マップの長さは常に 2 になります。

0 と 1 に変わりますperfect(str)。関数は次のようになります'x''y'get

get(field, table) 
{
   return table[perfect(field)];
}

c - この状況で最小限の完全なハッシュ関数を作成することは可能ですか?

4 に答える 4

Related

Reference