c - Is there an algorithm to determine whether two strings are anagrams of each other?

Question

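I have this idea (using C) for checking whether two strings formed from ASCII letters are anagrams of one another:

Check whether the strings have the same length.
Check whether the sum of the ASCII values of all characters is the same for both strings.
Check whether the product of the ASCII values of all characters is the same for both strings.

I believe that if all three hold, the strings must be anagrams of one another. However, I can't prove it. Can someone help me prove or disprove that this works?

Thanks!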

score 11 · Accepted Answer

I wrote a quick program to brute-force search for conflicts and found that this approach does not always work. The strings ABFN and AAHM have the same ASCII sum and product, but are not anagrams of one another. Their ASCII sum is 279 and ASCII product is 23,423,400.
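
The pair above can be checked directly. Below is a minimal C sketch of the three tests from the question. The names `Signature`, `signature`, and `passes_checks` are made up for this illustration and do not come from the question or from the search program further down. It assumes short strings, so that the product fits in an `unsigned long long`.

```c
#include <stdbool.h>
#include <string.h>

/* Length, ASCII sum and ASCII product of a string (hypothetical helper type). */
typedef struct {
    size_t len;
    unsigned long long sum;
    unsigned long long prod;
} Signature;

Signature signature(const char *s)
{
    Signature sig = { strlen(s), 0, 1 };
    for (const char *p = s; *p; p++) {
        sig.sum  += (unsigned char)*p;
        sig.prod *= (unsigned char)*p;   /* wraps around for long strings */
    }
    return sig;
}

/* The three tests proposed in the question: same length, sum and product. */
bool passes_checks(const char *a, const char *b)
{
    Signature x = signature(a), y = signature(b);
    return x.len == y.len && x.sum == y.sum && x.prod == y.prod;
}
```

With this sketch, `passes_checks("ABFN", "AAHM")` returns true: both strings give a sum of 279 and a product of 23,423,400, although they are not anagrams.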

There are a lot more conflicts than this. My program, searching over all length-four strings, found 11,737 conflicts.

For reference, here's the C++ source code:

#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

int main() {
  /* Sparse 2D table where used[sum][prod] is either nothing or is a string
   * whose characters sum to "sum" and whose product is "prod".
   */
  map<int, map<int, string> > used;

  /* List of all usable characters in the string. */
  vector<char> usable;
  for (char ch = 'A'; ch <= 'Z'; ch++) {
    usable.push_back(ch);
  }
  for (char ch = 'a'; ch <= 'z'; ch++) {
    usable.push_back(ch);
  }

  /* Brute-force search over all possible length-four strings.  To avoid
   * iterating over anagrams, the search only explores strings whose letters
   * are in non-decreasing ASCII order (repeated letters are allowed).
   */
  for (int a = 0; a < usable.size(); a++) {
    for (int b = a; b < usable.size(); b++) {
      for (int c = b; c < usable.size(); c++) {
        for (int d = c; d < usable.size(); d++) {
          /* Compute the sum and product. */
          int sum  = usable[a] + usable[b] + usable[c] + usable[d];
          int prod = usable[a] * usable[b] * usable[c] * usable[d];

          /* See if we have already seen this. */
          if (used.count(sum) &&
              used[sum].count(prod)) {
            cout << "Conflict found: " << usable[a] << usable[b] << usable[c] << usable[d] << " conflicts with " << used[sum][prod] << endl;
          }

          /* Update the table. */
          used[sum][prod] = string() + usable[a] + usable[b] + usable[c] + usable[d];
        }
      }
    }
  }
}

Hope this helps!

score 5 · Accepted Answer

Your approach doesn't work. I can't explain why, because I don't fully understand the math myself, but there are different sets, at least of cardinality 3, that have the same sum and product: https://math.stackexchange.com/questions/38671/two-sets-of-3-positive-integers-with-equal-sum-and-product
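
One concrete instance of this: {2, 6, 6} and {3, 3, 8} both have sum 14 and product 72. The following C sketch searches small positive integers for such a pair of triples. The function name `find_triple_collision` and its `limit` parameter are hypothetical, chosen only for this example, and the search is a plain brute force rather than anything taken from the linked page.

```c
#include <stdbool.h>

/* Look for two different sorted triples with entries in 1..limit that share
 * both their sum and their product.  On success the triples are stored in
 * first[] and second[] and true is returned. */
bool find_triple_collision(int limit, int first[3], int second[3])
{
    for (int a = 1; a <= limit; a++)
        for (int b = a; b <= limit; b++)
            for (int c = b; c <= limit; c++)
                /* Two distinct triples with equal sum and product cannot
                 * share their smallest element, so start x above a. */
                for (int x = a + 1; x <= limit; x++)
                    for (int y = x; y <= limit; y++) {
                        int z = a + b + c - x - y;   /* forced by equal sums */
                        if (z < y || z > limit)
                            continue;
                        if (x * y * z == a * b * c) {
                            first[0] = a;  first[1] = b;  first[2] = c;
                            second[0] = x; second[1] = y; second[2] = z;
                            return true;
                        }
                    }
    return false;
}
```

Calling it with a limit of 10 is enough to find a pair, since {2, 6, 6} and {3, 3, 8} both lie in that range.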

score 5 · Accepted Answer

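The characters a-z and A-Z are used to index into an array of 26 primes, and the product of these primes is used as the hash value for the word. Equal product <--> same letters (by unique prime factorization, provided the product does not overflow HashVal).

(The order of the hash values in the primes26[] array in the fragment below is based on the letter frequencies in Dutch, in an attempt to keep the expected product small.)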

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define COUNTOF(a) (sizeof (a)/ sizeof (a)[0])

typedef unsigned long long HashVal;
HashVal hashmem (char *str, size_t len);

unsigned char primes26[] =
{
5,71,79,19,2,83,31,43,11,53,37,23,41,3,13,73,101,17,29,7,59,47,61,97,89,67,
};

struct anahash {
        struct anahash *next;
        unsigned freq;
        HashVal hash;
        char word[1];
        };

struct anahash *hashtab[1024*1024] = {NULL,};
struct anahash *new_word(char *str, size_t len);
struct anahash **hash_find(struct anahash *wp);

/*********************************************/

HashVal hashmem (char *str, size_t len)
{
size_t idx;
HashVal val=1;

if (!len) return 0;
for (idx = 0; idx < len; idx++) {
        char ch = str[idx];
        if (ch >= 'A' && ch <= 'Z' ) val *= primes26[ ch - 'A'];
        else if (ch >= 'a' && ch <= 'z' ) val *= primes26[ ch - 'a'];
        else continue;
        }
return val;
}

struct anahash *new_word(char *str, size_t len)
{
struct anahash *wp;
if (!len) len = strlen(str);

wp = malloc(len + sizeof *wp );
if (!wp) return NULL; /* out of memory; the caller checks for NULL */
wp->hash = hashmem(str, len);
wp->next = NULL;
wp->freq = 0;
memcpy (wp->word, str, len);
wp->word[len] = 0;
return wp;
}

struct anahash **hash_find(struct anahash *wp)
{
unsigned slot;
struct anahash **pp;

slot = wp->hash % COUNTOF(hashtab);

for (pp = &hashtab[slot]; *pp; pp= &(*pp)->next) {
        /* the chain is kept ordered by hash first, then by word */
        if ((*pp)->hash < wp->hash) continue;
        if ((*pp)->hash == wp->hash && strcmp( wp->word, (*pp)->word ) > 0) continue;
        break;
        }
return pp;
}

char buff [16*4096];
int main (void)
{
size_t pos,end;
struct anahash *wp, **pp;
HashVal val = 0; /* last hash printed; 0 never matches the hash of a real word */

memset(hashtab, 0, sizeof hashtab);

while (fgets(buff, sizeof buff, stdin)) {
        for (pos=0; pos < sizeof buff && buff[pos]; ) {
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] < 'A' || buff[end] > 'z') break;
                        if (buff[end] > 'Z' && buff[end] < 'a') break;
                        }
                if (end > pos) {
                        wp = new_word(buff+pos, end-pos);
                        if (!wp) {pos=end; continue; }
                        pp = hash_find(wp);
                        if (!*pp) *pp = wp;
                        else if ((*pp)->hash == wp->hash
                         && !strcmp((*pp)->word , wp->word)) free(wp);
                        else { wp->next = *pp; *pp = wp; }
                        (*pp)->freq +=1;
                        }
                pos = end;
                for(end = pos; end < sizeof buff && buff[end]; end++ ) {
                        if (buff[end] >= 'A' && buff[end] <= 'Z') break;
                        if (buff[end] >= 'a' && buff[end] <= 'z') break;
                        }
                pos = end;
                }
        }
for (pos = 0;  pos < COUNTOF(hashtab); pos++) {
        if (! hashtab[pos] ) continue; /* skip empty slots */

        for (pp = &hashtab[pos]; (wp = *pp) != NULL; pp = &wp->next) {
                if (val != wp->hash) {
                        fprintf (stdout, "\nSlot:%zu:\n", pos );
                        val = wp->hash;
                        }
                fprintf (stdout, "\t%llx:%u:%s\n", wp->hash, wp->freq, wp->word);
                }
        }

return 0;
}
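
For comparing just two strings, the prime-product idea from this answer can be sketched more compactly. This is only an illustration under a few assumptions: the names `small_primes`, `prime_product`, and `is_anagram_by_primes` are hypothetical, the primes are assigned in plain alphabetical order rather than by Dutch letter frequency, and overflow is reported as "cannot decide" instead of being ignored.

```c
#include <limits.h>
#include <stdbool.h>

/* First 26 primes in alphabetical order: 'a' -> 2, 'b' -> 3, ... 'z' -> 101. */
static const unsigned char small_primes[26] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

/* Multiply the primes of all letters in s (case-insensitive, non-letters
 * are skipped).  Returns false if the product would not fit in *out. */
bool prime_product(const char *s, unsigned long long *out)
{
    unsigned long long val = 1;
    for (; *s; s++) {
        unsigned long long p;
        if (*s >= 'A' && *s <= 'Z')      p = small_primes[*s - 'A'];
        else if (*s >= 'a' && *s <= 'z') p = small_primes[*s - 'a'];
        else continue;
        if (val > ULLONG_MAX / p) return false;   /* next step would overflow */
        val *= p;
    }
    *out = val;
    return true;
}

/* 1: same letters, 0: different letters, -1: too long to decide this way. */
int is_anagram_by_primes(const char *a, const char *b)
{
    unsigned long long pa, pb;
    if (!prime_product(a, &pa) || !prime_product(b, &pb)) return -1;
    return pa == pb;
}
```

With this table, any word of up to 9 letters fits into 64 bits, because 101^9 is about 1.09e18. Longer words may or may not fit, depending on their letters, which is why the sketch returns -1 rather than comparing wrapped values.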

score 4 · Accepted Answer

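Thanks for such a great question! Instead of trying to disprove your proposition altogether, I spent some time looking for ways to strengthen it so that it becomes true. I have a feeling that if the standard deviations are also equal, then the two strings are equal. But instead of testing that far, I ran a simpler test and have not found a counterexample yet. Here is what I tested:

In addition to the conditions you mentioned before,

the square root of the sum of the squares of the ASCII values must be equal.

I use the following Python program. I don't have a complete proof, but maybe my answer will help. Anyway, take a look.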

from math import sqrt

class Nothing:

  def equalString( self, strA, strB ):
    prodA, prodB = 1, 1
    sumA, sumB = 0, 0
    geoA, geoB = 0, 0

    for a in strA:
      i = ord( a )
      prodA *= i
      sumA += i
      geoA += ( i ** 2 )
    geoA = sqrt( geoA )

    for b in strB:
      i = ord( b )
      prodB *= i
      sumB += i
      geoB += ( i ** 2 )
    geoB = sqrt( geoB )

    if prodA == prodB and sumA == sumB and geoA == geoB:
      return True
    else:
      return False


  def compareStrings( self ):
    first, last = ord( 'A' ), ord( 'z' )
    for a in range( first, last + 1 ):
      for b in range( a, last + 1 ):
        for c in range( b, last + 1 ):
          for d in range( c, last + 1 ):
            strA = chr( a ) + chr( b ) + chr( c ) + chr( d )
            # strB is strA reversed, so every pair tested here is a true anagram
            strB = chr( d ) + chr( c ) + chr( b ) + chr( a )

            if not self.equalString( strA, strB ):
              print( "%s and %s should be equal.\n" % ( strA, strB ) )

    print( "Done" )
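
As a cross-check in C of the extra condition in this answer, here is a small sketch that computes the sum, the product, and the sum of squares of the ASCII values. The names `ExtSignature`, `ext_signature`, and `ext_equal` are hypothetical. It compares the sums of squares directly instead of their square roots, which gives the same answer and avoids floating point. It assumes short strings so that the product does not overflow.

```c
#include <stdbool.h>

/* Sum, product and sum of squares of the ASCII values (hypothetical type). */
typedef struct {
    unsigned long long sum;
    unsigned long long prod;
    unsigned long long squares;
} ExtSignature;

ExtSignature ext_signature(const char *s)
{
    ExtSignature sig = { 0, 1, 0 };
    for (; *s; s++) {
        unsigned long long v = (unsigned char)*s;
        sig.sum     += v;
        sig.prod    *= v;       /* wraps around for long strings */
        sig.squares += v * v;
    }
    return sig;
}

/* True if all three values agree; equal sums of squares imply equal roots. */
bool ext_equal(const char *a, const char *b)
{
    ExtSignature x = ext_signature(a), y = ext_signature(b);
    return x.sum == y.sum && x.prod == y.prod && x.squares == y.squares;
}
```

For the pair ABFN and AAHM, which share sum and product, the sums of squares are 19565 and 19563, so `ext_equal("ABFN", "AAHM")` returns false. That only shows the extra condition separates this one pair; it says nothing about whether the strengthened test is correct in general.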

score 1 · Accepted Answer

If you don't mind modifying the strings, sort each of them and compare the two signatures.
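
A minimal sketch of that in C, using the standard `qsort`. The names `cmp_char` and `is_anagram_sorting` are made up for this example, and both arguments must be modifiable arrays because they are sorted in place.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: orders bytes by their unsigned value. */
static int cmp_char(const void *a, const void *b)
{
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Sorts both strings in place, then compares the sorted results. */
bool is_anagram_sorting(char *a, char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return false;               /* different lengths: not anagrams */
    qsort(a, la, sizeof *a, cmp_char);
    qsort(b, lb, sizeof *b, cmp_char);
    return memcmp(a, b, la) == 0;
}
```

Pass arrays such as `char s[] = "listen";` rather than string literals, since writing to a string literal is undefined behavior in C. If the originals must be preserved, sort copies instead.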
