c# - 一致する文字列アルゴリズムの検索

Question

非常に長い 5 つの文字列があります (文字列の数は変わる可能性があります)。これらの文字列には固定の形式はありません。部分文字列の長さを示す数値を提供します。指定された長さの一致する部分文字列を見つけたい。たとえば、文字列は次のとおりです。

      1.     abcabcabc
      2.     abcasdfklop

文字列の長さ: 3

これらの値を指定すると、出力は次のようになります。

一致 #1:

 Matched string :               "abc"

 Matches in first string:        3

 Matching positions:             0,3,6

 Matches in second string:       1

 Match positions:                0

マッチ #2:

 Matched string :               "bca"

 Matches in first string:        2

 Matching positions:             1,4

 Matches in second string:       1

 Match    positions:             1

私は4つのforeachステートメントでそれを行うことができました。しかし、それはあまりにも効率が悪いように思えました.特に入力サイズが非常に大きい場合.c#でこれをより効率的に管理するための提案や短い方法はありますか? 実際のコードである必要はありません。疑似コードのみも役立ちます。

score 3 · Accepted Answer

これは接尾辞配列で行うことができます。(サフィックスツリーも問題なく動作しますが、実装にはもう少し多くのスペース、時間、注意が必要です。)

2 つの文字列を連結し、どちらにも含まれない文字で区切ります。次に、接尾辞配列を作成します。次に、答えを読み上げることができます。

標準サフィックス配列は、文字列のサフィックスへのポインターの辞書式にソートされた配列と、辞書式に連続した 2 つのサフィックスの最長共通プレフィックスの長さを示す「最長共通プレフィックス長」配列を提供します。

必要な情報を取得するために、最も長い一般的なプレフィックス長の配列を使用するのはかなり簡単です。最長共通プレフィックス長が少なくともクエリ長である、最長共通プレフィックス長配列のすべての最大部分配列を見つけ、次に、最初の文字列と 2 番目の文字列の両方で一致する各サブ配列について、適切なプレフィックスを報告し、 K+1 回発生することを報告します。ここで、K は最大部分配列の長さです。

コーディングが簡単なもう 1 つの方法は、適切な長さのすべての部分文字列をハッシュすることです。これは、任意のローリングハッシュ関数で簡単に実行できます。各ハッシュの文字列にポインターの動的配列を格納します。すべての文字列をハッシュしたら、出てきたすべてのハッシュを繰り返し、一致するものを探します。何らかの方法で誤検知に対処する必要があります。1 つの (確率的) アプローチは、偽陽性の確率が許容できるほど小さくなるまで、いくつかのハッシュ関数を使用することです。一致するものがほとんどない場合にのみ受け入れられる可能性が高い別のアプローチは、文字列を直接比較することです。

score 1 · Accepted Answer

この問題を解決するには、サフィックスツリーのバリアントを使用できます。http://en.wikipedia.org/wiki/Longest_common_substring_problem これもチェックしてください:アルゴリズム: 順序が保持されている 2 つの文字列の間のすべての共通部分文字列を検索します

score 0 · Accepted Answer

非常に大きな文字列を使用すると、メモリが問題になることがあります。以下のコードは、最も長い共通部分文字列を見つけて、より小さな共通部分文字列を含む変数を上書きしますが、インデックスと長さをリストにプッシュして文字列の配列として返すように簡単に変更できます。

これは、 https: //iq.opengenus.org/longest-common-substring-using-rolling-hash/ にある Ashutosh Singh のリファクタリングされた C++ コードです。これにより、O(N * log(N)^2) 時間で部分文字列が検出されます。および O(N) スペース

using System;
using System.Collections.Generic;
public class RollingHash
{
    private class RollingHashPowers
    {
        // _mod = prime modulus of polynomial hashing
        // any prime number over a billion should suffice
        internal const int _mod = (int)1e9 + 123;
        // _hashBase = base (point of hashing)
        // this should be a prime number larger than the number of characters used
        // in my use case I am only interested in ASCII (256) characters
        // for strings in languages using non-latin characters, this should be much larger
        internal const long _hashBase = 257;
        // _pow1 = powers of base modulo mod
        internal readonly List<int> _pow1 = new List<int> { 1 };
        // _pow2 = powers of base modulo 2^64
        internal readonly List<long> _pow2 = new List<long> { 1L };

        internal void EnsureLength(int length)
        {
            if (_pow1.Capacity < length)
            {
                _pow1.Capacity = _pow2.Capacity = length;
            }
            for (int currentIndx = _pow1.Count - 1; currentIndx < length; ++currentIndx)
            {
                _pow1.Add((int)(_pow1[currentIndx] * _hashBase % _mod));
                _pow2.Add(_pow2[currentIndx] * _hashBase);
            }
        }
    }

    private class RollingHashedString
    {
        readonly RollingHashPowers _pows;
        readonly int[] _pref1; // Hash on prefix modulo mod
        readonly long[] _pref2; // Hash on prefix modulo 2^64

        // Constructor from string:
        internal RollingHashedString(RollingHashPowers pows, string s, bool caseInsensitive = false)
        {
            _pows = pows;
            _pref1 = new int[s.Length + 1];
            _pref2 = new long[s.Length + 1];

            const long capAVal = 'A';
            const long capZVal = 'Z';
            const long aADif = 'a' - 'A';

            unsafe
            {
                fixed (char* c = s)
                {
                    // Fill arrays with polynomial hashes on prefix
                    for (int i = 0; i < s.Length; ++i)
                    {
                        long v = c[i];
                        if (caseInsensitive && capAVal <= v && v <= capZVal)
                        {
                            v += aADif;
                        }
                        _pref1[i + 1] = (int)((_pref1[i] + v * _pows._pow1[i]) % RollingHashPowers._mod);
                        _pref2[i + 1] = _pref2[i] + v * _pows._pow2[i];
                    }
                }
            }
        }

        // Rollingnomial hash of subsequence [pos, pos+len)
        // If mxPow != 0, value automatically multiply on base in needed power.
        // Finally base ^ mxPow
        internal Tuple<int, long> Apply(int pos, int len, int mxPow = 0)
        {
            int hash1 = _pref1[pos + len] - _pref1[pos];
            long hash2 = _pref2[pos + len] - _pref2[pos];
            if (hash1 < 0)
            {
                hash1 += RollingHashPowers._mod;
            }
            if (mxPow != 0)
            {
                hash1 = (int)((long)hash1 * _pows._pow1[mxPow - (pos + len - 1)] % RollingHashPowers._mod);
                hash2 *= _pows._pow2[mxPow - (pos + len - 1)];
            }
            return Tuple.Create(hash1, hash2);
        }
    }

    private readonly RollingHashPowers _rhp;
    public RollingHash(int longestLength = 0)
    {
        _rhp = new RollingHashPowers();
        if (longestLength > 0)
        {
            _rhp.EnsureLength(longestLength);
        }
    }

    public string FindCommonSubstring(string a, string b, bool caseInsensitive = false)
    {
        // Calculate max neede power of base:
        int mxPow = Math.Max(a.Length, b.Length);
        _rhp.EnsureLength(mxPow);
        // Create hashing objects from strings:
        RollingHashedString hash_a = new RollingHashedString(_rhp, a, caseInsensitive);
        RollingHashedString hash_b = new RollingHashedString(_rhp, b, caseInsensitive);

        // Binary search by length of same subsequence:
        int pos = -1;
        int low = 0;
        int minLen = Math.Min(a.Length, b.Length);
        int high = minLen + 1;
        var tupleCompare = Comparer<Tuple<int, long>>.Default;
        while (high - low > 1)
        {
            int mid = (low + high) / 2;
            List<Tuple<int, long>> hashes = new List<Tuple<int, long>>(a.Length - mid + 1);
            for (int i = 0; i + mid <= a.Length; ++i)
            {
                hashes.Add(hash_a.Apply(i, mid, mxPow));
            }
            hashes.Sort(tupleCompare);
            int p = -1;
            for (int i = 0; i + mid <= b.Length; ++i)
            {
                if (hashes.BinarySearch(hash_b.Apply(i, mid, mxPow), tupleCompare) >= 0)
                {
                    p = i;
                    break;
                }
            }
            if (p >= 0)
            {
                low = mid;
                pos = p;
            }
            else
            {
                high = mid;
            }
        }
        // Output answer:
        return pos >= 0
            ? b.Substring(pos, low)
            : string.Empty;
    }
}

c# - 一致する文字列アルゴリズムの検索

4 に答える 4

Related

Reference