c++ - 3 つ以上の文字列からの最も長い共通部分文字列 - C++

Question

C++ の一連のファイル名から最長の共通部分文字列を計算する必要があります。

正確には、std::strings の std::list があります (または QT に相当するものも問題ありません)。

char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));

すべての文字列の n 個の異なる最長共通部分文字列を計算する必要があります。この場合、たとえば n=2 の場合です。

 "File" and ".xls"

最長の共通部分列を計算できれば、それを切り取り、アルゴリズムを再度実行して 2 番目に長い部分列を取得できます。つまり、基本的には次のようになります。

std::strings の std::list の LCS を計算するための (リファレンス?) 実装はありますか?

これは良い答えではありませんが、私が持っている汚い解決策です-最後の「/」の後の部分のみが取得されるQUrlのQListに対するブルートフォース。これを「適切な」コードに置き換えたいと思います。

(私はhttp://www.icir.org/christian/libstree/を発見しました - これは非常に役立ちますが、私のマシンでコンパイルすることはできません。誰かがこれを使用したのでしょうか?)

QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
    {
    QString a;

    int foundPosition = -1;
    int foundLength = -1;
    for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
    {
        bool hit=true;
        int xj;
        for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
        {
            if (!hit) break;

            QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
            //qDebug() << "SEARCH " << firstString;

            for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
            {
                if (!hit) break;

                //qDebug() << " IN  " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
                //qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
                if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
                    xj = j;
                    //qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
                    hit = false;
                }
            }

        }
        if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
        if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
        {
            foundPosition = i; // at the current position
            foundLength = xj-1;
            //qDebug() << "Found at " << i << " length " << foundLength;
        }
    }

    a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
    //qDebug() << a;
    return a;
}

score 1 · Accepted Answer

あなたが言うように、接尾辞ツリーが重すぎるか、そうでなければ実用的でない場合は、次のかなり単純なブルートフォースアプローチがアプリケーションに適している可能性があります。

個別の部分文字列は重複せず、左から右に選択されると想定しています。

これらの仮定があっても、文字列のセットの「 N 個の異なる最長共通部分文字列」を含む一意のセットが存在する必要はありません。Nが何であれ、最大長がすべて同じNを超える異なる共通部分文字列が存在する可能性があり、それらの中からのNの選択は任意です。したがって、解決策は、同じ長さのすべてが 1 つのセットである最長の異なる共通部分文字列の最大N 個の *セット* を見つけます。

アルゴリズムは次のとおりです。

Qは長さのターゲットクォータです。
Stringsは文字列の問題セットです。
Resultsは、長さを文字列のセットにマップする最初は空のマルチマップです。Results [l]は長さlのセットです。
Nは、最初は 0 で、結果で表される異なる長さの数です
Qが 0 またはStringsが空の場合、結果を返します
Stringsの最短メンバーを見つけます。Sのコピーを保持し、Stringsから削除します。{ Strings , S }のすべての共通部分文字列はSの部分文字列でなければならないため、 Sの部分文字列をStringsの部分文字列と比較することから始めます。
オフセットと長さによって制御される明らかなネストされたループを使用して、最初に最長のSのすべての部分文字列を繰り返し生成します。Sの各部分文字列ssについて :
- ssがStringsの共通部分文字列でない場合は、次へ。
- Results[l] for l >= ssの長さをResultsの終わりまで、またはssが検査結果の部分文字列であることが判明するまで繰り返します。後者の場合、ssはすでに手元にある結果と区別できないので、次へ。
- ssは、すでに手元にあるものとは異なる一般的な部分文字列です。l < ssの長さのResults[l]を反復処理し、 ssの部分文字列である各結果を削除します。これらはすべてssよりも短く、それと区別できないためです。ssは現在、既に手元にあるものとは異なる共通の部分文字列であり、手元に残っている他のすべてのものはssとは異なります。
- l = ssの長さの場合、 Results[l]が存在するかどうか、つまりssと同じ長さの結果が手元にあるかどうかを確認します。そうでない場合は、それをNewLength条件と呼びます。
- N == Qかどうかも確認してください。つまり、異なる長さの目標クォータにすでに達しています。NewLengthが得られ、さらにN == Qである場合、それをStickOrRaise条件と呼びます。
- StickOrRaiseが得られた場合、 ssの長さとl = 手元にある最短の結果の長さを比較します。ssがlよりも短い場合は、クォータに対して短すぎるため、次へ進みます。ssがlよりも長い場合、手元にあるすべての最短の結果はssを優先して除外されるため、 Results[l]を削除し、 Nを減らします。
- その長さでキー付けされた結果にssを挿入します。
- NewLengthが得られた場合、 Nをインクリメントします。
- ssと同じオフセットを持つが短いSの部分文字列に対する内側の反復を放棄します。
- ssの長さだけ、外側の反復のSのオフセットを次の重複しない部分文字列の先頭まで進めます。
結果を返します。

ソリューションを実装し、文字列のリストでそれを示すプログラムを次に示します。

#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>

using namespace std;

// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
    auto where = strings.end();
    size_t min_len = size_t(-1);
    for (auto i = strings.begin(); i != strings.end(); ++i) {
        if (i->size() < min_len) {
            where = i;
            min_len = i->size();
        }
    }
    return where;
}

// Say whether a string is a common substring of a list of strings
bool 
is_common_substring_of(
    string const & candidate, list<string> const & strings)
{
    for (string const & s : strings) {
        if (s.find(candidate) == string::npos) {
            return false;
        }
    }
    return true;
}


/* Get a multimap whose keys are the at-most `quota` greatest 
    lengths of common substrings of the list of strings `strings`, each key 
    multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string> 
n_longest_common_substring_sets(list<string> & strings, unsigned quota)
{
    size_t nlengths = 0;
    multimap<size_t,string> results;
    if (quota == 0) {
        return results;
    }
    auto shortest_i = shortest_of(strings);
    if (shortest_i == strings.end()) {
        return results;
    }
    string shortest = *shortest_i;
    strings.erase(shortest_i);
    for ( size_t start = 0; start < shortest.size();) {
        size_t skip = 1;
        for (size_t len = shortest.size(); len > 0; --len) {
            string subs = shortest.substr(start,len);
            if (!is_common_substring_of(subs,strings)) {
                continue;
            }
            auto i = results.lower_bound(subs.size());
            for (   ;i != results.end() && 
                    i->second.find(subs) == string::npos; ++i) {}
            if (i != results.end()) {
                continue;
            }
            for (i = results.begin(); 
                    i != results.end() && i->first < subs.size(); ) {
                if (subs.find(i->second) != string::npos) {
                    i = results.erase(i);
                } else {
                    ++i;
                }
            }
            auto hint = results.lower_bound(subs.size());
            bool new_len = hint == results.end() || hint->first != subs.size();
            if (new_len && nlengths == quota) {
                size_t min_len = results.begin()->first;
                if (min_len > subs.size()) {
                    continue;
                }
                results.erase(min_len);
                --nlengths;
            }
            nlengths += new_len;
            results.emplace_hint(hint,subs.size(),subs);
            len = 1;
            skip = subs.size();
        }
        start += skip;
    }
    return results; 
}

// Testing ...

int main()
{
    list<string> strings{
        "OfBitWordFirstFileWordZ.xls", 
        "SecondZWordBitWordOfFileBlue.xls", 
        "ThirdFileZBitWordWhiteOfWord.xls", 
        "WordFourthWordFileBitGreenZOf.xls"};

    auto results = n_longest_common_substring_sets(strings,4);
    for (auto const & val : results) {
        cout << "length: " << val.first 
        << ", substring: " << val.second << endl;
    }
    return 0;
}

出力：

length: 1, substring: Z
length: 2, substring: Of
length: 3, substring: Bit
length: 4, substring: .xls
length: 4, substring: File
length: 4, substring: Word

(gcc 4.8.1 でビルド)

c++ - 3 つ以上の文字列からの最も長い共通部分文字列 - C++

1 に答える 1

Related

Reference