java - Javaで指定された順序付けられたサブシーケンスのスーパーシーケンスを見つけます

Question

私が書いている無関係なプログラムでこの問題に遭遇し、それを解決するのにかなりの時間を費やしました。だったのですが、最後までやりきれませんでした。私のコードは、一部のサブセットのシーケンスのみを解決します。この問題は、おそらく何十年にもわたってさまざまな方法で解決されてきた一般的な数学の問題のようにも感じますが、オンラインでこの特定の問題に関する解決策や実際に何かを見つけるための数学のスキルと用語が不足しています.

より大きな未知の (スーパー?) シーケンスの一部であることがわかっている一連のサブシーケンスがあります。これらのサブシーケンスは順序付けられているため、数学的な意味でのセットではないと思いますが、重複する要素が含まれていないという点で類似しています。master/super/whateversequence についても同様です。(わかりやすくするために、これをスーパーシーケンスと呼びます。)

サブシーケンスにはすべて同じタイプのデータが含まれますが、データはアルファベット順、昇順、またはそのような順序ではありません。ある意味で、データは任意の順序、つまりスーパーシーケンスの順序になっています。それが私が興味を持っていることです。私はこれらの部分配列の未知の超配列を見つけたいと思っています。

簡単にするために、アルファベットの文字を使用してこの問題を解決しようとしましたが、後で必要に応じてコードをリファクタリングできます。明らかに、私はまだこの問題を解決しようとしているので、重複要素を含まないスーパーシーケンスに適した単語を考え出すことから始めました: FLOWCHARTS。

次に、次の 6 つのサブシーケンスを思いつきました。

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

これが私のシーケンス順序付け方法です：

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

最後に、私のコードはF,L,O,W,H,C,A,R,T,Sこれらのサブシーケンスのスーパーシーケンスを見つけました。これはかなり近いですが完全ではありません。また、注文方法を複数回実行する必要があるため、思いついた「アルゴリズム」も完璧ではありません。「ルールマップ」とは、キーが、サブシーケンス (したがってスーパーシーケンス) のキー Character の後に来る Character オブジェクトの別のハッシュマップであるハッシュマップです。

この種のシーケンス検索を行うために使用できる、ある種の Java ライブラリはありますか? これが何と呼ばれているか、および/または仕事に適したアルゴリズムを見つけるのを手伝ってくれるという点で、誰かが私を正しい方向に向けることができますか?

さらに、私のプログラムの短縮されたコンソール出力:

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false

score 0 · Accepted Answer

私の質問で説明している問題は、Shortest Common Supersequence problemです。コードの作業中に数学的集合に集中しすぎたため、これをオンラインで検索しませんでした。質問を書き始めて初めて、シーケンスで作業していることに気付きました。実はその場で「スーパーシーケンス」という言葉をでっちあげただけだと思っていました。この問題に関連する数ページの資料をオンラインで検索する必要があったのは、まさにこの単語であることがわかりました。

オリバー・チャールズワースのコメントは完全に正しいです。与えられたサブシーケンスには、正しいスーパーシーケンスを作成するための情報が不足しているため、私の例では、実際のスーパーシーケンスを見つける方法はありません。バイオインフォマティクスに関するコーヒーのコメントは、DNA 配列の再構築に適用される最短共通スーパーストリング問題に言及している可能性が最も高く、この質問で議論されています。最短共通スーパーシーケンス問題とスーパーストリング問題は、一見非常によく似ていますが、スーパーストリング問題では、スーパーシーケンス問題とは異なり、サブストリングはスーパーストリングの隣接する要素で構成されている必要があります。(下図)

どちらの問題もNP 完全または「NP 困難」です。私が理解している方法は、これらの問題には最適な解決策や、コードにコピーして貼り付けるだけの魔法のアルゴリズムがないということです。近似値と「十分な」ソリューションしかありません。

驚くべきことに、この問題に近い完全な Java コードを見つけることはできませんでしたが、他の言語のコードを含むいくつかのサイトに出くわしました。この問題の調査中に見つけたいくつかの有用なリソースを以下にリストしました。また、関連する最短共通スーパーストリング問題に関連するリソースも含めました。

java - Javaで指定された順序付けられたサブシーケンスのスーパーシーケンスを見つけます

2 に答える 2

さらなる研究のための追加リソース

java - Javaで指定された順序付けられたサブシーケンスのスーパーシーケンスを見つけます

2 に答える 2

さらなる研究のための追加リソース

Related

Reference