algorithm - substring finding from a string

Question

Input: string S = AAGATATGATAGGAT.

Output: Maximal repeats such as GATA (as in positions 3 and 8), GAT (as in position 3, 8 and 13) and so on...

A maximal repeat is a substring t occurs k>1 times in S, and if t is extended to left or right, it will occur less than k times.
An internal node’s leaf descendants are suffixes, each of which has a left character.
If the left characters of all leaf descendants are not all identical, it’s called a “left-diverse” node.
Maximal repeats is left-diverse internal nodes.

Overall idea:

Build a suffix tree and then do a DFS (depth first search) on the tree
For each leaf, label it with its left character
For each internal node:
If at least one child is labelled with *, then label it with *
Else if its children’s labels are diverse, label with *.
Else then all children have same label, copy it to current node

Is the above idea is correct? How does the pseudo-code to be? Then I can try to write programming myself.

score 3 · Accepted Answer

あなたのアイデアは良いですが、サフィックスツリーを使用すると、さらに簡単に実行できます。

T をシーケンスの接尾辞ツリーとします。

x を T のノードとすると、T_x はルート x を持つ T のサブツリーです。

N_x を T_x の葉の数とする

word(x) をルートからノード x まで T をトラバースすることによって作成された単語とします。

接尾辞ツリーの定義を使用すると、次のようになります。

word(x)の繰り返し回数=N_x、この単語の位置が各葉のラベル

このためのアルゴリズムは基本的なツリートラバーサルです。ツリー内の各ノードに対して N_x を計算し、N_x > 2 の場合はこれを結果に追加します (位置も必要な場合は、各リーフのラベルを追加できます)。

擬似コード:

入力:

マイシーケンス

出力：

結果 (回数と位置で繰り返される単語のリスト)

アルゴリズム :

T = suffixTree(mySequence)

T の各内部ノード X について:
  T_X = subTree(T)
  N_X = Number of lead (T_X)
  if N_X >=2 :
  Result .add ( [word(X), N_X , list(label of leafs)] )
結果を返す

例：

サフィックスツリーのウィキペディアの例を見てみましょう: "BANANA" :

ここに画像の説明を入力

我々が得る：

N_A = 3 so "A" repeats 3 times in position {1,3,5}
N_N=2  so "N" repeats 2 times  in position {2,4}
N_NA=2  so "NA" repeats 2 times in position {2,4}

あなたの問題をあなたと同じように扱っているように見えるこの論文を見つけたので、そうです、あなたの方法は write だと思います:

接尾辞ツリーを使用して、繰り返されるモチーフまたは一般的なモチーフのスペルを近似する

エキス

この記事では、2 つのアルゴリズムを紹介します。最初のものは、アルファベットシグマ上で定義されたシーケンスから繰り返されるモチーフを抽出します。例えば、シグマは｛Ａ、Ｃ、Ｇ、Ｔ｝に等しい場合があり、配列はＤＮＡ高分子のコード化を表す。検索されたモチーフは、同じアルファベットの単語に対応し、シーケンス内で最小数 q 回発生し、毎回最大で e 個の不一致があります (q は定足数制約と呼ばれます)。[...]

あなたはそれをダウンロードして見ることができます.著者はあなたのアルゴリズムの疑似コードを提供します.

お役に立てれば

algorithm - substring finding from a string

1 に答える 1

Related

Reference