string - パターンのリストに基づいてストライプ化された文字列をトークン化する方法

Question

文字列 S とパターン [L ₁ , ..., L _n ] のリスト L が与えられた場合、L 内のパターンに一致する S 内のすべてのトークンのリストをどのように見つけ、S 内の一致する文字の総数が最大化？

ダミーの例は S = "thenuke", L = {"the", "then", "nuke"} で、"then" のマッチングから開始するかのように ["the", "nuke"] を取得したいと考えています。、一致する S の文字の総数を最大化する解は得られません。

私は他のSOの質問、文字列マッチングアルゴリズムを見てきましたが、問題の最大化部分を効率的に解決するものは何も見つかりませんでした. これはバイオインフォマティクスなどで研究されているに違いありませんが、私はその分野に携わっていないので、どんな助けでも (学術論文へのリンクを含む) 深く感謝します!

score 2 · Accepted Answer

これは、O(|S| + |L| + k) 時間で解決できます。ここで、k は、S 内の L からのすべての文字列の一致の総数です。2 つの主要なステップがあります。

Aho-Corasick を実行します。これにより、S 内の L からの任意の文字列のすべての一致が得られます。これは、上記と同じ時間で実行されます。
長さ |S| の整数の配列 A を初期化します。+ 1 をすべてゼロにします。配列を行進し、位置 i で A[i] を A[i-1] に設定します (大きい場合)。一致ごとに、M を位置 i の S の L から A[i+|M|] に設定します。 A[i+|M|] と A[i] + |M| の最大値。

まさにこれを行う Go のコードを次に示します。Aho-Corasick を呼び出すための便利なラッパーを含む、私が作成したパッケージを使用します。

package main

import (
  "fmt"
  "github.com/runningwild/stringz"
)

func main() {
  input := []byte("thenuke")
  patterns := [][]byte{[]byte("hen"), []byte("thenu"), []byte("uke")}

  // This runs Aho-Corasick on the input string and patterns, it returns a
  // map, matches, such that matches[i] is a list of indices into input where
  // patterns[i] matches the input string.  The running time of this is
  // O(|input| + |patterns| + k) and uses O(|patterns| + k) auxillary storage,
  // where k is the total number of matches found.
  find := stringz.FindSet(patterns)
  matches := find.In(input)

  // We want to invert the map so that it maps from index to all strings that
  // match at that index.
  at_pos := make([][]int, len(input)+1)
  for index, hits := range matches {
    for _, hit := range hits {
      at_pos[hit] = append(at_pos[hit], index)
    }
  }

  // Now we do a single pass through the string, at every position we will
  // keep track of how many characters in the input string we can match up to
  // that point.
  max := make([]int, len(input)+1)
  for i := range max {
    // If this position isn't as good as the previous position, then we'll use
    // the previous position.  It just means that there is a character that we
    // were unable to match anything to.
    if i > 0 && max[i-1] > max[i] {
      max[i] = max[i-1]
    }

    // Look through any string that matches at this position, if its length is
    // L, then L positions later in the string we can have matched L more
    // character if we use this string now.  This might mean that we don't use
    // another string that earlier we thought we'd be matching right now, we'll
    // find out later which one was better.
    for _, hit := range at_pos[i] {
      L := len(patterns[hit])
      if i+L < len(max) && max[i+L] < max[i]+L {
        max[i+L] = max[i] + L
      }
    }
  }

  fmt.Printf("%v\n", max)
}

score 1 · Accepted Answer

動的計画法によって時間 O(|L||S|) でこれを解くことができます: S = s ₁ s ₂ ...s _nの各初期部分文字列に最適な一致を与えるテーブルを繰り返し作成します。

B(0) は、S のゼロ長の初期部分文字列に対する最良の一致であり、空の一致です。
i < kごとに最適な一致 B( i ) が既に計算されているとします。ここで B( k )を計算します。pを長さ | の L のパターンとする。p | とし、j = k − |とする。p | + 1. p = s _j ...s _kの場合、B( j ) の後にpが続く s ₁ s ₂ ...s _kに一致します。B( k ) を、L のすべてのパターンを考慮した後に見つかった最適な一致とします。
B( n ) は、S 全体に最適です。

string - パターンのリストに基づいてストライプ化された文字列をトークン化する方法

2 に答える 2

Related

Reference