string - キーワードのリストを検索する

Question

受信したすべてのテキストメッセージをキーワードのリストと比較し、テキストメッセージにリスト内のキーワードの 1 つが含まれている場合は、それを削除する必要があるメッセージングアンチスパムソリューションを構築しています。

問題は、キーワードのリストを検索するための最適なアルゴリズムは何ですか? 例は以下です

text message received is "hi how are you, visit us at www.xyz.com"

リストのサンプルは以下のとおりです

www.abc.com
www.xyz.com
...
...

score 1 · Accepted Answer

多くのキーワードがあり、特に共通のプレフィックスがある場合は、ここでトライがうまく機能する可能性があります。

単語だけでなく、部分文字列が必要だと仮定します。つまり、キーワードが与えられたbah場合bah、bahama. これを防ぐためにこれを変更することは難しくありません。

また、キーワードがなく、キーワードとして部分文字列であると仮定します (つまりbah、bahama両方をキーワードにすることはできません)。このためのケータリングもそれほど難しくありません。

文字列内の各文字について、ツリーの一番上から検索を開始し、ツリー内の既存の各ポインターを検索し続けます。ポインターの 1 つが有効な単語に到達したら、それを希望どおりに処理し、おそらくツリー内のすべてのポインターを削除します。

複雑：

O(max(n², mn))wheremはツリー内のノードの数です。最悪の場合、平均的なケースのパフォーマンスははるかに優れているはずです。

例：

では、キーワードがあるとしましょう:

ab
b
caa

次のようなツリーが得られるかもしれません:

      o
     /|\
  a / | \ c
   /  |b \
  o   o   o
  | b     | a
  o       o
          | a
          o

(oは単なるノードです)

ここで、入力文字列についてcaab、まずc: (xツリー内のポインターを示します)を調べます。

      o
     /|\
  a / | \ c
   /  |b \
  o   o   x
  | b     | a
  o       o
          | a
          o

右側の新しいポインターに注意してください。

次にa：

      o
     /|\
  a / | \ c
   /  |b \
  x   o   o
  | b     | a
  o       x
          | a
          o

左側の新しいポインタと右側の高度なポインタに注意してください。

次にa：

      o
     /|\
  a / | \ c
   /  |b \
  o   o   o
  | b     | a
  o       o
          | a
          x

左側のポインターが消え、右側のポインターが進んでいることに注意してください。

有効な単語が見つかったので、右側の単語を削除します。

次にb：

      o
     /|\
  a / | \ c
   /  |b \
  o   x   o
  | b     | a
  o       o
          | a
          o

真ん中の新しいポインターに注意してください。有効な単語が見つかったので、これも削除します。

score 0 · Accepted Answer

いくつのキーワードについて話しているのですか? Boyer-Moore String Search アルゴリズムを調べてください。目的に適している可能性があり、実装も難しくありません。ウィキペディアの記事から抜粋した Java 実装を次に示します。

 /**
   * Returns the index within this string of the first occurrence of the
   * specified substring. If it is not a substring, return -1.
   *
   * @param haystack The string to be scanned
   * @param needle The target string to search
   * @return The start index of the substring
   */
  public static int indexOf(char[] haystack, char[] needle) {
    if (needle.length == 0) {
      return 0;
    }
    int charTable[] = makeCharTable(needle);
    int offsetTable[] = makeOffsetTable(needle);
    for (int i = needle.length - 1, j; i < haystack.length;) {
      for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
        if (j == 0) {
          return i;
        }
      }
      // i += needle.length - j; // For naive method
      i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
    }
    return -1;
  }

  /**
   * Makes the jump table based on the mismatched character information.
   */
  private static int[] makeCharTable(char[] needle) {
    final int ALPHABET_SIZE = 256;
    int[] table = new int[ALPHABET_SIZE];
    for (int i = 0; i < table.length; ++i) {
      table[i] = needle.length;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
      table[needle[i]] = needle.length - 1 - i;
    }
    return table;
  }

  /**
   * Makes the jump table based on the scan offset which mismatch occurs.
   */
  private static int[] makeOffsetTable(char[] needle) {
    int[] table = new int[needle.length];
    int lastPrefixPosition = needle.length;
    for (int i = needle.length - 1; i >= 0; --i) {
      if (isPrefix(needle, i + 1)) {
        lastPrefixPosition = i + 1;
      }
      table[needle.length - 1 - i] = lastPrefixPosition - i + needle.length - 1;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
      int slen = suffixLength(needle, i);
      table[slen] = needle.length - 1 - i + slen;
    }
    return table;
  }

  /**
   * Is needle[p:end] a prefix of needle?
   */
  private static boolean isPrefix(char[] needle, int p) {
    for (int i = p, j = 0; i < needle.length; ++i, ++j) {
      if (needle[i] != needle[j]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns the maximum length of the substring ends at p and is a suffix.
   */
  private static int suffixLength(char[] needle, int p) {
    int len = 0;
    for (int i = p, j = needle.length - 1;
         i >= 0 && needle[i] == needle[j]; --i, --j) {
      len += 1;
    }
    return len;
  }

string - キーワードのリストを検索する

2 に答える 2

Related

Reference