java - 簡単な DAWG 作成アルゴリズム?

Question

ファイル内の単語リストを指定して、Scrabble プレーヤーの DAWG ( http://en.wikipedia.org/wiki/Directed_acyclic_word_graph ) 構造を作成する必要があります。私はJavaを使用しています。一度だけ実行してから、ファイルに保存する必要があります。これまでのところ、2 つのアプローチを見てきました。1) Trie を構築して DAWG に縮小するか、2) すぐに DAWG を構築します。一度だけ実行する必要があるので、それを実行する最も簡単なアルゴリズムを実装したいと思います。速度とメモリの要件は重要ではありません。

また、実行時に構造をメモリに保存する方法と、それをファイルに保存する方法を知りたいですか? DAWG は基本的に、私が作成したいくつかの非常に単純なクラスのいくつかのノードとエッジ/ポインターを使用することを提案するグラフですが、配列とオフセット (この配列内) を使用する実装を見ましたが、これは複雑で読みにくいようです。今回は、(実行時と保存されたファイルの) メモリサイズと、DAWG の読み込み/DAWG の使用速度の両方に注意を払いました。

score 2 · Accepted Answer

最も簡単で効率的な DAWG 構築アルゴリズムは、このペーパーで定義されており、DAWG が表す単語のセットをソートする必要があります。既存の単語リストから DAWG を構築することを計画している場合、そのリストは既に並べ替えられているか、この目的のためのものである可能性があります。

アルゴリズムの疑似コードを、この論文に記載されているものよりも「プログラマーが使いやすい」形式で大ざっぱに転写しました (免責事項: 転写エラーが発生した可能性があります。元のコードを参照してください)。存在するかどうかを判断します) :

Given:
        startState is the state from which traversal of the DAWG is to start
        register is a map of representations (hint: hashes) OF graphs which extend 
        from states in the DAWG TO said states

While there is newWord in wordList
    Get newWord from wordList
    Determine longestPrefix of newWord, starting from startState, which already exists in DAWG
    Get longestPrefixEndState, the state which the sequence of transitions defined by longestPrefix leads to
    Get suffix of newWord, the substring of newWord after longestPrefix
    if longestPrefixEndState has children 
        replace_or_register(longestPrefixEndState)
    endIf
    Create the sequence of transitions and states defined by suffix, starting from longestPrefixEndState
endWhile
replace_or_register(startState)


function replace_or_register(argState)
    Get greatestCharEndState of argState, the state which the lexicographically-greatest-char-labelled-transition in the outgoing transition set of argState leads to
    if greatestCharEndState has children
        replace_or_register(greatestCharEndState)
    endIf
    if there exists state in DAWG that is in the register and is equivalent (has an identical graph extending from it) to greatestCharEndState
        Redefine the transition that extends from argState to greatestCharEndState, as one that extends from argState to state
        Delete greatestCharEndState
    endIf
    else 
        add greatestCharEndState to the register
    endElse

Java を使用している場合、Serializableインターフェイスを利用して、シリアライゼーションとデシリアライゼーションのすべてのニーズを処理できます。

上記のアルゴリズムを実装する Java での既存の DAWG 実装に興味がある場合は、MDAGをチェックしてください。これは、他の実装にはないいくつかの気の利いた機能 (オンザフライでの文字列の追加と削除を含む) も提供し、によって維持されています。自分！

score 0 · Accepted Answer

クライアントの 1 人に対して、このような構造を C で実装する必要がありました。最終的な構造は、文字セットと dawg を記述する xml ファイルから読み込まれ、別のプロセスで単語リストから xml ファイルが作成されます。

ステップ 1 : xml ファイルにシリアル化された最初の Dawg を構築するための構造

使用したもの:

typedef struct _s_build_node build_node_t;
struct _s_build_node {
  char_t letter;
  build_node_t* first_child;
  build_node_t* right_sibling;

  hash_t hash;
  size_t depth;
  size_t ID;
};
typedef struct _s_build_dawg {
  charset_t charset;
  node_t* all_nodes; // an array of all the created nodes
  node_t* root;
} build_dawg_t;

siblibgs は昇順で並べられ、単語の終わりの特殊文字は他のどの文字よりも少なくなります。アルゴリズムは非常に単純です:

// create the build dawg
foreach word in wordlist
  insert(dawg, word)
// compact the dawg
compact(dawg)
// generate the xml file
xml_dump(dawg)

仲の良い友達をコンパクトにするために、各ノードのハッシュ値を計算しました。同じハッシュを持つ 2 つのノードは因数分解できます。この部分は難しい場合があります。深さが最も低いノードのみが保持され、他のノードは削除され、それらの親は保持されたノードを指すようになります。
圧縮したら、各ノードに一意の ID を割り当てます (bfs を介して、ID は 0 から N-1 の間で、N は圧縮された Dawg 内のノードの数です)。xml ファイルは、単純に trie を記述したものです。

<dawg>
  <charset ....>
    ...
  </charset>

  <node ID="node_id" letter="letter" fist_child="first_child_ID" next_sibling="next_sibling_id" />
  <node .... />
  <node .... />
  <node .... />
</dawg>

ステップ 2 : 最後の dagw

構造は少しシンプルに

typedef struct {
  char_t letter;
  size_t first_child;
  size_t next_sibling;
} node_t;

typedef struct {
  node_t nodes[];
  ... whatever you need ...
} dawg_t;

ここでルートはdawg.nodes[0]で、first_child/next_siblingはnodes配列のインデックスです。このような構造体は、xml ファイルから簡単に作成できます。主な欠点は、単語リストを変更すると、新しい xml ファイルが生成されることです。

java - 簡単な DAWG 作成アルゴリズム?

2 に答える 2

ステップ 1 : xml ファイルにシリアル化された最初の Dawg を構築するための構造

ステップ 2 : 最後の dagw

Related

Reference