nlp - How to tune a machine translation model with a huge language model?

Question

Moses is software for building machine translation models, and KenLM is the de facto language model software that Moses uses.

I have a text file containing 16GB of text, and I use it to build a language model as follows:

bin/lmplz -o 5 <text > text.arpa

The resulting file (text.arpa) is 38GB. Then I binarized the language model as follows:

bin/build_binary text.arpa text.binary

The binarized language model (text.binary) grows to 71GB.

In Moses, after training the translation model, the model weights should be tuned with the MERT algorithm. This is easily done with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.

MERT works fine with a small language model, but with the big language model it takes several days to finish.
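For context, a tuning run of the kind described above is usually launched along the lines sketched here. All paths (`tune.fr`, `tune.en`, `train/model/moses.ini`, `mert-work`, and the `MOSES` checkout location) are placeholders and not taken from this post, and the thread count is arbitrary. The script only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Placeholder paths (not from the post): adjust to the local Moses checkout,
# the tuning set, and the moses.ini produced by training.
MOSES="${MOSES:-$HOME/mosesdecoder}"
TUNE_SRC="tune.fr"              # source side of the tuning set
TUNE_REF="tune.en"              # reference translations
INI="train/model/moses.ini"     # decoder configuration from training

# Positional arguments: source, reference, decoder binary, decoder config.
set -- "$MOSES/scripts/training/mert-moses.pl" \
    "$TUNE_SRC" "$TUNE_REF" "$MOSES/bin/moses" "$INI" \
    --mertdir "$MOSES/bin" \
    --working-dir mert-work \
    --decoder-flags="-threads 4"

# Keep a printable copy of the command for inspection.
CMD="$*"
echo "$CMD"

# Dry run by default; set RUN=1 to actually start tuning.
if [ "${RUN:-0}" = 1 ]; then
    "$@"
fi
```

Every MERT iteration re-decodes the whole tuning set, so the language model is loaded and queried repeatedly, which is where a 71GB model hurts.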

I did a Google search and found KenLM's filter, which promises to filter the language model down to a smaller size: https://kheafield.com/code/kenlm/filter/

But I'm clueless about how to make it work. The command's help output is:

$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file

copy mode just copies, but makes the format nicer for e.g. irstlm's broken
    parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel.  Each sentence is on
    a separate line.  A separate file is created for each sentence by appending
    the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
    multiple mode.

context means only the context (all but last word) has to pass the filter, but
    the entire n-gram is output.

phrase means that the vocabulary is actually tab-delimited phrases and that the
    phrases can generate the n-gram when assembled in arbitrary order and
    clipped.  Currently works with multiple or union mode.

The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
    text.  This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.

threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading.  Expect memory usage from this
    of 2*threads*batch_size n-grams.

There are two inputs: vocabulary and model.  Either may be given as a file
    while the other is on stdin.  Specify the type given as a file using
    vocab: or model: before the file name.  

For ARPA format, the output must be seekable.  For raw format, it can be a
    stream i.e. /dev/stdout

But when I try the following, it gets stuck and does nothing:

$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
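Going by the help text above, the filter appears to expect an ARPA model (not a binarized one) plus a vocabulary, with one of the two arriving on stdin. The command above supplies a single file, so it is presumably waiting for the vocabulary on stdin. A minimal sketch under those assumptions follows. `phrase-table.sample`, `target.vocab`, and `text.filtered.arpa` are hypothetical names, and taking the vocabulary from the target side of a phrase table restricted to the tuning set is an assumption about what the decoder can produce, not something stated in the help output.

```shell
#!/bin/sh
# Hypothetical file names: phrase-table.sample, target.vocab, text.filtered.arpa.
# FILTER defaults to the binary path used in the post.
FILTER="${FILTER:-$HOME/moses/bin/filter}"

# Tiny stand-in for a Moses phrase table already restricted to the tuning set
# (fields: source ||| target ||| scores).
cat > phrase-table.sample <<'EOF'
das haus ||| the house ||| 0.8 0.7 0.9 0.6
das haus ||| the building ||| 0.1 0.2 0.1 0.3
ist klein ||| is small ||| 0.9 0.8 0.9 0.7
EOF

# Collect every distinct target-side token, one per line.
awk -F' [|][|][|] ' '{ n = split($2, w, " "); for (i = 1; i <= n; i++) print w[i] }' \
    phrase-table.sample | LC_ALL=C sort -u > target.vocab

# Run the filter only if the binary and the ARPA model are actually present.
# The model is given as a file with the model: prefix; the vocabulary is read
# from stdin. Single mode treats the whole vocabulary as one sentence.
if [ -x "$FILTER" ] && [ -f text.arpa ]; then
    "$FILTER" single model:text.arpa text.filtered.arpa < target.vocab
else
    echo "skipping filter: $FILTER or text.arpa not found" >&2
fi
```

If this works as the help text suggests, the filtered ARPA file could then be binarized with `bin/build_binary` in the same way as the full model.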

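What should one do with the language model after binarization? Are there any other steps for manipulating large language models that reduce the computing load during tuning?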

What is the usual way to tune with a large LM file?

How do I use KenLM's filter?

(More details at https://www.mail-archive.com/moses-support@mit.edu/msg12089.html)
