machine-learning - Problem with the training text for AdaGram.jl

Question

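I am new to the Julia programming language and am trying to install the Adaptive Skip-gram (AdaGram) model on my machine. I have run into the following problem: before training the model, I need a tokenized file and a dictionary file. What input should be given to tokenize.sh and dictionary.sh? Please also explain how the output files are actually generated and what extension they should have.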

This is the website I am referring to: https://github.com/sbos/AdaGram.jl. It is much the same as https://code.google.com/p/word2vec/

score 5 · Accepted Answer

The package provides a few shell scripts to preprocess the data and fit the model. You have to call them from the shell, i.e., from outside Julia.

# Install the package
julia -e 'Pkg.clone("https://github.com/sbos/AdaGram.jl.git")'
julia -e 'Pkg.build("AdaGram")'

# Download some text
wget http://www.gutenberg.org/ebooks/100.txt.utf-8

# Tokenize the text, and count the words
~/.julia/v0.3/AdaGram/utils/tokenize.sh 100.txt.utf-8 text.txt
~/.julia/v0.3/AdaGram/utils/dictionary.sh text.txt dictionary.txt

# Train the model
~/.julia/v0.3/AdaGram/train.sh text.txt dictionary.txt model
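
As the commands above show, each script takes an input path and an output path, so the input to tokenize.sh is a plain text file and the input to dictionary.sh is the tokenized file. I have not reproduced the package's scripts here. The following is only a rough sketch of what such a preprocessing step might look like, assuming the tokenized file holds lowercase words separated by spaces and the dictionary holds one "word count" pair per line. Those formats and the function names sketch_tokenize and sketch_dictionary are my own assumptions for illustration; read utils/tokenize.sh and utils/dictionary.sh to see what the package actually does.

```shell
#!/bin/sh
# Illustrative stand-ins only; NOT the scripts shipped with AdaGram.jl.
# Assumed formats: tokenized text = lowercase words separated by spaces,
# dictionary = one "word count" pair per line.
LC_ALL=C
export LC_ALL   # make the a-z / A-Z ranges byte-based and predictable

# Hypothetical tokenizer: $1 = raw text file, $2 = tokenized output file.
sketch_tokenize() {
    # Lowercase, then turn every run of non-letters into a single space.
    tr 'A-Z' 'a-z' < "$1" | tr -cs 'a-z' ' ' > "$2"
}

# Hypothetical dictionary builder: $1 = tokenized file, $2 = dictionary file.
sketch_dictionary() {
    # One word per line, drop empty lines, count duplicates, print "word count".
    tr -s ' ' '\n' < "$1" | grep -v '^$' | sort | uniq -c | awk '{print $2, $1}' > "$2"
}

# Tiny demo on a throwaway file.
workdir=$(mktemp -d)
printf 'To be, or not to be.\n' > "$workdir/raw.txt"
sketch_tokenize "$workdir/raw.txt" "$workdir/text.txt"
sketch_dictionary "$workdir/text.txt" "$workdir/dictionary.txt"
cat "$workdir/dictionary.txt"
```

In this sketch the file extension plays no role: the output names are just the paths passed as the second argument, which matches how text.txt and dictionary.txt are named in the commands above.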

After that, you can use the model from Julia.

using AdaGram
vm, dict = load_model("model");
expected_pi(vm, dict.word2id["hamlet"])
nearest_neighbors(vm, dict, "hamlet", 1, 10)
