python - FastAI NLP transfer learning: does the transfer-learned model have the pretrained model's vocabulary?

Question

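In the Lesson 8 video tutorial, Jeremy mentioned that you can train your own model starting from the pretrained Wiki model. I remember him saying something to the effect that, after transfer learning, your own language model has not only the corpus from the pretrained Wiki model but also its vocabulary.

However, when I tried it myself, I found that after training, the language model did not recognize the words liked and movie, and that its vocabulary had only 4400 words.

Here is the code: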

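from fastai.text.all import *
import pandas as pd

def get_questions(path):
    # Ignores `path` and returns the text column of the DataFrame loaded below
    return words_df['text'].tolist()

word_path = 'words_oversampled.csv'
words_df = pd.read_csv(word_path)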

dls_lm = DataBlock(
    blocks = TextBlock.from_df(words_df, is_lm=True),
    get_items=get_questions,
    splitter=RandomSplitter(0.2)
).dataloaders(word_path, bs=80)

# The resulting vocabulary has 4400 entries
lm_vocab = dls_lm.vocab
len(lm_vocab), lm_vocab[-20:]

The output of len(lm_vocab) is 4400.

After training the language model, I tried predicting the next words.
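To make the number 4400 easier to reason about, here is a minimal plain-Python sketch of how a vocabulary can end up containing only words from the training CSV. It is not fastai's implementation. It assumes the vocabulary is built from the tokens of the dataset with a minimum-frequency cutoff (fastai's `Numericalize` exposes `min_freq` and `max_vocab` parameters); `build_vocab` and the toy corpus are hypothetical.

```python
from collections import Counter

UNK = "xxunk"  # the unknown-token marker that appears in the generated text

def build_vocab(texts, min_freq=3, max_vocab=60000):
    """Hypothetical helper: keep tokens seen at least `min_freq` times in `texts`."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return [UNK] + kept

# Toy corpus standing in for the CSV: COVID questions are frequent,
# while "liked" and "movie" appear only once.
corpus = ["what is covid"] * 3 + ["is covid a bio weapon"] * 3 + ["i liked this movie"]
vocab = build_vocab(corpus)

print(len(vocab))
print("covid" in vocab, "liked" in vocab, "movie" in vocab)
```

Under this assumption, a word that is rare or absent in the CSV never enters the vocabulary, regardless of how many words the pretrained model knew.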

TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [lm_learner.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]
print("\n".join(preds))

The output is:

i xxunk this xxunk because of covid wil covid man made how should medical centers respond to a covid patient what is the realistim wellness impact fi covid what happens when works Do get pay ca nt pay the covid
i xxunk this xxunk because of covid what is the best way to deal with stress during lockdown can antibiotics kill covid é covid a bio weapon why covid is worse than flu want are the descriptive statitics for the

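As the output shows, my language model does not recognize the words liked and movie. I am sure the language model trained on Wiki has far more than 4400 words, and that liked and movie should be in the trained model's vocabulary.

So, what did I miss?

You can replace my CSV file with almost any dataset and try it yourself.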
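The `xxunk` tokens in the output can be reproduced with a small plain-Python sketch of an out-of-vocabulary lookup. This is an illustration only: `replace_unknown` and `toy_vocab` are hypothetical, and the real fastai tokenizer does more than lower-casing and splitting on spaces.

```python
UNK = "xxunk"

def replace_unknown(text, vocab):
    """Hypothetical helper: lower-case, split on spaces, swap unknown tokens for UNK."""
    known = set(vocab)
    return " ".join(tok if tok in known else UNK for tok in text.lower().split())

# Toy vocabulary that, like the 4400-word one, lacks "liked" and "movie"
toy_vocab = [UNK, "i", "this", "because", "of", "covid"]
result = replace_unknown("I liked this movie because", toy_vocab)
print(result)  # i xxunk this xxunk because
```

With the real objects from the question, a membership check such as `'liked' in dls_lm.vocab` should show directly whether a word made it into the vocabulary.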
