machine-learning - How are the vocabulary and integer (one-hot) representations stored, and what do the ('string', int) tuples mean in torchtext.vocab()?

Question

I am trying to train an RNN for binary classification. My vocabulary is built from 1,000,000 words. Please see the output below...

text_field = torchtext.data.Field(tokenize=word_tokenize)

print(text_field.vocab.freqs.most_common(15))
>>
[('.', 516822), (',', 490533), ('the', 464796), ('to', 298670), ("''", 264416), ('of', 226307), ('I', 224927), ('and', 215722), ('a', 211773), ('is', 180965), ('you', 180359), ('``', 165889), ('that', 156425), ('in', 138038), (':', 132294)]

print(text_field.vocab.itos[:15])
>>
['<unk>', '<pad>', '.', ',', 'the', 'to', "''", 'of', 'I', 'and', 'a', 'is', 'you', '``', 'that']

text_field.vocab.stoi
>>
{'<unk>': 0,'<pad>': 1,'.': 2,',': 3,'the': 4,'to': 5,"''": 6,'of': 7,'I': 8,'and': 9,'a': 10, 'is': 11,'you': 12,'``': 13,'that': 14,'in': 15,....................

The documentation says:

freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.
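The relationship between the three attributes can be sketched in plain Python. This is a toy reconstruction, not torchtext's own code: the corpus, the `specials` list, and the alphabetical tie-breaking are assumptions made for illustration. The only thing taken from the output above is the general pattern, in which the special tokens come first and the remaining tokens follow in descending frequency order.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration; not the data from the question).
tokens = "the cat sat on the mat . the dog sat .".split()

# freqs: token -> how many times it occurs in the data.
freqs = Counter(tokens)

# itos ("index to string"): a list whose positions are the numerical ids.
# Assumption: special tokens first, then tokens by descending frequency,
# with ties broken alphabetically.
specials = ["<unk>", "<pad>"]
by_freq = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
itos = specials + [tok for tok, _ in by_freq]

# stoi ("string to index"): the inverse mapping of itos.
# defaultdict(int) sends any unseen token to 0, the position of '<unk>'.
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})

print(freqs.most_common(3))  # (token, count) pairs
print(itos)                  # ids are the list positions
print(stoi["the"], itos[stoi["the"]])
print(stoi["zebra"])         # unseen token falls back to 0
```

In this sketch the integer in a `freqs` tuple is a count, while the integer in `stoi` is a position in `itos`; the two numbers are unrelated except that higher counts lead to smaller positions.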

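This doesn't make sense to me.

Could someone give me the intuition behind each of these?

For example, if 'the' is represented by 4, does that mean that when a sentence contains the word 'the',

it will be 1 at position 4? Or
1 at position 464796? Or
4 at position 464796?

What happens when 'the' occurs multiple times?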
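One way to read the three possibilities, sketched in plain Python rather than with torchtext itself: the mini-vocabulary below copies only the first few entries of the stoi output above, and the `one_hot` helper and the example sentence are hypothetical. Under this reading, 464796 is only a count and never a position; 4 is the position, and a one-hot vector for 'the' would carry a 1 there.

```python
# First entries of the stoi output shown in the question (truncated on purpose).
stoi = {"<unk>": 0, "<pad>": 1, ".": 2, ",": 3, "the": 4, "to": 5}

# Hypothetical tokenized sentence in which 'the' occurs twice.
sentence = ["the", "to", "the", "."]

# Each occurrence of a token is replaced by its index; unknown tokens map to '<unk>'.
indices = [stoi.get(tok, stoi["<unk>"]) for tok in sentence]

def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# One vector per token position in the sentence, each as long as the vocabulary.
one_hots = [one_hot(i, len(stoi)) for i in indices]

print(indices)      # [4, 5, 4, 2]
print(one_hots[0])  # [0, 0, 0, 0, 1, 0]
```

Repeated words simply produce the same index (and the same one-hot vector) at each position where they occur. In practice the index sequence is usually passed to an embedding layer, so the one-hot vectors are rarely materialized.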
