I am learning basic text classification for NLP in PyTorch by following Lazy Programmer's tutorial. My results differ from the tutorial's, and when I changed the order of the data, the output changed in a way I did not expect.
import torchtext.legacy.data as ttd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# Toy dataset: three labelled sentences
data = {
    'label': [0, 1, 1],
    'data': [
        'ham and eggs or just morning',
        'I like eggs and ham.',
        'Eggs I like!',
    ]
}

df = pd.DataFrame(data)
df.to_csv('thedata.csv', index=False)

# Text field: lowercased, tokenized with spaCy, padded at the front
TEXT = ttd.Field(
    sequential=True,
    batch_first=True,
    lower=True,
    tokenize='spacy',
    pad_first=True
)

# Label field: already numeric, so no vocabulary is needed
LABEL = ttd.Field(
    sequential=False,
    use_vocab=False,
    is_target=True
)

dataset = ttd.TabularDataset(
    path='thedata.csv',
    format='csv',
    skip_header=True,
    fields=[
        ('label', LABEL),
        ('data', TEXT)
    ]
)

# Split with the default ratio, then build the vocabulary from the training part only
train_dataset, test_dataset = dataset.split()
TEXT.build_vocab(train_dataset)

vocab = TEXT.vocab
vocab.stoi
This is the first version of my code. Note that 'ham and eggs or just morning' is the first entry in the data list. After running the code, the final vocab.stoi call gives me one result (the output for the first code).
import torchtext.legacy.data as ttd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# Same three sentences, with 'ham and eggs or just morning' moved to the end
data = {
    'label': [0, 1, 1],
    'data': [
        'I like eggs and ham.',
        'Eggs I like!',
        'ham and eggs or just morning',
    ]
}

df = pd.DataFrame(data)
df.to_csv('thedata.csv', index=False)

# Text field: lowercased, tokenized with spaCy, padded at the front
TEXT = ttd.Field(
    sequential=True,
    batch_first=True,
    lower=True,
    tokenize='spacy',
    pad_first=True
)

# Label field: already numeric, so no vocabulary is needed
LABEL = ttd.Field(
    sequential=False,
    use_vocab=False,
    is_target=True
)

dataset = ttd.TabularDataset(
    path='thedata.csv',
    format='csv',
    skip_header=True,
    fields=[
        ('label', LABEL),
        ('data', TEXT)
    ]
)

# Split with the default ratio, then build the vocabulary from the training part only
train_dataset, test_dataset = dataset.split()
TEXT.build_vocab(train_dataset)

vocab = TEXT.vocab
vocab.stoi
In the second version, I moved 'ham and eggs or just morning' to the third position in the data list. When I run this code, vocab.stoi gives a different result (the output for the second code). I would like to know why this happens and how build_vocab works in PyTorch (torchtext). This is my first question, so please let me know if anything is unclear.
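For illustration, here is a minimal pure-Python sketch of how a frequency-based vocabulary could be built, and why the result depends on which rows end up in the training split. It does not use torchtext. The helpers tokenize and build_stoi are hypothetical, the regex tokenizer is only a stand-in for spaCy, and the ordering rule (special tokens first, then descending frequency, ties broken alphabetically) is an assumption about how build_vocab behaves, not something confirmed here.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for the spaCy tokenizer: lowercase the text,
    # then split it into word tokens and single punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_stoi(sentences, specials=("<unk>", "<pad>")):
    # Assumed ordering: special tokens first, then tokens by descending
    # count, with ties broken alphabetically.
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: i for i, tok in enumerate(itos)}


rows = [
    "ham and eggs or just morning",
    "I like eggs and ham.",
    "Eggs I like!",
]

# Pretend the train/test split kept two different pairs of rows for training.
stoi_a = build_stoi(rows[:2])  # first and second sentence
stoi_b = build_stoi(rows[1:])  # second and third sentence

print(stoi_a)
print(stoi_b)
```

In this sketch the two mappings differ in both size and indices, because only the sentences passed to build_stoi contribute tokens and counts. If dataset.split() selects different rows for the training set in the two versions of the code, the same effect would be expected in vocab.stoi.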