python - Why is the result different for same dataset in torchtext.legecy.text when i change the position of data in the csv file?

Question

I am trying to learn PyTorch NLP basic text classification and following Lazy Programmer's Tutorial and I got a different result from the tutorial and when I tried to change the data, I encountered a strange change in the output.


import torchtext.legacy.data as ttd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime


data = {
    'label':[0, 1,1 ],
    'data':[   'ham and eggs or just  morning',
            'I like eggs and ham.',
            'Eggs I like!',
          
           
    ]
}

df = pd.DataFrame(data)
df.to_csv('thedata.csv', index=False)
TEXT = ttd.Field(
    sequential =True,
    batch_first =True,
    lower = True,
    tokenize ='spacy',
    pad_first = True
)
LABEL = ttd.Field(
    sequential=False,
    use_vocab=False,
    is_target  =True
)

dataset = ttd.TabularDataset(
    path = 'thedata.csv',
    format ='csv',
    skip_header=True,
    fields = [
              ('label', LABEL),
              ('data',TEXT)
    ]
)
train_dataset, test_dataset = dataset.split()
TEXT.build_vocab(train_dataset,)
vocab = TEXT.vocab
vocab.stoi

This is my first type of code and in the data, if you see i have used "'ham and eggs or just morning'," in index 1. So after running the code, at last when i run vocab.stoi, I get the following output. The output for the code.


import torchtext.legacy.data as ttd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime


data = {
    'label':[0, 1,1 ],
    'data':[   
            'I like eggs and ham.',
            'Eggs I like!',
'ham and eggs or just  morning',
          
           
    ]
}

df = pd.DataFrame(data)
df.to_csv('thedata.csv', index=False)
TEXT = ttd.Field(
    sequential =True,
    batch_first =True,
    lower = True,
    tokenize ='spacy',
    pad_first = True
)
LABEL = ttd.Field(
    sequential=False,
    use_vocab=False,
    is_target  =True
)

dataset = ttd.TabularDataset(
    path = 'thedata.csv',
    format ='csv',
    skip_header=True,
    fields = [
              ('label', LABEL),
              ('data',TEXT)
    ]
)
train_dataset, test_dataset = dataset.split()
TEXT.build_vocab(train_dataset,)
vocab = TEXT.vocab
vocab.stoi

Now In the second code, I have change the index of data "'ham and eggs or just morning'," in third index, now if I run the code then I get different output for vocab.stoi output for the second code. I want to know the reason for this and how vocab_build works in PyTorch. Plus, this is my first question, if the question is not clear please let me know.

python - Why is the result different for same dataset in torchtext.legecy.text when i change the position of data in the csv file?

1 に答える 1

Related

Reference