I'm trying to build a model from a training data set and use it to label the records in a test data set.

All the tutorials and help I find online cover only cross-validation on a single data set, i.e., the training data set. I couldn't find out how to use separate test data. I tried applying the resulting model to the test set, but after pre-processing the test set seems to have a different number of attributes than the training set. This is a text classification problem.

At the end I get output like this:

18.03.2013 01:47:00 Results of ResultWriter 'Write as Text (2)' [1]: 
18.03.2013 01:47:00 SimpleExampleSet:
5275 examples,
366 regular attributes,
special attributes = {
confidence_1 = #367: confidence(1) (real/single_value)
confidence_5 = #368: confidence(5) (real/single_value)
confidence_2 = #369: confidence(2) (real/single_value)
confidence_4 = #370: confidence(4) (real/single_value)
prediction = #366: prediction(label) (nominal/single_value)/values=[1, 5, 2, 4]
}

But what I wanted is for all my examples to be labelled.

It seems that my test data and training data have different numbers of attributes; I see many warnings like the following in the logs:

Mar 18, 2013 1:46:41 AM WARNING: Kernel Model: The given example set does not contain a regular attribute with name 'wireless'. This might cause problems for some models depending on this particular attribute.

But how do we solve this problem in text classification, where we cannot know the number and names of the attributes beforehand?

Can someone please give me some pointers?

1 Answer

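Presumably you use Process Documents operators to pre-process both the training set and the test set. It is important that both of these operators are configured identically. To "synchronize" the word lists, i.e., to consider the same set of words in both, you have to connect the word list (wor) output of the Process Documents operator used for training to the corresponding input port of the Process Documents operator used to pre-process the test set.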
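RapidMiner processes are wired up in the GUI, so the following is not RapidMiner code. It is only a minimal plain-Python sketch of the same idea: the word list is built from the training documents alone and then reused for the test documents, so both sets end up with identical attributes. The function names (`tokenize`, `build_word_list`, `vectorize`) and the sample documents are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_word_list(documents):
    """Collect the sorted vocabulary from the training documents only."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(documents, word_list):
    """Count each word of word_list per document; other words are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # A Counter returns 0 for missing words, so every row has
        # exactly len(word_list) columns, in the same order.
        rows.append([counts[word] for word in word_list])
    return rows


# Hypothetical sample data.
train_docs = ["wireless router setup", "router firmware update"]
test_docs = ["new firmware for the printer"]

# The word list comes from the training set and is reused for the test set.
word_list = build_word_list(train_docs)
train_matrix = vectorize(train_docs, word_list)
test_matrix = vectorize(test_docs, word_list)
```

Because the test documents are vectorized against the training word list, an attribute such as 'wireless' still exists in the test set (with a count of 0) even if the word never occurs there, and words that appear only in the test set are dropped.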

answered 2013-03-19T13:50:10.197