weka - Same instance header (arff) for all database queries

Question

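I am building my Instances from SQL queries with InstanceQuery. However, as is normal in SQL, the query results do not always come back in the same order. Because of this, Instances built from different SQL queries have different headers. A simple example is shown below. I suspect this behavior is changing my results.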

Header 1

@attribute duration numeric
@attribute protocol_type {tcp,udp}
@attribute service {http,domain_u}
@attribute flag {SF}

Header 2

@attribute duration numeric
@attribute protocol_type {tcp}
@attribute service {pm_dump,pop_2,pop_3}
@attribute flag {SF,S0,SH}

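My question is: how can I supply the correct header information when the Instances are constructed?

Is something like the following workflow possible?

Get pre-prepared header information from an arff file or some other place.
Give this header information to the instance construction.
Call the SQL function and get the Instances (header + data).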
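
For the first step, one way to prepare such a header would be to take the union of the nominal labels seen across queries. The following is only a minimal sketch in plain Java with no Weka calls; the class name HeaderUnion and both method names are hypothetical, and it assumes the labels need no ARFF quoting (no spaces or commas).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HeaderUnion {

    /** Merges several label lists in first-seen order, dropping duplicates. */
    public static Set<String> unionLabels(List<List<String>> labelLists) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> labels : labelLists) {
            merged.addAll(labels);
        }
        return merged;
    }

    /** Formats one nominal ARFF attribute declaration (labels assumed to need no quoting). */
    public static String nominalAttributeLine(String name, Set<String> labels) {
        return "@attribute " + name + " {" + String.join(",", labels) + "}";
    }

    public static void main(String[] args) {
        Set<String> protocol = unionLabels(Arrays.asList(
                Arrays.asList("tcp", "udp"),   // labels from header 1
                Arrays.asList("tcp")));        // labels from header 2
        // prints: @attribute protocol_type {tcp,udp}
        System.out.println(nominalAttributeLine("protocol_type", protocol));
    }
}
```

A header built this way only covers labels that have already been seen, so a label that first appears in a later query would still be missing.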

I am using the following function to get the Instances from the database.

// username and password are assumed to be static fields defined elsewhere in this class
public static Instances getInstanceDataFromDatabase(String pSql,
                                                    String pInstanceRelationName) {
    try {
        InstanceQuery query = new InstanceQuery();

        query.setUsername(username);
        query.setPassword(password);
        query.setQuery(pSql);

        Instances data = query.retrieveInstances();
        data.setRelationName(pInstanceRelationName);

        // default to the last attribute as the class if none is set
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1);
        }
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

score 0 · Accepted Answer

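I solved a similar problem with the Add filter, which lets you add attributes to Instances. You need to add the correct Attribute, with the proper list of values, to both datasets (in my case, to the test dataset only):

Load the training and test data: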

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

/* "train" contains labels and data */
/* "test" contains data only */
CSVLoader csvLoader = new CSVLoader();
csvLoader.setFile(new File(trainFile));
Instances training = csvLoader.getDataSet();
csvLoader.reset();
csvLoader.setFile(new File(predictFile));
Instances test = csvLoader.getDataSet();

Set a new attribute with the Add filter:

import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Add;

Add add = new Add();
/* the name of the attribute must be the same as in "train" */
add.setAttributeName(training.attribute(0).name());
/* getValues (helper not shown) returns a String with the comma-separated values of the attribute */
add.setNominalLabels(getValues(training.attribute(0)));
/* put the new attribute in the 1st position, the same as in "train" */
add.setAttributeIndex("1");
add.setInputFormat(test);
/* result: a dataset compatible with "train" */
test = Filter.useFilter(test, add);

As a result, the headers of both "train" and "test" are the same (compatible for Weka machine learning).
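
The getValues helper is not shown above. A minimal sketch of what it might look like follows, written against java.util.Enumeration so that it runs without Weka on the classpath. The class and method names are hypothetical. With Weka you would presumably pass attribute.enumerateValues(), which as far as I know returns null for non-nominal attributes. The sketch assumes the labels contain no commas.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Enumeration;

public class NominalLabels {

    /** Joins the enumerated labels with commas; returns "" when given null. */
    public static String joinLabels(Enumeration<?> values) {
        if (values == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        while (values.hasMoreElements()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(values.nextElement());
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // stand-in for attribute.enumerateValues() on a nominal attribute
        Enumeration<String> labels =
                Collections.enumeration(Arrays.asList("tcp", "udp"));
        System.out.println(joinLabels(labels)); // prints: tcp,udp
    }
}
```

The returned string has the comma-separated form that Add.setNominalLabels expects.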

score 0 · Accepted Answer

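I tried various approaches to my problem, but it seems that Weka's internal API cannot solve this issue at the moment. I modified the weka.core.Instances append command-line code for my purposes. This code is also given in this answer.

Based on that, here is my solution. I created a SampleWithKnownHeader.arff file that contains the correct header values. I read this file with the following code.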

public static Instances getSampleInstances() {
    // try-with-resources closes the reader even if parsing fails
    try (BufferedReader reader = new BufferedReader(new FileReader(
            "datas\\SampleWithKnownHeader.arff"))) {
        Instances data = new Instances(reader);
        // use the last attribute as the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

After that, I create the instances with the following code. I had to use a StringBuilder and the string values of the instances, and then saved the resulting string to a file.

public static void main(String[] args) {

    // known-good header, loaded from SampleWithKnownHeader.arff
    Instances sampleInstances = MyUtilsForWeka.getSampleInstances();
    DataSource source1 = new DataSource(sampleInstances);

    // note: called with one argument here, unlike the two-argument version in the question
    Instances data2 = InstancesFromDatabase
            .getInstanceDataFromDatabase(DatabaseQueries.WEKALIST_QUESTION1);

    // saveInstancesToFile takes a String, so pass the ARFF text of the instances
    MyUtilsForWeka.saveInstancesToFile(data2.toString(), "fromDatabase.arff");

    DataSource source2 = new DataSource(data2);

    StringBuilder sb = new StringBuilder();
    try {
        // header (attribute declarations) taken from the known sample file
        Instances structure1 = source1.getStructure();
        sb.append(structure1);
        // data rows taken from the database result, one per line
        Instances structure2 = source2.getStructure();
        while (source2.hasMoreElements(structure2)) {
            sb.append(source2.nextElement(structure2).toString());
            sb.append("\n");
        }
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }

    MyUtilsForWeka.saveInstancesToFile(sb.toString(), "combined.arff");
}

The code that saves the instances to a file is as follows.

public static void saveInstancesToFile(String contents, String filename) {
    // try-with-resources closes the writer even if the write fails
    try (BufferedWriter out = new BufferedWriter(new FileWriter(filename))) {
        out.write(contents);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}

This solves my problem, but I wonder whether a more elegant solution exists.
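
The same header swap can also be done purely on ARFF text, without DataSource. This is only a sketch of the idea in plain Java; ArffHeaderSwap and its method names are hypothetical. It assumes both inputs contain an @data line and that every value in the queried rows is declared in the known header.

```java
public class ArffHeaderSwap {

    /** Returns the index of the "@data" line (case-insensitive), or -1 if absent. */
    static int dataLineIndex(String[] lines) {
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].trim().equalsIgnoreCase("@data")) {
                return i;
            }
        }
        return -1;
    }

    /** Combines the header of knownArff (through @data) with the data rows of queriedArff. */
    public static String withKnownHeader(String knownArff, String queriedArff) {
        String[] known = knownArff.split("\\R");
        String[] queried = queriedArff.split("\\R");
        int knownData = dataLineIndex(known);
        int queriedData = dataLineIndex(queried);
        if (knownData < 0 || queriedData < 0) {
            throw new IllegalArgumentException("both inputs need an @data line");
        }
        StringBuilder sb = new StringBuilder();
        // copy the known header, up to and including its @data line
        for (int i = 0; i <= knownData; i++) {
            sb.append(known[i]).append('\n');
        }
        // copy only the non-blank data rows that follow @data in the queried text
        for (int i = queriedData + 1; i < queried.length; i++) {
            if (!queried[i].trim().isEmpty()) {
                sb.append(queried[i]).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String known = "@relation known\n@attribute flag {SF,S0,SH}\n@data\nS0\n";
        String queried = "@relation fromDb\n@attribute flag {SF}\n@data\nSF\nSF\n";
        System.out.print(withKnownHeader(known, queried));
    }
}
```

The sample rows in the known file are dropped; only its header is kept, and the rows come from the queried text.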
