java - Folding in (estimating topics for new documents) in LDA using Mallet in Java

Question

I'm using Mallet through Java, and I can't work out how to evaluate new documents against an existing topic model which I have trained.

My initial code to generate my model is very similar to that in the Mallet Developer's Guide for Topic Modelling, after which I simply save the model as a Java object. In a later process, I reload that Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original training set.

This stats.SE thread provides some high-level suggestions, but I can't see how to work them into the Mallet framework.

Any help much appreciated.

score 5 · Accepted Answer

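Inference is actually also shown in the example linked in the question (the last few lines).

For anyone interested in the whole code for saving and loading the trained model and then using it to infer the topic distribution of new documents, here are some snippets.

After model.estimate() completes, you have the actual trained model, so you can serialize it with a standard Java ObjectOutputStream (since ParallelTopicModel implements Serializable):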

// needs java.io.FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream
// try-with-resources closes the stream even if writeObject fails
try (ObjectOutputStream oos =
         new ObjectOutputStream(new FileOutputStream("model.ser"))) {
    oos.writeObject(model);
} catch (FileNotFoundException ex) {
    // handle this error
} catch (IOException ex) {
    // handle this error
}

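Note, though, that when you infer, you also need to pass the new sentences (as Instance objects) through the same pipeline in order to pre-process them (tokenize, etc.). You therefore also need to save the pipe list (since we are using SerialPipes, we can create an instance and then serialize it):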

// initialize the pipe list (the same one used in model training)
SerialPipes pipes = new SerialPipes(pipeList);

// try-with-resources closes the stream even if writeObject fails
try (ObjectOutputStream oos =
         new ObjectOutputStream(new FileOutputStream("pipes.ser"))) {
    oos.writeObject(pipes);
} catch (FileNotFoundException ex) {
    // handle error
} catch (IOException ex) {
    // handle error
}
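
The two save blocks above follow the same pattern, so they could be folded into one small helper. The following is a minimal sketch using only the Java standard library; the class name SerializationUtil and its methods are hypothetical, not part of Mallet.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Mallet): saves and loads any Serializable object.
final class SerializationUtil {

    private SerializationUtil() {}

    // Writes the object to the file; try-with-resources closes the streams even on failure.
    public static void save(Serializable object, Path file) throws IOException {
        try (OutputStream out = Files.newOutputStream(file);
             ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(object);
        }
    }

    // Reads an object back and checks that it has the expected type.
    public static <T> T load(Path file, Class<T> type)
            throws IOException, ClassNotFoundException {
        try (InputStream in = Files.newInputStream(file);
             ObjectInputStream ois = new ObjectInputStream(in)) {
            return type.cast(ois.readObject());
        }
    }
}
```

With this in place, saving would look like SerializationUtil.save(model, Paths.get("model.ser")) and loading like SerializationUtil.load(Paths.get("model.ser"), ParallelTopicModel.class), assuming the Mallet classes are on the classpath.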

To load the model and pipeline and use them for inference, we need to deserialize them:

// imports (top of the enclosing class file)
import cc.mallet.pipe.SerialPipes;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

private static void InferByModel(String sentence) {
    // define model and pipeline
    ParallelTopicModel model = null;
    SerialPipes pipes = null;

    // load the model (try-with-resources closes the stream)
    try (ObjectInputStream ois =
             new ObjectInputStream(new FileInputStream("model.ser"))) {
        model = (ParallelTopicModel) ois.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read model from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the model: " + ex);
    }

    // load the pipeline
    try (ObjectInputStream ois =
             new ObjectInputStream(new FileInputStream("pipes.ser"))) {
        pipes = (SerialPipes) ois.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read pipes from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the pipes: " + ex);
    }

    // if both are properly loaded
    if (model != null && pipes != null) {

        // Create a new instance named "test instance" with empty target
        // and source fields. Note that we are using the loaded pipes here.
        InstanceList testing = new InstanceList(pipes);
        testing.addThruPipe(
            new Instance(sentence, null, "test instance", null));

        // here we get an inferencer from our loaded model and use it
        TopicInferencer inferencer = model.getInferencer();
        double[] testProbabilities = inferencer
                   .getSampledDistribution(testing.get(0), 10, 1, 5);

        // print the probability of topic 0 only
        System.out.println("0\t" + testProbabilities[0]);
    }
}
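
The method above prints only the probability of topic 0. The array returned by getSampledDistribution holds one probability per topic, indexed by topic id, so it is often more useful to look at the most probable topics. Here is a minimal standard-library sketch; TopicRanking and topTopics are hypothetical names, not Mallet API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not part of Mallet): ranks topic ids by probability.
final class TopicRanking {

    private TopicRanking() {}

    // Returns the ids of the n most probable topics, most probable first.
    public static List<Integer> topTopics(double[] probabilities, int n) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < probabilities.length; i++) {
            indices.add(i);
        }
        // Sort topic ids by descending probability.
        indices.sort(
            Comparator.comparingDouble((Integer i) -> probabilities[i]).reversed());
        return new ArrayList<>(indices.subList(0, Math.min(n, indices.size())));
    }
}
```

For example, TopicRanking.topTopics(testProbabilities, 3) would give the ids of the three topics that best describe the new sentence.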

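For some reason I am not getting exactly the same inference with the loaded model as with the original one, but that is a matter for another question (if anyone knows why, I'd be happy to hear).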

score 3 · Accepted Answer

And I've found the answer hidden in a slide deck from Mallet's lead developer:

TopicInferencer inferencer = model.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(newInstance, 100, 10, 10);
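
The three integer arguments to getSampledDistribution are, in order, the number of Gibbs sampling iterations, the thinning interval, and the burn-in. To illustrate what they control, here is a deliberately simplified fold-in sampler in plain Java. It is a sketch, not Mallet's actual implementation: the class FoldInSketch, the fixed topic-word matrix phi, the symmetric prior alpha, and the seed parameter are all assumptions made for the example, and thinning is assumed to be at least 1.

```java
import java.util.Random;

// Simplified fold-in sketch; NOT Mallet's actual implementation.
final class FoldInSketch {

    private FoldInSketch() {}

    // phi[k][w]: fixed probability of word w under topic k (from a trained model).
    // doc: word ids of the new document. alpha: symmetric document-topic prior (> 0).
    public static double[] infer(double[][] phi, int[] doc, double alpha,
                                 int numIterations, int thinning, int burnIn, long seed) {
        int numTopics = phi.length;
        Random random = new Random(seed);
        int[] assignment = new int[doc.length];
        int[] topicCounts = new int[numTopics];
        double[] result = new double[numTopics];

        // Start from a random topic for every token.
        for (int i = 0; i < doc.length; i++) {
            assignment[i] = random.nextInt(numTopics);
            topicCounts[assignment[i]]++;
        }

        int samples = 0;
        double[] weights = new double[numTopics];
        for (int iter = 1; iter <= numIterations; iter++) {
            for (int i = 0; i < doc.length; i++) {
                topicCounts[assignment[i]]--; // remove the token's current topic
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    // The topic-word part stays fixed; only document counts change.
                    weights[k] = (topicCounts[k] + alpha) * phi[k][doc[i]];
                    total += weights[k];
                }
                // Draw a new topic in proportion to the weights.
                double u = random.nextDouble() * total;
                int k = 0;
                while (k < numTopics - 1 && u >= weights[k]) {
                    u -= weights[k];
                    k++;
                }
                assignment[i] = k;
                topicCounts[k]++;
            }
            // After burn-in, keep every thinning-th state.
            if (iter > burnIn && (iter - burnIn) % thinning == 0) {
                for (int k = 0; k < numTopics; k++) {
                    result[k] += topicCounts[k] + alpha;
                }
                samples++;
            }
        }
        if (samples == 0) { // nothing was kept: fall back to the final state
            for (int k = 0; k < numTopics; k++) {
                result[k] = topicCounts[k] + alpha;
            }
        }
        double sum = 0.0;
        for (double value : result) {
            sum += value;
        }
        for (int k = 0; k < numTopics; k++) {
            result[k] /= sum; // normalize to a probability distribution
        }
        return result;
    }
}
```

Because the estimate comes from a random sampler, two runs generally differ slightly unless the random seed is fixed, and more iterations tend to give a more stable result. That may be one reason a reloaded model does not reproduce earlier numbers exactly. Mallet's TopicInferencer appears to offer setRandomSeed(int) for this purpose, but check that against the version you are using.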
