hadoop - Interpreting the output of mahout clusterdumper

Question

I ran a clustering test on crawled pages (more than 25,000 documents; a personal data set). I then ran clusterdump:

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt

After running the cluster dumper, the output shows 25 elements of the form "VL-xxxxx {}":

VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] ] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

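How do I interpret this output?

In short, I am looking for the IDs of the documents that belong to a particular cluster.

What is the meaning of:

VL-x?
n=y
c=[z:z', ...]
r=[z'':z''', ...]

Does 0:0.017 mean that "0" is the ID of a document belonging to this cluster?

I have already read on the mahout wiki pages what CL, n, c and r mean. But could someone explain them to me better, or point me to a resource where they are explained in a bit more detail?

Sorry if this is a silly question; I am a newbie with Apache mahout and am using it as part of a course assignment on clustering.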

score 4 · Accepted Answer

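By default, kmeans clustering uses WeightedVector, which does not include the data point name. So you will want to create the sequence files yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the number of mapping tasks, so if your mapping capacity is 12, you should chop your data into 12 pieces when creating the NamedVector seq files: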
```
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]);
```
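The "chop your data into 12 pieces" step could be sketched roughly as follows. This covers only the partitioning; writing each piece out as a sequence file of NamedVector values is not shown. `RecordPartitioner` and `split` are hypothetical names made up for this illustration, not Mahout API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mahout): distributes records round-robin
// over a fixed number of parts, one part per intended sequence file.
final class RecordPartitioner {

    static <T> List<List<T>> split(List<T> records, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            result.add(new ArrayList<>());
        }
        // Record i goes to part (i mod parts), so part sizes differ by at most one.
        for (int i = 0; i < records.size(); i++) {
            result.get(i % parts).add(records.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add("doc" + i); // made-up document names
        }
        List<List<String>> pieces = split(docs, 12);
        for (int i = 0; i < pieces.size(); i++) {
            System.out.println("piece " + i + ": " + pieces.get(i));
        }
    }
}
```

Each resulting piece would then become one seq file, so that the number of files matches the mapping capacity.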

Basically, you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code I wrote to output the cluster membership of each point:

```
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;

public class ClusterOutput {

    /**
     * @param args args[0]: local folder holding the downloaded clusteredPoints part-m-* files,
     *             args[1]: output file for "vector name TAB cluster id" lines,
     *             args[2]: output file for "cluster id SPACE number of points" lines
     */
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            File pointsFolder = new File(args[0]);
            File[] files = pointsFolder.listFiles();
            if (files == null) {
                System.err.println("Not a readable folder: " + args[0]);
                return;
            }
            // Number of points seen per cluster id.
            Map<String, Integer> clusterIds = new HashMap<String, Integer>(5000);

            // Pass 1: write one "name TAB clusterId" line per clustered point.
            BufferedWriter bw = new BufferedWriter(new FileWriter(new File(args[1])));
            for (File file : files) {
                if (file.getName().indexOf("part-m") < 0) {
                    continue; // skip _SUCCESS, .crc and other non-data files
                }
                SequenceFile.Reader reader =
                        new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                IntWritable key = new IntWritable(); // cluster id
                WeightedVectorWritable value = new WeightedVectorWritable();
                while (reader.next(key, value)) {
                    // The cast only works if the input vectors were written as NamedVector.
                    NamedVector vector = (NamedVector) value.getVector();
                    String clusterId = key.toString();
                    bw.write(vector.getName() + "\t" + clusterId + "\n");
                    Integer count = clusterIds.get(clusterId);
                    clusterIds.put(clusterId, count == null ? 1 : count + 1);
                }
                reader.close();
            }
            bw.close();

            // Pass 2: write the size of each cluster.
            bw = new BufferedWriter(new FileWriter(new File(args[2])));
            for (Map.Entry<String, Integer> entry : clusterIds.entrySet()) {
                bw.write(entry.getKey() + " " + entry.getValue() + "\n");
            }
            bw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
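To get from that output to "which documents are in cluster X", the tab-separated membership file (args[1] above) can be grouped by cluster id. The following is only a minimal sketch in plain Java; `ClusterMembers` and `groupByCluster` are hypothetical names, and it assumes each line has the form "vector name TAB cluster id" as written by the code above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Mahout): groups document names by cluster id.
final class ClusterMembers {

    static Map<String, List<String>> groupByCluster(Iterable<String> lines) {
        Map<String, List<String>> members = new TreeMap<>();
        for (String line : lines) {
            // The cluster id follows the last tab; the name may itself contain tabs.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a "name TAB clusterId" line
            }
            String name = line.substring(0, tab);
            String clusterId = line.substring(tab + 1).trim();
            members.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(name);
        }
        return members;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the membership file produced as args[1] by ClusterOutput
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (Map.Entry<String, List<String>> e : groupByCluster(lines).entrySet()) {
            System.out.println(e.getKey() + " (" + e.getValue().size() + " docs): " + e.getValue());
        }
    }
}
```

The cluster ids printed here are the numeric part of the identifiers in the clusterdump output, so the members of VL-24130 would be listed under 24130.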

score 0 · Accepted Answer

I think you need to read the source code; download it from http://mahout.apache.org. VL-24130 is just the cluster identifier of a converged cluster.
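As far as I understand the format, the remaining fields are n (number of points in the cluster), c (the cluster center) and r (the radius), and the pairs inside c and r are "vector dimension index:value", so 0:0.017 refers to dimension 0 of the feature vector and not to a document ID. A rough sketch of pulling these fields out of one line is below; `ClusterDumpLine` is a hypothetical helper, not Mahout API, and it assumes the sparse index:value layout shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mahout): splits one clusterdump line into its fields.
final class ClusterDumpLine {
    // Matches lines such as "VL-24130{n=1312 c=[0:0.017, ...] r=[0:0.442, ...]}".
    private static final Pattern LINE = Pattern.compile(
            "^(VL|CL)-(\\d+)\\{n=(\\d+)\\s+c=\\[(.*?)\\]\\s+r=\\[(.*?)\\]\\}$");

    final String prefix;               // "VL" (converged) or "CL"
    final int id;                      // cluster identifier
    final int numPoints;               // n
    final Map<Integer, Double> center; // c: dimension index -> value
    final Map<Integer, Double> radius; // r: dimension index -> value

    private ClusterDumpLine(String prefix, int id, int numPoints,
                            Map<Integer, Double> center, Map<Integer, Double> radius) {
        this.prefix = prefix;
        this.id = id;
        this.numPoints = numPoints;
        this.center = center;
        this.radius = radius;
    }

    static ClusterDumpLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a clusterdump line: " + line);
        }
        return new ClusterDumpLine(m.group(1), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), parseSparse(m.group(4)), parseSparse(m.group(5)));
    }

    // Parses "0:0.017, 10:0.007, ..." and skips tokens that are not index:value pairs.
    private static Map<Integer, Double> parseSparse(String body) {
        Map<Integer, Double> entries = new LinkedHashMap<>();
        for (String token : body.split(",")) {
            String[] pair = token.trim().split(":");
            if (pair.length != 2) {
                continue; // e.g. an elided "..." entry
            }
            entries.put(Integer.parseInt(pair[0].trim()), Double.parseDouble(pair[1].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        ClusterDumpLine c =
                parse("VL-24130{n=1312 c=[0:0.017, 10:0.007] r=[0:0.442, 10:0.271]}");
        System.out.println(c.prefix + "-" + c.id + ": " + c.numPoints
                + " points, center value in dimension 10 = " + c.center.get(10));
    }
}
```

Note that none of these fields lists the member documents; for that you still need the clusteredPoints output.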

score -1 · Accepted Answer


You can use mahout clusterdump: https://cwiki.apache.org/MAHOUT/cluster-dumper.html

answered 2013-02-04T12:24:50.613
