hadoop - How to collect three arguments in mapper output. Is there a way?

Question

I'm new to the concepts of MapReduce and Hadoop. Please help.

I have about 100 files containing data in this format:

conf/iceis/GochenouerT01a:::John E. Gochenouer::Michael L. Tyler:::Voyeurism, Exhibitionism, and Privacy on the Internet.

This is supposed to be done with a MapReduce algorithm. The output I want to display is:

John E. Gochenoue Voyeurism .
John E. Gochenoue Exhibitionism 
John E. Gochenoue and 
John E. Gochenoue privacy
John E. Gochenoue on
John E. Gochenoue the
John E. Gochenoue internet   
Michael L. Tyler   Voyeurism .
Michael L. Tyler   Exhibitionism 
Michael L. Tyler   and 
Michael L. Tyler   privacy
Michael L. Tyler   on
Michael L. Tyler   the
Michael L. Tyler   internet

That is the output for a single line. There are 'n' such lines, each containing many names and many book titles.

So consider one document of 110 lines. Can I get mapper output like this?

John E. Gochenoue Voyeurism    1  
John E. Gochenoue Exhibitionism 3 
Michael L. Tyler   on           7

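I.e., the name and the word, followed by the number of times the word occurs in the document. Then, after the reduce step, the name, each word associated with that name, and the combined frequency of that word across the 'n' documents.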

I'm familiar with output.collect(), but it takes only two arguments:

output.collect(arg0, arg1)

Is there a way to collect three values, such as the name, the word, and the word's occurrence count?

Below is my code:

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        /*
         * StringTokenizer tokenizer = new StringTokenizer(line); while
         * (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken());
         * output.collect(word, one);
         */

        String strToSplit[] = line.split(":::");
        String end = strToSplit[strToSplit.length - 1];
        String[] names = strToSplit[1].split("::");
        for (String name : names) {
            StringTokenizer tokens = new StringTokenizer(end, " ");
            while (tokens.hasMoreElements()) {
                output.collect(arg0, arg1); // placeholder: what to pass here is the question
                System.out.println(tokens.nextElement());
            }
        }

    }
}

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(example.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, "/home/vishal/workspace/hw3data");
    FileOutputFormat.setOutputPath(conf,
            new Path("/home/vishal/nmnmnmnmnm"));

    JobClient.runJob(conf);
}

score 2 · Accepted Answer

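The trick is to create a Text (one of the Hadoop Writable implementations) whose string content is several values separated by tabs. That makes it easy to pass complex values between the mapper and the reducer.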
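As a minimal sketch of the tab-separated idea, the plain Java below packs three fields into one string and splits them apart again. The class and method names (`TabPackedValue`, `pack`, `unpack`) are hypothetical and not part of Hadoop; in a real job the packed string would be wrapped with `new Text(...)` in the mapper and recovered with `toString()` in the reducer.

```java
// Hypothetical helper, not part of Hadoop: packs fields into one
// tab-separated string that could be wrapped in a Text value.
class TabPackedValue {

    // Join the fields with tabs, e.g. name, word, count.
    static String pack(String... fields) {
        return String.join("\t", fields);
    }

    // Split the packed string back into its fields.
    // The -1 limit keeps trailing empty fields instead of dropping them.
    static String[] unpack(String packed) {
        return packed.split("\t", -1);
    }

    public static void main(String[] args) {
        String packed = pack("John E. Gochenouer", "Voyeurism", "1");
        String[] fields = unpack(packed);
        int count = Integer.parseInt(fields[2]);
        System.out.println(fields[0] + " | " + fields[1] + " | " + count);
    }
}
```

This only works if the fields themselves never contain a tab, so the delimiter should be chosen with the data in mind.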

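Of course, the more industrial-strength approach is to write your own Writable. A Writable is basically a POJO with special serialization/deserialization behavior. In this case, your Writable would have three properties.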
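A custom Writable with three properties might look roughly like the sketch below. The class name `AuthorWordCount` and its fields are assumptions made for illustration. To keep the sketch compilable without Hadoop on the classpath, it does not declare `implements org.apache.hadoop.io.Writable`, but `write(DataOutput)` and `readFields(DataInput)` have the signatures that interface requires, so adding the clause should be the only change needed in a real job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical three-field value type. In a Hadoop job, add
// "implements org.apache.hadoop.io.Writable" to the declaration.
class AuthorWordCount {
    String author = "";
    String word = "";
    int count;

    // Hadoop creates Writables reflectively, so a no-arg constructor is needed.
    AuthorWordCount() {}

    AuthorWordCount(String author, String word, int count) {
        this.author = author;
        this.word = word;
        this.count = count;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(author);
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        author = in.readUTF();
        word = in.readUTF();
        count = in.readInt();
    }

    @Override
    public String toString() {
        return author + "\t" + word + "\t" + count;
    }

    // Demo helper: serialize to bytes and read back into a fresh instance.
    static AuthorWordCount roundTrip(AuthorWordCount original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            AuthorWordCount copy = new AuthorWordCount();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new AuthorWordCount("Michael L. Tyler", "on", 7)));
    }
}
```

If such a type were used as a map output key rather than a value, Hadoop would also expect it to be comparable (`WritableComparable`), since keys are sorted before the reduce step.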

score 0 · Accepted Answer

In your mapper class, when you emit the tokenized strings, what you basically need is for identical keys to be grouped together when counting.

That is, to count how many times a person used a word, you need to emit a key of the form John Smith<delimiter>Word. The delimiter can be anything. Most people use a tab so that the final reducer output stays TSV.

So, to fix your output.collect statement, change it to:

output.collect(new Text(name + "\t" + tokens.nextElement()), new IntWritable(1));
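To see what the composite key does, here is a small simulation in plain Java with no Hadoop dependency. It parses lines the same way as the mapper in the question, builds `name + "\t" + word` keys, and sums a 1 for each occurrence, which stands in for what the shuffle and the summing reducer would do. The class name `CompositeKeyCount` and the `count` method are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical simulation of the composite-key approach, without Hadoop.
class CompositeKeyCount {

    // For every author and every title word, "emit" (author \t word, 1)
    // and add it into a running total per key.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(":::");
            if (parts.length < 3) {
                continue; // skip lines that do not match id:::authors:::title
            }
            String title = parts[parts.length - 1];
            for (String name : parts[1].split("::")) {
                StringTokenizer tokens = new StringTokenizer(title, " ");
                while (tokens.hasMoreTokens()) {
                    String key = name + "\t" + tokens.nextToken();
                    totals.merge(key, 1, Integer::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String line = "conf/iceis/GochenouerT01a:::John E. Gochenouer::Michael L. Tyler"
                + ":::Voyeurism, Exhibitionism, and Privacy on the Internet.";
        // The same line twice, to show counts accumulating per (author, word).
        for (Map.Entry<String, Integer> e : count(Arrays.asList(line, line)).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Note that splitting on spaces leaves punctuation attached, so "Voyeurism," and "Internet." are counted as written. Stripping punctuation or lowercasing the words before building the key would be a separate step.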
