mapreduce - MapReduce job with mixed data sources: HBase table and HDFS files

Question

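I need to implement an MR job that accesses data from both an HBase table and HDFS files. For example, the mappers read data from the HBase table and from the HDFS files. The two sources share the same primary key but have different schemas. A reducer then joins all the columns (from the HBase table and the HDFS files) together.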

I searched online but could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only with multiple HDFS data sources. Any ideas would be appreciated; sample code would be great.

score 8 · Accepted Answer

After a few days of investigation (and with help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:

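import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MixMR {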

    // Mapper for the HDFS text input: each line is expected to be "key,value".
    public static class Map extends Mapper<Object, Text, Text, Text> {

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] sa = value.toString().split(",");
            // Skip lines that do not have exactly two comma-separated fields.
            if (sa.length == 2) {
                context.write(new Text(sa[0]), new Text(sa[1]));
            }
        }
    }

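    // Mapper for the HBase table: emits (row key, value of column cf:c1).
    public static class TableMap extends TableMapper<Text, Text> {
        public static final byte[] CF = Bytes.toBytes("cf");
        public static final byte[] ATTR1 = Bytes.toBytes("c1");

        @Override
        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String key = Bytes.toString(row.get());
            byte[] cell = value.getValue(CF, ATTR1);
            // getValue returns null when the row has no cf:c1 cell; skip such rows
            // instead of failing with a NullPointerException.
            if (cell != null) {
                context.write(new Text(key), new Text(Bytes.toString(cell)));
            }
        }
    }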


    // Reducer: writes every value collected for a key, from either source,
    // as its own output line.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path inputPath1 = new Path(args[0]);   // HDFS text input
        Path inputPath2 = new Path(args[1]);   // placeholder path for the HBase input (see below)
        Path outputPath = new Path(args[2]);

        String tableName = "test";

        Configuration config = HBaseConfiguration.create();
        // On Hadoop 2+, Job.getInstance replaces the deprecated new Job(config, name).
        Job job = Job.getInstance(config, "ExampleRead");
        job.setJarByClass(MixMR.class);     // class that contains the mappers

        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs
        scan.addFamily(Bytes.toBytes("cf"));

        TableMapReduceUtil.initTableMapperJob(
                tableName,        // input HBase table name
                scan,             // Scan instance to control CF and attribute selection
                TableMap.class,   // mapper
                Text.class,       // mapper output key
                Text.class,       // mapper output value
                job);

        job.setReducerClass(Reduce.class);    // reducer class
        job.setOutputFormatClass(TextOutputFormat.class);

        // Register one mapper per source. The HBase input is read from the table
        // configured above, so inputPath2 has no effect on it and only serves as
        // a placeholder entry for MultipleInputs.
        MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);

        FileOutputFormat.setOutputPath(job, outputPath);

        // Propagate job success or failure as the process exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}
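
The reducer above writes each value on its own line and does not record which source a value came from. One possible way to get the single joined row the question asks for is to have each mapper prefix its value with a source tag and let the reducer merge the tagged values per key. The following is only a minimal sketch of that merge logic in plain Java, without any Hadoop dependencies; the class `TaggedJoin`, the tags `H:` and `F:`, and the tab-separated output layout are assumptions made for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the answer above): merges the tagged values
// that a reducer would receive for one key into a single joined line.
final class TaggedJoin {
    // Assumed source tags that each mapper would prepend to its output value.
    static final String HBASE_TAG = "H:";
    static final String HDFS_TAG = "F:";

    static String tagHBase(String value) { return HBASE_TAG + value; }
    static String tagHdfs(String value) { return HDFS_TAG + value; }

    // Returns "hbaseCols<TAB>hdfsCols", or null when either side is missing
    // (inner-join behaviour). Values without a known tag are ignored.
    static String join(Iterable<String> taggedValues) {
        List<String> hbaseCols = new ArrayList<>();
        List<String> hdfsCols = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith(HBASE_TAG)) {
                hbaseCols.add(v.substring(HBASE_TAG.length()));
            } else if (v.startsWith(HDFS_TAG)) {
                hdfsCols.add(v.substring(HDFS_TAG.length()));
            }
        }
        if (hbaseCols.isEmpty() || hdfsCols.isEmpty()) {
            return null;
        }
        return String.join(",", hbaseCols) + "\t" + String.join(",", hdfsCols);
    }
}
```

In the job itself, TableMap would write `new Text(TaggedJoin.tagHBase(val))`, Map would write `new Text(TaggedJoin.tagHdfs(sa[1]))`, and Reduce would collect the values as strings, call `join`, and write the result only when it is not null. Returning null for one-sided keys is a design choice; an outer join would emit the available side instead.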

score 0 · Answer

You can do this easily with a Pig script or a Hive query.

Sample Pig script

tbl = LOAD 'hbase://SampleTable'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
       'info:* ...', '-loadKey true -limit 5')
       AS (id:bytearray, info_map:map[],...);

fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);

Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...
