hive - Reading Hadoop SequenceFiles with Hive

Question

I have mapped data from Common Crawl that I saved in SequenceFile format. I have tried repeatedly to use this data "as is" with Hive, so that I can query and sample it at various stages. But the job output always shows the following error:

LazySimpleSerDe: expects either BytesWritable or Text object!

I even built a simpler (and smaller) dataset of [Text, LongWritable] records, but that fails as well. If I write the data out in text format and create a table on top of it, it works fine:

hive> create external table page_urls_1346823845675
    >     (pageurl string, xcount bigint) 
    >     location 's3://mybucket/text-parse/1346823845675/';
OK
Time taken: 0.434 seconds
hive> select * from page_urls_1346823845675 limit 10;
OK
http://0-italy.com/tag/package-deals    643    NULL
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9    NULL
http://01fishing.com/fly-fishing-knots/    3437    NULL
http://01fishing.com/flyin-slab-creek/    1005    NULL
...

I tried using a custom input format:

// My custom input class--very simple
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class UrlXCountDataInputFormat extends 
     SequenceFileInputFormat<Text, LongWritable> {  }

Then I create the table like this:

create external table page_urls_1346823845675_seq 
  (pageurl string, xcount bigint) 
  stored as inputformat 'my.package.io.UrlXCountDataInputFormat' 
  outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'  
  location 's3://mybucket/seq-parse/1346823845675/';

But I still get the same SerDe error.

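I'm sure there is something really basic that I'm missing here, but I can't seem to get it right. Additionally, I have to be able to parse the SequenceFiles in place (i.e. I can't convert my data to text), so I need to figure out the SequenceFile approach for future parts of my project.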

Solution: As @mark-grover pointed out below, the problem is that Hive ignores the key by default. With only one column available (just the value), the SerDe was unable to map my second column.

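The solution was to use a custom InputFormat that is a lot more complex than the one I started with. I tracked down one answer, via a link to a Git repository, about using keys instead of values, and then modified it to suit my needs: take the key and the value from an internal SequenceFile.Reader and combine them into the final BytesWritable. That is, something like this (from the custom reader, which is where all the hard work happens):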

// I used generics so I can use this all with 
// other output files with just a small amount
// of additional code ...
public abstract class HiveKeyValueSequenceFileReader<K,V> implements RecordReader<K, BytesWritable> {

    public synchronized boolean next(K key, BytesWritable value) throws IOException {
        if (!more) return false;

        long pos = in.getPosition();
        V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf);
        boolean remaining = in.next((Writable)key, (Writable)trueValue);
        if (remaining) combineKeyValue(key, trueValue, value);
        if (pos >= end && in.syncSeen()) {
          more = false;
        } else {
          more = remaining;
        }
        return more;
    }

    protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue);

}

// from my final implementation
public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text,LongWritable> {
    @Override
    protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) {
        // TODO I think we need to use straight bytes--I'm not sure this works?
        // Row layout: key, the \001 field delimiter, then the count as text.
        StringBuilder builder = new StringBuilder();
        builder.append(key);
        builder.append('\001');
        builder.append(trueValue.get());
        // Encode explicitly as UTF-8 and copy the bytes into the reused value object.
        byte[] row = builder.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8);
        newValue.set(row, 0, row.length);
    }
}

With that, I now get all of my columns:

http://0-italy.com/tag/package-deals    643
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html    9
http://01fishing.com/fly-fishing-knots/ 3437
http://01fishing.com/flyin-slab-creek/  1005
http://01fishing.com/pflueger-1195x-automatic-fly-reels/    1999
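
The key/value combining step can also be tried out without Hadoop on the classpath. The following is only a minimal sketch in plain Java, not the code from the project above: the class name HiveRowSketch and its combine and split methods are hypothetical, and it assumes the table uses Hive's default field delimiter (Ctrl-A, \001) and UTF-8 text. It builds the row directly as bytes, which is what the TODO in combineKeyValue is asking about.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hive or Hadoop.
class HiveRowSketch {
    // Assumption: the table uses Hive's default field delimiter (Ctrl-A).
    static final byte FIELD_DELIMITER = 0x01;

    // Build "key <0x01> value" as raw UTF-8 bytes, one row per record.
    static byte[] combine(String key, long value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = Long.toString(value).getBytes(StandardCharsets.UTF_8);
        byte[] row = new byte[k.length + 1 + v.length];
        System.arraycopy(k, 0, row, 0, k.length);
        row[k.length] = FIELD_DELIMITER;
        System.arraycopy(v, 0, row, k.length + 1, v.length);
        return row;
    }

    // Split a row back into its fields, to check the layout by hand.
    static String[] split(byte[] row) {
        return new String(row, StandardCharsets.UTF_8).split("\u0001", -1);
    }

    public static void main(String[] args) {
        byte[] row = combine("http://0-italy.com/tag/package-deals", 643L);
        for (String field : split(row)) {
            System.out.println(field);
        }
    }
}
```

Inside a real reader, the resulting array would be copied into the reused BytesWritable with set(row, 0, row.length) rather than by wrapping it in a new BytesWritable.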

score 0 · Accepted Answer

I'm not sure whether this is what is affecting you, but Hive ignores keys when reading SequenceFiles. You may need to write a custom InputFormat (unless you can find one online :-)).

See: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%3C5573211B-634D-4BB0-9123-E389D90A786C@metaweb.com%3E
