java - Hadoop reduce output file is not created for large data

Question

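I am writing a Java application on Hadoop 1.1.1 (Ubuntu). The application compares strings to find the longest common substring. With small data sets, both the map and reduce phases run successfully. Whenever I increase the size of the input, the reduce output never appears in the target output directory. Nothing complains at all, which makes this all the more strange. I am running everything in Eclipse, with one mapper and one reducer.

My reducer finds the longest common substring in a collection of strings, then emits that substring as the key and the index of each string that contains it as the value. Here is a short example.

Input data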

0: ALPHAA

1: ALPHAB

2: ALZHA

Output produced

Key: ALPHA  Value: 0

Key: ALPHA  Value: 1

Key: AL  Value: 0

Key: AL  Value: 1

Key: AL  Value: 2

The first two input strings share "ALPHA" as a common substring, and all three share "AL". I index the substrings and write them to a database once the process completes.
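For readers unfamiliar with the term, here is a minimal brute-force sketch of "longest common substring" applied to the example data. It is an illustration only, not the reducer's actual code: the class name CommonSubstringSketch and the tie-breaking rule (leftmost candidate in the first string) are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class CommonSubstringSketch {

    /**
     * Returns the longest substring of the first string that occurs in every
     * string of the list, or "" if there is none. Candidates are tried from
     * longest to shortest and left to right, so ties go to the leftmost one.
     */
    static String longestCommonSubstring(List<String> strings) {
        if (strings.isEmpty()) {
            return "";
        }
        String first = strings.get(0);
        for (int len = first.length(); len > 0; len--) {
            for (int start = 0; start + len <= first.length(); start++) {
                String candidate = first.substring(start, start + len);
                boolean inAll = true;
                for (String s : strings) {
                    if (!s.contains(candidate)) {
                        inAll = false;
                        break;
                    }
                }
                if (inAll) {
                    return candidate;
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Strings 0 and 1 from the example
        System.out.println(longestCommonSubstring(Arrays.asList("ALPHAA", "ALPHAB")));
        // Strings 0, 1 and 2 from the example
        System.out.println(longestCommonSubstring(Arrays.asList("ALPHAA", "ALPHAB", "ALZHA")));
    }
}
```

Run on the example, this prints ALPHA for the first two strings and AL for all three. Note that "HA" is also shared by all three strings and has the same length as "AL"; the sketch reports "AL" only because it scans the first string from left to right.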

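As an additional observation, I can see that intermediate files are created in the output directory. It is only that the reduced data never makes it into the output file.

I have pasted the Hadoop output log below. It claims there are many output records from the reducer, but they just seem to disappear. Any suggestions would be appreciated.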

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool     for the same.
No job jar file set.  User classes may not be found. See JobConf(Class) or     JobConf#setJar(String).
Total input paths to process : 1
Running job: job_local_0001
setsid exited with exit code 0
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@411fd5a3
Snappy native library not loaded
io.sort.mb = 100
data buffer = 79691776/99614720
record buffer = 262144/327680
 map 0% reduce 0%
Spilling map output: record full = true
bufstart = 0; bufend = 22852573; bufvoid = 99614720
kvstart = 0; kvend = 262144; length = 327680
Finished spill 0
Starting flush of map output
Finished spill 1
Merging 2 sorted segments
Down to the last merge-pass, with 2 segments left of total size: 28981648 bytes

Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

Task attempt_local_0001_m_000000_0 done.
 Using ResourceCalculatorPlugin :     org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3aff2f16

Merging 1 sorted segments
Down to the last merge-pass, with 1 segments left of total size: 28981646 bytes

 map 100% reduce 0%
reduce > reduce
 map 100% reduce 66%
reduce > reduce
 map 100% reduce 67%
reduce > reduce
reduce > reduce
 map 100% reduce 68%
reduce > reduce
reduce > reduce
reduce > reduce
 map 100% reduce 69%
reduce > reduce
reduce > reduce
 map 100% reduce 70%
reduce > reduce
job_local_0001
Job complete: job_local_0001
Counters: 22
  File Output Format Counters 
    Bytes Written=14754916
  FileSystemCounters
    FILE_BYTES_READ=61475617
    HDFS_BYTES_READ=97361881
    FILE_BYTES_WRITTEN=116018418
    HDFS_BYTES_WRITTEN=116746326
  File Input Format Counters 
    Bytes Read=46366176
  Map-Reduce Framework
    Reduce input groups=27774
    Map output materialized bytes=28981650
    Combine output records=0
    Map input records=4629524
    Reduce shuffle bytes=0
    Physical memory (bytes) snapshot=0
    Reduce output records=832559
    Spilled Records=651304
    Map output bytes=28289481
    CPU time spent (ms)=0
    Total committed heap usage (bytes)=2578972672
    Virtual memory (bytes) snapshot=0
    Combine input records=0
    Map output records=325652
    SPLIT_RAW_BYTES=136
    Reduce input records=27774
reduce > reduce
reduce > reduce

score 0 · Accepted Answer

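Put your reduce() and map() logic inside a try-catch block, and have the catch block increment a counter whose group is "Exception" and whose name is the exception message. That lets you quickly see (by checking the counter list) which exceptions were thrown.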
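A minimal sketch of that pattern follows, written without a Hadoop dependency so it can be run on its own. The Counters class is a hypothetical stand-in for the job's counters, and riskyReduce is a made-up placeholder for the real reduce logic. In an actual job the increment would go through Hadoop itself, for example context.getCounter("Exception", name).increment(1) in the org.apache.hadoop.mapreduce API or reporter.incrCounter("Exception", name, 1) in the older org.apache.hadoop.mapred API.

```java
import java.util.Map;
import java.util.TreeMap;

class ExceptionCounterSketch {

    /** Hypothetical stand-in for Hadoop job counters: group -> (name -> count). */
    static final class Counters {
        private final Map<String, Map<String, Long>> groups = new TreeMap<>();

        void increment(String group, String name) {
            groups.computeIfAbsent(group, g -> new TreeMap<>()).merge(name, 1L, Long::sum);
        }

        long get(String group, String name) {
            Map<String, Long> byName = groups.get(group);
            if (byName == null) {
                return 0L;
            }
            return byName.getOrDefault(name, 0L);
        }
    }

    /** Placeholder for the real reduce logic; fails on an empty key. */
    static int riskyReduce(String key, int[] values) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("empty key");
        }
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Wraps the reduce logic and records any exception as a counter. */
    static void safeReduce(String key, int[] values, Counters counters) {
        try {
            riskyReduce(key, values);
        } catch (Exception e) {
            // Fall back to the class name when the exception has no message.
            String name = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
            counters.increment("Exception", name);
        }
    }

    public static void main(String[] args) {
        Counters counters = new Counters();
        safeReduce("ALPHA", new int[] {0, 1}, counters); // succeeds, no counter
        safeReduce("", new int[] {2}, counters);         // fails, counted
        safeReduce("", new int[] {3}, counters);         // fails, counted
        System.out.println("Exception / empty key = " + counters.get("Exception", "empty key"));
    }
}
```

Because the exception is swallowed, the task keeps running and the failures show up only in the counter list printed at the end of the job. Hadoop limits how many distinct counters a job may define, so if messages vary per record it may be safer to use the exception class name as the counter name instead.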
