hadoop - Further processing after the reducer

Question

This is probably a very naive question. I have two documents. I want to find the overlap between them in a map and then compare the overlap (let's assume I have some measure for doing that).

So this is what I am thinking:

1) Run the normal wordcount job on one document (https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-times-a-word-appeared-in-a-file-using-map-reduce-framework)
2) But rather than saving a file, save everything in a HashMap(word,true)
3) Pass that HashMap along to the second wordcount mapreduce program, and then, as I am processing the second document, check each word against the HashMap to find out whether the word is present or not.

So something like this:

 1) HashMap<String, Boolean> hm = runStepOne(); <-- map reduce job
 2) runStepTwo(hm)

How do I do this in Hadoop?

score 3 · Accepted Answer

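It sounds like you could use some form of DistributedCache to store the intermediate results after the first word count job, and then run another job that uses those intermediate results to test whether each word occurs in the second document. You may be able to encapsulate both of these steps in a single MR job, but I'm not sure how off the top of my head.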
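
A minimal sketch of the two-step idea, written as plain Java with no Hadoop classes so that it runs on its own. The class and method names are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption. In a real setup, the set built in step one would be the first job's output, shipped to the second job through the distributed cache and loaded once per mapper in `setup()`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java simulation of the two-step approach; not an actual Hadoop job.
class TwoStepOverlapSketch {

    // Step one: stands in for the output of the first word count job.
    // Only membership matters, so a Set replaces HashMap<String, Boolean>.
    static Set<String> runStepOne(String firstDocument) {
        Set<String> words = new HashSet<>();
        for (String token : firstDocument.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Step two: stands in for the second job's mapper, which would check
    // each word of the second document against the cached set.
    static Set<String> runStepTwo(Set<String> firstDocumentWords, String secondDocument) {
        Set<String> overlap = new TreeSet<>();
        for (String token : secondDocument.toLowerCase().split("\\s+")) {
            if (firstDocumentWords.contains(token)) {
                overlap.add(token);
            }
        }
        return overlap;
    }

    public static void main(String[] args) {
        Set<String> seen = runStepOne("the quick brown fox");
        System.out.println(runStepTwo(seen, "the lazy brown dog")); // prints [brown, the]
    }
}
```

This only works if the distinct words of the first document fit in each mapper's memory. Otherwise a join done in the reducers is the safer route.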

score 1 · Accepted Answer

Check out section 3.5 of Data-Intensive Text Processing with MapReduce for how to do joins. The same text also covers other MR algorithms.
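
One common way to express the overlap as a join in a single job is a reduce-side join: each mapper tags every word with the document it came from, and the reducer keeps the words that carry both tags. The sketch below simulates that flow in plain Java. It is not taken from the referenced text, and the class name, the `doc1`/`doc2` tags, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; not an actual Hadoop job.
class ReduceSideJoinSketch {

    // "Map" phase: emit (word, documentTag) pairs. The shared map plays the
    // role of the shuffle, grouping all tags emitted for the same word.
    static void map(String documentTag, String text, Map<String, Set<String>> shuffle) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                shuffle.computeIfAbsent(token, k -> new HashSet<>()).add(documentTag);
            }
        }
    }

    // "Reduce" phase: a word belongs to the overlap if both tags reached its group.
    static List<String> reduce(Map<String, Set<String>> shuffle) {
        List<String> overlap = new ArrayList<>();
        for (Map.Entry<String, Set<String>> group : shuffle.entrySet()) {
            Set<String> tags = group.getValue();
            if (tags.contains("doc1") && tags.contains("doc2")) {
                overlap.add(group.getKey());
            }
        }
        return overlap;
    }

    static List<String> overlap(String firstDocument, String secondDocument) {
        Map<String, Set<String>> shuffle = new TreeMap<>(); // sorted keys, like reducer input
        map("doc1", firstDocument, shuffle);
        map("doc2", secondDocument, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        System.out.println(overlap("the quick brown fox", "the lazy brown dog")); // prints [brown, the]
    }
}
```

Unlike the cached-set approach, nothing here has to fit in a single mapper's memory, at the cost of shuffling every word of both documents to the reducers.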
