python - Hadoop and Python: Disable Sorting

Question

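When I run Hadoop with Python code, I noticed that either the mapper or the reducer (I'm not sure which) sorts the output before it is printed by reducer.py. At the moment it appears to be sorted alphanumerically. Is there a way to disable this completely? I would like the program's output to follow the order in which it is printed from mapper.py. I have found answers for Java but none for Python. Do I need to modify mapper.py, or change the command-line arguments?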

score 1 · Accepted Answer

You should read more on basic MapReduce concepts. Even though the sorting may be unnecessary in some cases, the shuffling part of the "Shuffle & Sort" phase is an intrinsic part of the MapReduce model. The MapReduce framework (Hadoop) needs to group the output of the mappers by key so that all values for a given key are sent to one single reducer, which can then actually "reduce" the data. When using streaming, the key and value are, by default, separated by a tab character. From your sample code in other SO questions, I can see that you are not producing "key, value" tuples, but rather just single text lines.
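To illustrate the tab-separated key/value convention described above, here is a minimal sketch of the mapper logic. It assumes, purely for illustration, that the first whitespace-delimited token of each input line is the key and the rest is the value; the function names `to_record` and `map_lines` are hypothetical and not taken from the asker's code.

```python
def to_record(line):
    """Turn one input line into a tab-separated key/value record.

    Assumption: the first whitespace-delimited token is the key,
    and everything after it is the value.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # skip blank lines
    key = parts[0]
    value = parts[1] if len(parts) > 1 else ""
    return key + "\t" + value


def map_lines(lines):
    """Yield one key<TAB>value record per non-blank input line."""
    for line in lines:
        record = to_record(line)
        if record is not None:
            yield record
```

In an actual mapper.py you would feed it standard input, for example `for record in map_lines(sys.stdin): print(record)`. Hadoop streaming then treats everything before the first tab as the key it groups and sorts on.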

EDIT: Added the following answer to the question "How to make it sort numerically (e.g., 9 before 10)?"

Alternative 1: Prepend zeroes to your keys so that they all have the same size. "09" comes before "10".
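A small sketch of the zero-padding idea, assuming the keys are non-negative integers and that a fixed width (a hypothetical parameter here) is large enough for the biggest key:

```python
def pad_key(key, width=10):
    """Left-pad a non-negative integer key with zeroes to a fixed width.

    Assumption: `width` is at least as long as the largest key.
    """
    return str(key).zfill(width)


keys = ["10", "9", "100"]

# Plain string sorting compares character by character, so "9" lands last.
unpadded = sorted(keys)

# With equal-length keys, string order matches numeric order.
padded = sorted(pad_key(k, width=3) for k in keys)
```

The reducer can recover the original number with `int(padded_key)` if the leading zeroes should not appear in the final output.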

Alternative 2: Use the KeyFieldBasedComparator, as indicated in this SO question.
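As a rough sketch of how the comparator might be wired into a streaming job, the helper below only assembles the command-line arguments; it does not run anything. The function name, the jar location, and the input/output paths are placeholders, and the property names are the ones used in the Hadoop 2+ streaming documentation (older releases use `mapred.*` equivalents), so check them against your installation.

```python
def build_streaming_command(streaming_jar, input_path, output_path):
    """Assemble a hadoop-streaming argument list that sorts keys numerically.

    All three parameters are placeholders supplied by the caller.
    Generic -D options must come before the streaming-specific options.
    """
    comparator = "org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator"
    return [
        "hadoop", "jar", streaming_jar,
        "-D", "mapreduce.job.output.key.comparator.class=" + comparator,
        # -k1,1n: compare on the first key field only, numerically
        "-D", "mapreduce.partition.keycomparator.options=-k1,1n",
        "-input", input_path,
        "-output", output_path,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
    ]
```

The resulting list could be passed to `subprocess.run`, or joined with spaces and pasted into a shell. Depending on your setup you may also need to ship mapper.py and reducer.py to the cluster with the job.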

score 1 · Accepted Answer

No. As stated here:

If the number of reduce tasks is not zero, the Hadoop framework sorts the results. There is no way around it.
