hadoop - Hadoop map-reduce: order of records while grouping

Question

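Each line of my input is a record, and each record has about 10 fields. First, I group the records by three fields (field1, field2, field3) so that a single mapper/reducer is responsible for one unique group (based on the three fields). Within each group, I sort the records by another integer field, timestamp, and add another field to each record in the group, giving them all the same tag, aTag.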

Suppose mapper#1 tags its sorted group with aTag, and mapper#2 tags a different group (different because the records were first grouped by the three fields) with the same tag, aTag.

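Now, when I group the records by the tag field (that is, when I combine groups from different mappers), I notice that the order within each group is not preserved. Since each mapper holds a group whose records all carry the same tag, I expected that grouping by tag name would only need to fetch the relevant groups from the other mappers and concatenate them, without re-sorting the individual groups.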

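Is this because I am storing the records in gzip format, so that the records get reordered to improve compression? I would also like to know how to preserve the order after grouping by tag name.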

score 2 · Accepted Answer

It seems that you are implementing the sort step of MapReduce yourself in local memory, but the framework then ignores what you did and re-sorts the items in each group anyway. The proper way to fix this is to specify a comparator on the keys, so that within each partition the merged input to the reducer is ordered according to that comparison function. This means that:

You don't have to do the sorting yourself.
You don't run out of memory on one machine trying to sort a really large group.

In your case, it seems you'd want to add timestamp to the set of keys, tell Hadoop to partition on the first three keys, and tell it to sort on the timestamp.
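The idea can be sketched without Hadoop at all. The following plain-JDK simulation is only an illustration under stated assumptions: the class SecondarySortSketch and its members (Key, partition, SORT_ORDER, reducerInput) are hypothetical names, not Hadoop API. It uses a composite key of the three grouping fields plus timestamp, partitions on the three fields only, and sorts on the full key, so each group reaches its reducer already ordered by timestamp. In a real job, the corresponding hooks would be Job.setPartitionerClass, Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK simulation of a secondary sort; no Hadoop classes are used.
// All names here are hypothetical and chosen for illustration only.
class SecondarySortSketch {

    // Composite key: the three grouping fields plus the timestamp to sort on.
    static final class Key {
        final String field1, field2, field3;
        final long timestamp;

        Key(String field1, String field2, String field3, long timestamp) {
            this.field1 = field1;
            this.field2 = field2;
            this.field3 = field3;
            this.timestamp = timestamp;
        }

        // The part of the key used for partitioning and grouping.
        List<String> groupPart() {
            return Arrays.asList(field1, field2, field3);
        }
    }

    // Partition on the first three fields only, so that every record of a
    // group lands on the same reducer regardless of its timestamp.
    static int partition(Key key, int numReducers) {
        return Math.floorMod(key.groupPart().hashCode(), numReducers);
    }

    // Sort order: grouping fields first, then timestamp within the group.
    static final Comparator<Key> SORT_ORDER = Comparator
            .comparing((Key k) -> k.field1)
            .thenComparing(k -> k.field2)
            .thenComparing(k -> k.field3)
            .thenComparingLong(k -> k.timestamp);

    // Simulates what one reducer receives: the keys of its partition,
    // sorted by SORT_ORDER and then grouped by the three fields.
    static Map<List<String>, List<Long>> reducerInput(
            List<Key> mapOutput, int reducer, int numReducers) {
        List<Key> mine = new ArrayList<>();
        for (Key k : mapOutput) {
            if (partition(k, numReducers) == reducer) {
                mine.add(k);
            }
        }
        mine.sort(SORT_ORDER);

        Map<List<String>, List<Long>> groups = new LinkedHashMap<>();
        for (Key k : mine) {
            groups.computeIfAbsent(k.groupPart(), g -> new ArrayList<>())
                  .add(k.timestamp);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> mapOutput = Arrays.asList(
                new Key("a", "b", "c", 3), new Key("a", "b", "d", 7),
                new Key("a", "b", "c", 1), new Key("a", "b", "d", 5),
                new Key("a", "b", "c", 2));
        // With a single reducer, prints {[a, b, c]=[1, 2, 3], [a, b, d]=[5, 7]}
        System.out.println(reducerInput(mapOutput, 0, 1));
    }
}
```

Because the ordering comes from the key comparator rather than from sorting values in memory, no single machine has to hold and sort a whole group at once, which matches the two points listed above.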

For more information, see the following diagram and the question "Where is Sort used in MapReduce phase and why?"

[image: diagram]
