python - Finding the smallest number with Hadoop Streaming in Python

Question

I am new to the Hadoop framework and the MapReduce abstraction.

Basically, I want to find the smallest number in a huge text file whose values are separated by ",".

Here is my code, mapper.py:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into numbers
    numbers = line.split(",")
    for number in numbers:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the value is always 1
        print('%s\t%s' % (number, 1))

And the reducer, reducer.py:

#!/usr/bin/env python

from operator import itemgetter
import sys

smallest_number = sys.float_info.max
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    number, count = line.split('\t', 1)
    try:
        number = float(number)
    except ValueError:
        continue

    if number < smallest_number:
        smallest_number = number
        print(smallest_number)  # <---- I think the error is here... there is no key/value pair

    print(smallest_number)

The error I get:

12/10/04 12:07:22 ERROR streaming.StreamJob: Job not successful. Error: NA
12/10/04 12:07:22 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

score 0 · Accepted Answer

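First, note that your solution will not work unless you use only one reducer. If you use multiple reducers, each reducer prints the smallest number it receives, and you end up with more than one number.

The next question is then: if this problem has to use only one reducer (that is, only one reduce task), what do you gain by using MapReduce? The trick is that the mappers run in parallel. However, you do not want the mappers to output every number they read. Otherwise the single reducer has to look through the entire data set, which is no improvement over a sequential solution. The way to fix this is to have each mapper output only the smallest number it reads. In addition, since you want all mapper output to go to the same reducer, every mapper must emit its result under the same key (the mapper below uses 0).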

The mapper looks like this:

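#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # split the line into numbers and take the smallest one on this line
    s = min(float(x) for x in line.split(","))
    if smallest is None or s < smallest:
        smallest = s

# emit one record with the constant key 0 so that every mapper's output
# reaches the same reducer; %r keeps full float precision (%f rounds to 6 decimals)
if smallest is not None:
    print('%d\t%r' % (0, smallest))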

The reducer:

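#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # each line is "key<TAB>value"; the value is one mapper's local minimum
    s = float(line.split('\t')[1])
    if smallest is None or s < smallest:
        smallest = s

# print the global minimum, if any input was received
if smallest is not None:
    print(smallest)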
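
To check the idea without a cluster, the flow can be imitated in plain Python. The following is only a rough sketch: the functions `run_mapper` and `run_reducer` and the sample `splits` are made up for illustration, and the way the input is divided into three splits is an assumption, not something Hadoop guarantees.

```python
def run_mapper(lines):
    """Mimic the mapper: return at most one 'key<TAB>value' line with the split's minimum."""
    smallest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        s = min(float(x) for x in line.split(","))
        if smallest is None or s < smallest:
            smallest = s
    # Constant key 0, so all mapper output would be routed to one reducer.
    return [] if smallest is None else ["0\t%r" % smallest]


def run_reducer(lines):
    """Mimic the reducer: return the minimum over all values it receives."""
    smallest = None
    for line in lines:
        s = float(line.strip().split("\t")[1])
        if smallest is None or s < smallest:
            smallest = s
    return smallest


# Hypothetical input, divided into three splits (one per simulated mapper).
splits = [
    ["5,3,9", "12,7"],
    ["4.5,8", "2,11"],
    ["6,10,2.5"],
]

# Each simulated mapper sees one split and emits a single line.
mapper_output = []
for split in splits:
    mapper_output.extend(run_mapper(split))

# sorted() stands in for the shuffle/sort phase before the single reducer.
result = run_reducer(sorted(mapper_output))
print(mapper_output)
print(result)
```

Ten input numbers produce only three records for the reducer, one per simulated mapper, which is the point of emitting local minima.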

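There are other ways to solve this problem. For example, you could use the MapReduce framework itself to sort the numbers, so that the first number the reducer receives is the smallest. If you want to understand the MapReduce programming paradigm better, you can read this tutorial and the examples on my blog.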
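
The sorting alternative could look roughly like the sketch below, again simulated in plain Python. The names `sort_mapper` and `first_key_reducer` and the sample input are hypothetical. One important assumption: Hadoop Streaming compares keys as text by default, so "10" would sort before "9". The job would therefore have to be configured for numeric key comparison (Hadoop's `KeyFieldBasedComparator` is one possible route), which this sketch imitates by sorting on the parsed float.

```python
def sort_mapper(lines):
    """Emit every number as a key with an empty value: 'number<TAB>'."""
    for line in lines:
        for token in line.strip().split(","):
            try:
                yield "%r\t" % float(token)
            except ValueError:
                continue  # skip anything that is not a number


def first_key_reducer(sorted_lines):
    """Return the key of the first line; in numerically sorted input that is the minimum."""
    for line in sorted_lines:
        return float(line.split("\t", 1)[0])
    return None  # no input at all


lines = ["5,3,9", "12,7", "4.5,8", "2,11"]  # hypothetical input

# Stand-in for the shuffle phase, ASSUMING numeric key comparison is configured.
shuffled = sorted(sort_mapper(lines), key=lambda l: float(l.split("\t", 1)[0]))

minimum = first_key_reducer(shuffled)
print(minimum)
```

Unlike the local-minimum approach, this variant sends every number through the shuffle, so it mainly serves to illustrate how the framework's sort can do the work.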
