I'm new to MRJob and MapReduce, and I have a question about the classic MRJob word-count example in Python:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()

Instead of yielding a tuple, can I store word, sum(occurrences) in a dictionary so that I can access it later? What is the syntax for doing this? Thanks!

2 Answers

You can simply build and return a list instead of using yield.

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        results = []
        for word in line.split():
            results.append((word, 1))  # note: append a (key, value) tuple here
        return results

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
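
To see why a returned list can stand in for yield, here is a minimal sketch in plain Python with no mrjob involved. It assumes only that the caller iterates over whatever the mapper gives back; mapper_with_yield and mapper_with_list are made-up names for illustration.

```python
def mapper_with_yield(line):
    # Generator version: pairs are produced lazily, one at a time.
    for word in line.split():
        yield word, 1


def mapper_with_list(line):
    # List version: every pair is built up front and returned together.
    return [(word, 1) for word in line.split()]


line = "the quick the lazy"
pairs_from_yield = list(mapper_with_yield(line))
pairs_from_list = list(mapper_with_list(line))

# Both versions hand the same (word, 1) pairs to whoever iterates over them.
print(pairs_from_yield == pairs_from_list)
```

The list version holds every pair for a line in memory at once, which is fine for short lines but gives up the laziness of the generator.
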
answered 2012-12-13T07:38:08.777

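Keep in mind that your job may run on another machine (on EMR or a Hadoop cluster, for example), so the mapper and reducer cannot share an in-memory dictionary with the code that launches them. Inputs and outputs are handled by the script that runs your module.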

If you want to use the output of your job, you'll need to either read it from wherever you wrote it (standard output by default) or run the job programmatically.

It sounds like you want the latter. In a separate module, you'll want to do something like:

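from mr_word_counter import MRWordCounter  # assumed module name for the job class above

mr_job = MRWordCounter(args=['-r', 'emr'])
with mr_job.make_runner() as runner:
    runner.run()
    # mrjob 0.6+ removed stream_output()/parse_output_line(); there, use
    # "for key, value in mr_job.parse_output(runner.cat_output()):" instead
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        ... # do something with the parsed output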

Check out the docs for more details. The code sample above was taken from: http://pythonhosted.org/mrjob/guides/runners.html#runners-programmatically
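
For the other option mentioned above, reading the output back from wherever it was written, a minimal sketch of loading it into a dictionary might look like this. It assumes the job used mrjob's default output protocol, which writes each result as a JSON-encoded key, a tab, and a JSON-encoded value; parse_output_lines and the sample lines are hypothetical, made up for illustration.

```python
import json


def parse_output_lines(lines):
    """Build a {key: value} dict from lines shaped like: JSON key, tab, JSON value."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        counts[json.loads(raw_key)] = json.loads(raw_value)
    return counts


# Sample lines shaped like word-count output (made-up data for illustration).
sample_output = ['"hello"\t2\n', '"world"\t1\n']
word_counts = parse_output_lines(sample_output)
print(word_counts["hello"])
```

With a real run you would pass an open file object instead of the sample list, since iterating over a file also yields one line at a time.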

answered 2013-07-09T21:16:56.040