python - Python Hadoop code for the max/min temperature of a given dataset

Question

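I am trying to write a mapper/reducer program that computes the maximum/minimum temperature from a dataset. I tried to modify the code myself, but it does not work. With my changes the mapper runs fine, but the reducer does not.

My sample code: mapper.py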

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[14:18], val[25:30], val[31:32])
  if (temp != "9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)

reducer.py

import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
  print "%s\t%s" % (last_key, max_val)

Sample line from the file:

690190,13910, 2012**0101, * 42.9,18 , 29.4,18, 1033.3,18, 968.7,18, 10.0,18, 8.7,18, 15.0, 999.9, 52.5 , 31.6*, 0.000I,999.90, 000

I need the values in bold (the year and the mean temperature). Any ideas?

This is the output when I run the mapper on its own as a plain script:

root@ubuntu:/home/hduser/files# python maxtemp-map.py
2012    42.9
2012    50.0
2012    47.0
2012    52.0
2012    43.4
2012    52.6
2012    51.1
2012    50.9
2012    57.8
2012    50.7
2012    44.6
2012    46.7
2012    52.1
2012    48.4
2012    47.1
2012    51.8
2012    50.6
2012    53.4
2012    62.9
2012    62.6

The file contains data for different years. I need to compute the minimum, maximum, and average for each year.

FIELD   POSITION  TYPE   DESCRIPTION

STN---  1-6       Int.   Station number (WMO/DATSAV3 number)
                         for the location.

WBAN    8-12      Int.   WBAN number where applicable--this is the
                         historical 
YEAR    15-18     Int.   The year.

MODA    19-22     Int.   The month and day.

TEMP    25-30     Real   Mean temperature. Missing = 9999.9


Count   32-33     Int.   Number of observations in mean temperature
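
The POSITION column above is 1-based and inclusive, so a field at positions a-b corresponds to the Python slice [a-1:b]. Here is a minimal sketch of a parser built on that reading. The name parse_record and the sample record are illustrative assumptions, not part of the original data or code:

```python
MISSING_TEMP = 9999.9  # sentinel for a missing TEMP value, per the field table


def parse_record(line):
    """Return (year, temp) for one fixed-width record, or None if unusable."""
    year = line[14:18]        # YEAR, positions 15-18
    raw_temp = line[24:30]    # TEMP, positions 25-30
    try:
        temp = float(raw_temp)
    except ValueError:
        return None           # header line or malformed record
    if not year.isdigit() or temp == MISSING_TEMP:
        return None
    return year, temp


# Hypothetical record laid out according to the field table.
sample = "690190,13910, 20120101,   42.9,18"
print(parse_record(sample))
```

Note that the slice val[25:30] in the mapper starts one character later than position 25, which only works as long as the temperature has a leading space.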

score 0 · Accepted Answer

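I am having trouble parsing your question, but I think it boils down to this:

You have a dataset, and each line of the dataset represents various quantities related to a single point in time. You want to extract the max/min of one of these quantities from the whole dataset.

If that is the case, I would do something like this: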

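# file_name is the path to your data file
temps = []
with open(file_name, 'r') as infile:
    for line in infile:
        fields = line.strip().split(',')
        year = int(fields[2].strip()[:4])  # date field looks like 20120101
        temp = float(fields[3])            # temperatures are decimals, e.g. 42.9
        temps.append((temp, year))

# Tuples sort by temperature first, so the ends hold the extremes
temps = sorted(temps)
min_temp, min_year = temps[0]
max_temp, max_year = temps[-1]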

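Edit:

Farley, I think what you are doing with the mapper/reducer may be overkill for what you need from your data. Here are some additional questions about your initial file structure.

What (specifically) does each line of the dataset contain? e.g. date, time, temp, pressure, ...
Which data do you want to extract from each line? The temperature? Where in the line is that data located?
Does each file contain data for only one year?

For example, if your dataset looks like this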

year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...

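then the simplest thing to do is to loop over each line and extract the relevant information. It looks like you only want the year and the temperature. In this example, these are at positions 0 and 3 in each line. So we will have a loop that looks like this: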

from collections import defaultdict
data = defaultdict(list)  # maps year -> list of temperatures

with open(file_name, 'r') as infile:
    for line in infile:
        fields = line.strip().split(', ')
        year = fields[0]
        temp = float(fields[3])  # store numbers, not strings
        data[year].append(temp)

Here we extracted the year and temp from each line of the file and stored them in a special dictionary object. If you print it out, it will look like this:

year1: [temp1, temp2, temp3, temp4]
year2: [temp5, temp6, temp7, temp8]
year3: [temp9, temp10, temp11, temp12]
year4: [temp13, temp14, temp15, temp16]

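This makes it very convenient to compute statistics over all the temperatures of a given year. For example, to compute the maximum, minimum, and average temperature, you could do: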

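import numpy as np
for year in data:
    # dtype=float also converts values that were stored as strings
    temps = np.array(data[year], dtype=float)
    output = (year, temps.mean(), temps.min(), temps.max())
    # Unpack the tuple so each field fills its own placeholder
    print('Year: {0} Avg: {1} Min: {2} Max: {3}'.format(*output))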
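
If you do want to stay with Hadoop Streaming, the same per-year statistics can be computed in a reducer without NumPy. One likely reason the reducer in the question fails is that it calls int() on values such as 42.9, which raises ValueError, so this sketch uses float() instead. It is only a sketch: the function name reduce_stats and the sample lines are made up for illustration, and it assumes the input arrives sorted by year, as Hadoop Streaming delivers reducer input sorted by key.

```python
def reduce_stats(lines):
    """Yield (year, min, max, mean) from tab-separated 'year<TAB>temp' lines.

    Assumes the lines are already sorted by year.
    """
    last_key = None
    lo = hi = total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, val = line.split("\t")
        temp = float(val)  # float(), not int(): values look like 42.9
        if key != last_key:
            if last_key is not None:
                # The previous year is complete, so emit its statistics
                yield last_key, lo, hi, total / count
            last_key, lo, hi, total, count = key, temp, temp, 0.0, 0
        lo = min(lo, temp)
        hi = max(hi, temp)
        total += temp
        count += 1
    if last_key is not None:
        yield last_key, lo, hi, total / count


# Illustrative input; in a streaming job you would pass sys.stdin instead.
sample = ["2011\t40.0", "2011\t50.0", "2012\t42.9", "2012\t62.9"]
for year, lo, hi, mean in reduce_stats(sample):
    print("%s\t%.1f\t%.1f\t%.1f" % (year, lo, hi, mean))
```

Keeping only a running minimum, maximum, sum, and count means the reducer never holds a whole year of readings in memory.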

I would be glad to help you solve your problem, but I need you to be more specific about what your data looks like and what you want to extract from it.
