3

私の Hadoop ジョブは、最後の削減タスクでスタックしました。3 つの異なるホストから 1 つのホストへの接続タイムアウトが何度も発生しました。ただし、問題のあるマシンに対して、他のどのマシンからでも ping を実行できます。

これは 5 ノードのクラスターです。ごく最近建てられたものです。Hadoop と pig の同じバイナリがあります。新しい機械が 3 台、古い機械が 2 台あります。2 台の古いマシンを削除した場合、正常に動作しています。

問題のある古いマシン バージョン:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.13) (6b20-1.9.13-0ubuntu1~10.10.1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

新しい作業機のバージョン:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

また、ログから異常な量 (180) の複製ホストを見てきました。

2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:37:18,416 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #6 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 48 seconds
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 13 seconds.
2012-08-28 12:38:08,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:41:08,537 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #7 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 62 seconds
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 17 seconds.
2012-08-28 12:42:13,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)

その他のログ:

2012-09-06 11:01:47,719 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201209041720_0010_r_000014_0 1.0% reduce > reduce
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201209041720_0010_r_000014_0 is done.
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201209041720_0010_r_000014_0  was -1
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 6
2012-09-06 11:01:47,769 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_1617080101 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:47,913 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_-434797671 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000020_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000020_0
4

4 に答える 4

3

デフォルトの投機的実行設定を使用していますか?(4回の試行)-これが役立つかもしれない記事です。

于 2012-09-06T19:46:17.323 に答える
0

ハードドライブが遅いことが原因である可能性があります。新しいマシン(Samsung 830 SSD)では、転送速度が3MB / sで、低速のマシン(回転するハードドライブ)では0.12MB/sです。おそらく、大きなレデューサーの仕事があり、遅すぎます。

于 2012-09-08T00:16:18.653 に答える
0

タスク トラッカーは、HADOOP ジョブが削減フェーズにある場合、データ ノード間で中間メディアをシャッフルする必要があります。スレーブ構成ファイルでデータ ノードのホスト名を設定すると、接続タイムアウト エラーが発生し、ホスト名を解決できません。ホスト名と IP アドレスを /etc/hosts に追加するのが簡単な解決策です。

于 2013-09-23T05:33:25.383 に答える
-1

なぜこれが起こっているのかわかりませんが、同じ問題がありました...私のデータノードは、実際の「マスター」ネームノードではなく、localhost のネームノードにアクセスしようとしていました...

hostsファイルに移動してこれを解決しました:

etc/hosts

そこからlocalhost行を削除します。これを修正するためのよりクリーンな方法があるはずだと確信していますが、この解決策は私にとってはうまくいきました。

データノードがローカルホストの名前ノードにアクセスしようとするのを防ぐためのよりクリーンな方法を誰かが持っている場合は、私に知らせてください!!!

于 2013-09-04T20:22:07.010 に答える