java - Hadoop cluster hangs at [Reduce] > [copy] > [copy]

Question

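So far I have tried the solutions to this problem from here (1) and here (2). With those solutions the mapreduce task does run, but it appears to run only on the name node, because I get output similar to what is shown here (3).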

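Basically, I am running a 2-node cluster with a mapreduce algorithm I designed myself. The mapreduce jar runs perfectly on a single-node cluster, which makes me think something is wrong with my Hadoop multi-node configuration. To set up multi-node, I followed the tutorial here.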

To report what goes wrong: when I run my program (after checking that the namenodes, tasktrackers, jobtrackers, and datanodes are running on their respective nodes), it halts at this line in the terminal:

INFO mapred.JobClient: map 100% reduce 0%

Looking at the task logs, I see copy failed: attempt... from slave-node, followed by a SocketTimeoutException.

Looking at the logs on the slave node (DataNode), execution has stopped at the following line:

TaskTracker: attempt... 0.0% reduce > copy >
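
In Hadoop 1.x, reducers fetch map output from each TaskTracker over HTTP during the copy phase, so a SocketTimeoutException at this point often means the reducer cannot reach the address that the other node advertised. A minimal sketch of a reachability check follows. The function name `can_connect` is hypothetical, and port 50060 (the Hadoop 1.x default TaskTracker HTTP port) is an assumption, since the question does not show `mapred.task.tracker.http.address`.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name (using /etc/hosts, among other sources)
        # and then attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False
```

Running `can_connect("slave", 50060)` on the master and `can_connect("master", 50060)` on the slave would show whether each node can open a connection to the other under the names used in the hosts files. A `False` result does not say why the connection failed, only that it did.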

As the solutions in links 1 and 2 suggest, removing the various IP addresses from the etc/hosts file makes the job run successfully, but I then get entries like those in link 4 in the slave node (DataNode) log, for example:

INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381

WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.
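
One rough way to see whether a node's TaskTracker log mentions a job at all is to scan it for job IDs. The sketch below does that with a regular expression. The function name `count_job_mentions` is hypothetical, and the sample input is just the two log lines quoted above. A mention in the log is only a hint and does not prove that tasks for that job ran on the node.

```python
import re
from collections import Counter

# Matches IDs of the form shown in the logs above, e.g. job_201201301055_0381.
JOB_ID = re.compile(r"job_\d+_\d+")


def count_job_mentions(log_lines):
    """Count how many log lines mention each job ID."""
    counts = Counter()
    for line in log_lines:
        # set() so a line that repeats the same ID is counted once.
        for job_id in set(JOB_ID.findall(line)):
            counts[job_id] += 1
    return counts


sample = [
    "INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201201301055_0381",
    "WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201201301055_0381 being deleted.",
]
print(count_job_mentions(sample))  # Counter({'job_201201301055_0381': 2})
```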

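As a new Hadoop user this looks suspicious to me, although it may be perfectly normal to see this. To me it looks as if something was pointing at an incorrect IP address in the hosts file, and that by removing this IP address I have simply halted execution on the slave node, with processing continuing on the namenode instead (which isn't really advantageous at all).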

To sum up:

Is this output expected?
Is there a way to see what was executed on which node after the run?
Can anyone spot anything I may have done wrong?

EDIT: added the hosts and configuration files for each node

Master: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Dell-System-XPS-L702X

#The following lines are for hadoop master/slave setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Slave: etc/hosts

127.0.0.1       localhost
127.0.1.1       joseph-Home # this line was incorrect, it was set as 7.0.1.1

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Master: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/tmp</value>
    <description>A base for other temporary directories.</description>
</property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Slave: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
        <description>The name of the default file system. A URI whose
        scheme and authority determine the FileSystem implementation. The
        uri’s scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class. The uri’s authority is used to
        determine the host, port, etc. for a filesystem.</description>
    </property>

</configuration>

Master: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Slave: hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>
</configuration>

Master: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

Slave: mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>mapred.job.tracker</name>
        <value>master:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at. If “local”, then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>

</configuration>
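
Because the master and slave copies of each site file are meant to agree, a small script can confirm that rather than comparing them by eye. The following is a sketch only: the helper names `read_properties` and `diff_properties` are hypothetical, and the inline XML is a shortened stand-in for the files listed above.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(master, slave):
    """Return {name: (master_value, slave_value)} for properties that differ or are missing on one side."""
    names = set(master) | set(slave)
    return {
        n: (master.get(n), slave.get(n))
        for n in sorted(names)
        if master.get(n) != slave.get(n)
    }


master_xml = """<configuration>
  <property><name>mapred.job.tracker</name><value>master:54311</value></property>
</configuration>"""
slave_xml = master_xml  # identical here, as in the files above

print(diff_properties(read_properties(master_xml), read_properties(slave_xml)))  # {}
```

An empty result means both files define the same properties with the same values. It says nothing about whether those values are correct for the cluster.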

score 2 · Accepted Answer

The error is in etc/hosts:

During the erroneous runs, the slave etc/hosts file looked like this:

127.0.0.1       localhost
7.0.1.1       joseph-Home # THIS LINE IS INCORRECT, IT SHOULD BE 127.0.1.1

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

As you may have spotted, the IP address of this computer, 'joseph-Home', was incorrectly configured: it was set to 7.0.1.1 when it should have been 127.0.1.1. Changing line 2 of the slave's etc/hosts file to 127.0.1.1 joseph-Home fixed the issue, and my logs now appear normally on the slave node.

New etc/hosts file:

127.0.0.1       localhost
127.0.1.1       joseph-Home # corrected; this was previously set to 7.0.1.1

#the following lines are for hadoop multi-node cluster setup
192.168.1.87    master
192.168.1.74    slave

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
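
A typo like this is easy to miss by eye, so here is a minimal sketch of a check that flags hosts-file entries whose IPv4 address is neither loopback nor in a private range. The function names are hypothetical, and the rule is only a heuristic that assumes a home or lab network like the one above. A cluster that legitimately uses public addresses would produce false positives.

```python
import ipaddress


def parse_hosts(text):
    """Yield (address, hostnames) for each entry in hosts-file text, ignoring comments."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        yield address, names


def suspicious_ipv4_entries(text):
    """Return entries that are not valid IPs, or whose IPv4 address is neither loopback nor private."""
    flagged = []
    for address, names in parse_hosts(text):
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            flagged.append((address, names))  # not a valid IP address at all
            continue
        # IPv6 entries are skipped; only IPv4 addresses are checked here.
        if ip.version == 4 and not (ip.is_loopback or ip.is_private):
            flagged.append((address, names))
    return flagged


broken = """127.0.0.1       localhost
7.0.1.1       joseph-Home
192.168.1.87    master
192.168.1.74    slave
::1     ip6-localhost ip6-loopback
"""
print(suspicious_ipv4_entries(broken))  # [('7.0.1.1', ['joseph-Home'])]
```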

score 0 · Accepted Answer

A tested solution is to add the following property to hadoop-env.sh and restart all Hadoop cluster services.

hadoop-env.sh

export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"

score 0 · Accepted Answer

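I ran into this issue today as well. In my case the problem was that the disk on one node in the cluster was full, so Hadoop could not write its log files to the local disk. A possible fix is therefore to delete unused files from the local disk. Hope it helps.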
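
A quick way to check for this condition on each node is to look at the free space on the filesystem that holds Hadoop's local directories. The sketch below uses Python's standard `shutil.disk_usage`. The function names and the 5% threshold are assumptions, and which path to check (for example the `hadoop.tmp.dir` value `/home/hduser/tmp` from the question, or the log directory) depends on the installation.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_nearly_full(path, min_free=0.05):
    """Return True if less than `min_free` of the filesystem containing `path` is free."""
    return free_fraction(path) < min_free
```

For example, `is_nearly_full("/home/hduser/tmp")` run on every node would point out the one whose disk has filled up.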
