spring - Hadoop JobFactoryBean: wiring multiple reducers on a single Hadoop node

Question

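What I want to achieve:

I have set up a Spring Batch job containing a Hadoop task to process some large files. To run multiple reducers for the job, I need to set the number of reducers with setNumReduceTasks. I am trying to configure this through the JobFactoryBean.

My bean configuration in classpath:/META-INF/spring/batch-common.xml: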

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="jobFactoryBean" class="org.springframework.data.hadoop.mapreduce.JobFactoryBean" p:numberReducers="5"/>
    <bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean" />
    <bean id="transactionManager" class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>
    <bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher" p:jobRepository-ref="jobRepository" />
</beans>

The XML is included as follows:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">

    <context:property-placeholder location="classpath:batch.properties,classpath:hadoop.properties"
            ignore-resource-not-found="true" ignore-unresolvable="true" />


    <import resource="classpath:/META-INF/spring/batch-common.xml" />
    <import resource="classpath:/META-INF/spring/hadoop-context.xml" />
    <import resource="classpath:/META-INF/spring/sort-context.xml" />

</beans>

I obtain the beans for the JUnit test as follows:

    JobLauncher launcher = ctx.getBean(JobLauncher.class);
    Map<String, Job> jobs = ctx.getBeansOfType(Job.class);
    JobFactoryBean jfb = ctx.getBean(JobFactoryBean.class);

The JUnit test stops with the error:

No bean named '&jobFactoryBean' is defined

In other words, the JobFactoryBean is not loaded, while the other beans load correctly and without errors.

Without the line

JobFactoryBean jfb = ctx.getBean(JobFactoryBean.class);

the project's tests run, but there is only one reducer per job.

The call

ctx.getBean("jobFactoryBean");

returns a Hadoop Job. I expected to get the factory bean there...
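For context on the lookup above: in Spring, asking for a FactoryBean by its plain bean name returns the object the factory produces, and prefixing the name with `&` (as in `ctx.getBean("&jobFactoryBean")`) returns the factory itself, which is why `&jobFactoryBean` shows up in the error message. The following is only a toy model of that naming convention in plain Java, not Spring's actual implementation; `ToyContainer`, `registerFactory`, and the `"hadoop-job"` string are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy stand-in for a Spring-style container (hypothetical, NOT Spring's code).
final class ToyContainer {
    // Prefix that asks for the factory itself rather than its product,
    // mirroring the "&" convention seen in the error message.
    static final String FACTORY_PREFIX = "&";

    private final Map<String, Supplier<?>> factories = new HashMap<>();

    void registerFactory(String name, Supplier<?> factory) {
        factories.put(name, factory);
    }

    Object getBean(String name) {
        boolean wantsFactory = name.startsWith(FACTORY_PREFIX);
        String plainName = wantsFactory ? name.substring(FACTORY_PREFIX.length()) : name;
        Supplier<?> factory = factories.get(plainName);
        if (factory == null) {
            throw new IllegalArgumentException("No bean named '" + name + "' is defined");
        }
        // Plain name: hand out the product. Prefixed name: hand out the factory.
        return wantsFactory ? factory : factory.get();
    }
}

final class FactoryLookupDemo {
    public static void main(String[] args) {
        ToyContainer ctx = new ToyContainer();
        // A String stands in for the Hadoop Job that the real factory would build.
        ctx.registerFactory("jobFactoryBean", () -> "hadoop-job");
        System.out.println(ctx.getBean("jobFactoryBean"));                      // prints "hadoop-job"
        System.out.println(ctx.getBean("&jobFactoryBean") instanceof Supplier); // prints "true"
    }
}
```

This only illustrates why the plain name yields a Job; it does not explain why the by-type lookup fails in this particular context.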

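To test this, I extended the Reducer's constructor to log every time a Reducer is created, so that I am notified whenever one is instantiated. So far there is only one entry in the log.

I have two VMs, each with two cores and 2 GB of RAM, and I am trying to sort a 75 MB file made up of several books from Project Gutenberg.

Edit:

Another thing I tried was setting the number of reducers for the Hadoop job through a property, but with no effect: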

<job id="search-jobSherlockOk" input-path="${sherlock.input.path}"
    output-path="${sherlockOK.output.path}"
    mapper="com.romediusweiss.hadoopSort.mapReduce.SortMapperWords"
    reducer="com.romediusweiss.hadoopSort.mapReduce.SortBlockReducer"
    partitioner="com.romediusweiss.hadoopSort.mapReduce.SortPartitioner"
    number-reducers="2"
    validate-paths="false" />

The following setting in mapreduce-site.xml is present on both nodes:

<property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>10</value>
</property>

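...Why:

I would like to reproduce the example from the following blog post: http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

To test the partitioner's behavior, I need several reducers, either on the same machine or in a fully distributed environment. The first approach is easier.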
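To illustrate why a single reducer hides partitioner behavior, here is a minimal plain-Java sketch. It applies the formula used by Hadoop's default HashPartitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, to ordinary Strings. Using `String` keys instead of Hadoop's `Text`, and the names `PartitionSketch`, `partitionFor`, and `assign`, are assumptions made for illustration; the custom SortPartitioner from the question is not shown in the post and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSketch {
    // Hash-partitioning formula, applied to plain Strings (assumption:
    // real Hadoop keys are Writables such as Text, with their own hashCode).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the partition (reducer index) they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numReduceTasks), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");
        System.out.println(assign(keys, 1)); // prints {0=[a, b, c, d]}
        System.out.println(assign(keys, 2)); // prints {0=[b, d], 1=[a, c]}
    }
}
```

With one reduce task every key lands in partition 0 regardless of the partitioning logic, so differences between partitioners only become observable with two or more reducers.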

P.S.: Could a high-reputation user create the tag "spring-data-hadoop"? Thank you.

score 1 · Accepted Answer

I answered this question where it was posted on the Spring forum (which I recommend using for Spring Data Hadoop questions).

The full answer is at http://forum.springsource.org/showthread.php?130500-Additional-Reducers, but in short, the number of reducers is determined by the number of input splits. See http://wiki.apache.org/hadoop/HowManyMapsAndReduces
