neural-network - Solving GridWorld using Q-Learning and function approximation

Question

I am studying the simple GridWorld problem (3x4, as described in Russell & Norvig Ch. 21.2). I have solved it using Q-Learning and a QTable, and now I would like to use a function approximator instead of a matrix.

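I am using MATLAB and have tried both neural networks and decision trees, but I am not getting the expected results: the policy found is a bad one. I have read several papers on the topic, but most of them are theoretical and say little about actual implementation.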

I am using offline learning because it is simpler. My approach is as follows:

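1. Initialize a decision tree (or NN) with 16 binary input units: one for each of the 12 positions in the grid plus one for each of the 4 possible actions (up, down, left, right).
2. Run many iterations, storing each qstate and its calculated qvalue in a training set.
3. Train the decision tree (or NN) on the training set.
4. Erase the training set and repeat from step 2, using the newly trained decision tree (or NN) to calculate the qvalues.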
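The 16-unit input can be pictured with a small sketch. The original `statetopair` is not shown, so the layout below is an assumption: units 1 to 12 mark the agent's cell (column-major order of the 3x4 grid) and units 13 to 16 mark the action, numbered 1 = up, 2 = down, 3 = left, 4 = right.

```matlab
% Hypothetical encoding; the real statetopair may order the units differently.
nRows = 3; nCols = 4; nActions = 4;
row = 2; col = 3; action = 4;           % example: agent at (2,3), moving right

sa_pair = zeros(1, nRows * nCols + nActions);
sa_pair(sub2ind([nRows, nCols], row, col)) = 1;   % position unit (1..12)
sa_pair(nRows * nCols + action) = 1;              % action unit (13..16)

disp(find(sa_pair));   % indices of the two active units
```

Each state-action vector built this way has exactly two active units, one for the cell and one for the action.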

It seems too simple to be true, and indeed I do not get the expected results. Here is some of the MATLAB code:

retrain = 1;
if (retrain)
    x = zeros(1, 16); % Training set: one row per state-action vector
    y = 0;            % Target qvalues
    t = 0;            % Iterations
end
% Fit an initial tree on the placeholder sample
tree = fitrtree(x, y);
x = zeros(1, 16);
y = 0;
for i = 1:100
    % Get the initial game state as a 3x4 matrix
    gamestate = initialstate();
    done = 0; % 'end' is a reserved keyword in MATLAB, so the flag is named 'done'
    while (done == 0)
        t = t + 1; % Increase the iteration

        % Get the index of the best action to take
        index = chooseaction(gamestate, tree);

        % Make the action and get the new game state and reward
        [newgamestate, reward] = makeaction(gamestate, index);

        % Get the state-action vector for the current gamestate and chosen action
        sa_pair = statetopair(gamestate, index);

        % Check for end of game
        if (isfinalstate(gamestate))
            done = 1;
            % Get the final reward
            reward = finalreward(gamestate);
        end

        % Add a sample to the training set (same for final and non-final states)
        x(end + 1, :) = sa_pair;
        y(end + 1, 1) = updateq(reward, gamestate, index, newgamestate, tree, t, done);

        % Update gamestate
        gamestate = newgamestate;
    end
end

I choose a random action half of the time. The updateq function is:

function [ q ] = updateq( reward, gamestate, index, newgamestate, tree, iteration, finalstate )

alfa = 1/iteration;
gamma = 0.99;

%Get the action with maximum qvalue in the new state s'
amax = chooseaction(newgamestate, tree);

%Get the corresponding state-action vectors
newsa_pair = statetopair(newgamestate, amax);    
sa_pair = statetopair(gamestate, index);

if(finalstate == 0)
    X = reward + gamma * predict(tree, newsa_pair);
else
    X = reward;
end

q = (1 - alfa) * predict(tree, sa_pair) + alfa * X;    

end
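To make the update concrete, here is one step of the same formula worked through with plain numbers. All values are illustrative assumptions, not taken from the post; `qOld` and `qNextMax` stand in for the two `predict(tree, ...)` calls.

```matlab
% Illustrative numbers only; none of them come from the original post.
iteration = 4;
alfa = 1 / iteration;      % 0.25
gamma = 0.99;
reward = -0.04;            % assumed step reward
qOld = 0.5;                % stand-in for predict(tree, sa_pair)
qNextMax = 0.8;            % stand-in for predict(tree, newsa_pair)

target = reward + gamma * qNextMax;       % -0.04 + 0.792 = 0.752
q = (1 - alfa) * qOld + alfa * target;    % 0.375 + 0.188 = 0.563
fprintf('target = %.3f, q = %.3f\n', target, q);
```

Because `alfa` is `1/iteration`, later samples move the stored value less and less toward the new target.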

Any suggestions are welcome!

score 2 · Accepted Answer

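The problem was that in offline Q-Learning you need to repeat the data-collection process at least n times, where n depends on the problem you are trying to model. If you analyze the qvalues calculated during each iteration, it quickly becomes clear why this is necessary.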

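In the first iteration you learn only the final states, in the second iteration you also learn the penultimate states, in the third iteration the third-to-last states, and so on. You are learning from the final state back to the initial state, propagating the qvalues backwards. In the GridWorld example, the minimum number of visited states needed to end the game is 6.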

In the end, the correct algorithm is:

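1. Initialize a decision tree (or NN) with 16 binary input units: one for each of the 12 positions in the grid plus one for each of the 4 possible actions (up, down, left, right).
2. Run many iterations (30 games are enough for this GridWorld example), storing each qstate and its calculated qvalue in a training set.
3. Train the decision tree (or NN) on the training set.
4. Erase the training set.
5. Repeat from step 2 at least n times, using the newly trained decision tree (or NN) to calculate the qvalues. n depends on your problem: in this GridWorld example n is 6, but repeating the process 7 to 8 times gives better results for all states.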
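The loop above can be sketched on a much smaller problem. This is a simplified illustration under stated assumptions, not the code from the post: a 6-cell corridor replaces the 3x4 grid, an exhaustive sweep over all state-action pairs replaces the 30 random games, and a least-squares fit on one-hot features replaces the decision tree. All names (`featIdx`, `learned`, `nBatches`) are hypothetical.

```matlab
% Sketch only: 6-cell corridor, goal just past cell 6, reward 1 on reaching it.
nPos = 6; nAct = 2;              % actions: 1 = left, 2 = right
gamma = 0.99;
nBatches = 6;                    % n = shortest path length to the goal
nFeat = nPos * nAct;
featIdx = @(s, a) (s - 1) * nAct + a;   % one-hot index of a state-action pair
w = zeros(nFeat, 1);             % model weights, all qvalues start at 0
learned = zeros(1, nBatches);    % cells with a non-zero qvalue after each batch

for batch = 1:nBatches
    X = zeros(0, nFeat);         % step 4: start from an empty training set
    y = zeros(0, 1);
    for s = 1:nPos               % step 2: collect samples with the frozen model
        for a = 1:nAct
            if a == 1
                sNext = max(s - 1, 1);
            else
                sNext = s + 1;
            end
            if sNext > nPos
                target = 1;      % terminal transition: reward only
            else
                target = gamma * max(w(featIdx(sNext, 1:nAct)));
            end
            phi = zeros(1, nFeat);
            phi(featIdx(s, a)) = 1;
            X(end + 1, :) = phi;
            y(end + 1, 1) = target;
        end
    end
    w = X \ y;                   % step 3: retrain the model on this batch
    learned(batch) = nnz(max(reshape(w, nAct, nPos), [], 1));
end

Q = reshape(w, nAct, nPos)';     % Q(s, a)
[~, policy] = max(Q, [], 2);     % greedy action per cell
disp(learned);                   % 1 2 3 4 5 6
disp(policy');                   % 2 2 2 2 2 2 (always move right)
```

Each batch pushes non-zero qvalues one cell further from the goal, so the start cell only gets a useful value on the sixth batch. Stopping earlier leaves the cells farthest from the goal with no preference between actions.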
