python-2.7 - Is there a better way than this to implement Softmax action selection for reinforcement learning?

Question

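I am implementing a Softmax action selection policy for a reinforcement learning task ( http://www.incompleteideas.net/book/ebook/node17.html ).

I came up with this solution, but I think there is room for improvement.

1- Here I evaluate the probabilities: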

    from math import exp

    prob_t = [0.0] * nActions
    denominator = 0.0
    for a in range(nActions):
        denominator += exp(Q[state][a] / temperature)

    for a in range(nActions):
        prob_t[a] = exp(Q[state][a] / temperature) / denominator
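
One thing the loops above do not guard against is overflow: the exponential grows very quickly as the temperature shrinks, and `math.exp` raises `OverflowError` for arguments above roughly 709. A common workaround is to subtract the largest Q-value before exponentiating, which leaves the probabilities unchanged because the common factor cancels in the ratio. A minimal sketch, where `softmax_probs` and the sample values are hypothetical and chosen only for illustration:

```python
from math import exp

def softmax_probs(q_values, temperature):
    """Softmax (Boltzmann) probabilities for one state's Q-values."""
    # Shift by the maximum so the largest exponent is exactly 0;
    # the shift cancels out in the ratio below.
    q_max = max(q_values)
    numerators = [exp((q - q_max) / float(temperature)) for q in q_values]
    denominator = sum(numerators)
    return [n / denominator for n in numerators]

# Hypothetical Q-values: the unshifted form would overflow here,
# because exp(1000 / 0.5) is far beyond the float range.
probs = softmax_probs([1000.0, 999.0, 998.0], temperature=0.5)
```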

2- Here I compare a randomly generated number in the range [0, 1) with the probability values of the actions:

    rand_action = random.random()
    if rand_action < prob_t[0]:
        action = 0
    elif rand_action < prob_t[0] + prob_t[1]:
        action = 1
    else:  # rand_action >= prob_t[0] + prob_t[1]
        action = 2

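Edit:

Example: rand_action is 0.78, prob_t[0] is 0.25, prob_t[1] is 0.35, and prob_t[2] is 0.4, so the probabilities sum to 1. Since 0.78 is greater than the sum of the probabilities of actions 0 and 1 (prob_t[0] + prob_t[1] = 0.6), action 2 is selected.

Is there a more efficient way to do this?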

score 1 · Accepted Answer

After evaluating the probability of each action, if you have a function that returns a weighted random choice, you can get the desired action like this:

action = weighted_choice(prob_t)

I'm not sure whether this is what you would call a "better way", though.

weighted_choice could look something like this:

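import random

def weighted_choice(weights):
    """Return an index picked with probability proportional to its weight."""
    # Cumulative sums of the weights
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    # Pick a point in [0, running_total) and find the interval it falls in
    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i
    # Guard against floating-point rounding at the upper edge
    return len(totals) - 1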

If you have many available actions, be sure to look at the binary search implementation in the article instead of the linear search above.
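
For reference, a binary search over the cumulative totals can be sketched with the standard library's `bisect` module. This is not the implementation from the article, only an illustration of the same idea; `weighted_choice_bisect` and its optional `rnd` parameter (added so the draw can be fixed for testing) are hypothetical names.

```python
import bisect
import random

def weighted_choice_bisect(weights, rnd=None):
    """Return an index chosen with probability proportional to its weight."""
    # Build the cumulative totals,
    # e.g. [0.25, 0.35, 0.4] -> roughly [0.25, 0.6, 1.0].
    totals = []
    running_total = 0.0
    for w in weights:
        running_total += w
        totals.append(running_total)

    # Draw a point in [0, running_total) unless one is supplied.
    if rnd is None:
        rnd = random.random() * running_total

    # bisect_right finds the first cumulative total strictly greater
    # than rnd in O(log n); min() guards against rounding at the upper edge.
    return min(bisect.bisect_right(totals, rnd), len(totals) - 1)
```

With the numbers from the question's example, `weighted_choice_bisect([0.25, 0.35, 0.4], rnd=0.78)` returns 2. Building the totals is still linear, so the binary search only pays off when the same totals are reused for many draws.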

Alternatively, if you have access to numpy:

import numpy as np
def weighted_choice(weights):
    totals = np.cumsum(weights)
    norm = totals[-1]
    throw = np.random.rand()*norm
    return np.searchsorted(totals, throw)
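
NumPy also provides a sampler that accepts a probability vector directly, `np.random.choice`, so the helper can be skipped when the weights already sum to 1. A minimal sketch, assuming `prob_t` holds normalised softmax probabilities (the values here are taken from the question's example):

```python
import numpy as np

prob_t = np.array([0.25, 0.35, 0.4])   # example probabilities, must sum to 1

# Draw one action index from {0, 1, 2} according to prob_t.
action = np.random.choice(len(prob_t), p=prob_t)

# Drawing many samples at once shows the empirical frequencies
# approaching prob_t.
samples = np.random.choice(len(prob_t), size=10000, p=prob_t)
frequencies = np.bincount(samples, minlength=len(prob_t)) / 10000.0
```

Unlike `weighted_choice`, this call does not normalise the weights for you: it raises `ValueError` if `p` does not sum to 1.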

score 0 · Accepted Answer

After the suggestion to use numpy, I did a little research and came up with this solution for the first part of the softmax implementation:

prob_t = [0,0,0]       #initialise
for a in range(nActions):
    prob_t[a] = np.exp(Q[state][a]/temperature)  #calculate numerators

#numpy matrix element-wise division for denominator (sum of numerators)
prob_t = np.true_divide(prob_t,sum(prob_t))

It has fewer for loops than my first solution. The only downside I can see is reduced precision.

Using numpy:

[ 0.02645082  0.02645082  0.94709836]

The initial two-loop solution:

[0.02645082063629476, 0.02645082063629476, 0.9470983587274104]
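
The two outputs above appear to differ only in how they are printed, not in the stored values: NumPy's default print options round array elements to 8 digits, while the array itself holds float64 values, the same precision as a Python float. A minimal sketch to check this, which also drops the remaining for loop; the Q-values and temperature are hypothetical and are not the ones that produced the output above:

```python
from math import exp

import numpy as np

q_values = np.array([1.0, 1.0, 2.0])   # hypothetical Q[state]
temperature = 0.5                       # hypothetical temperature

# Vectorised softmax: np.exp acts on the whole array at once.
numerators = np.exp(q_values / temperature)
prob_t = numerators / numerators.sum()

print(prob_t)            # rounded for display only
print(prob_t.tolist())   # full float64 values as plain Python floats

# Same computation with the two-loop approach from the question.
denominator = sum(exp(q / temperature) for q in q_values)
loop_probs = [exp(q / temperature) / denominator for q in q_values]
```

If the precision really were lower, `prob_t.tolist()` and `loop_probs` would disagree beyond the last digit or so; `np.allclose(prob_t, loop_probs)` is one way to compare them.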
