
I am trying to train an agent on Gradius with gym-retro and keras-rl's DQNAgent, but it is not working: the reward does not increase, and the loss keeps growing. I cannot figure out what is wrong.

Part of the output is shown below:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 30, 28)        8224      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 15, 64)        28736     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 16, 15, 64)        36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 15360)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3932416   
_________________________________________________________________
dense_2 (Dense)              (None, 36)                9252      
=================================================================
Total params: 4,015,556
Trainable params: 4,015,556
Non-trainable params: 0
_________________________________________________________________
None
Training for 1500000 steps ...
    2339/1500000: episode: 1, duration: 47.685s, episode steps: 2339, steps per second: 49, episode reward: 2500.000, mean reward: 1.069 [0.000, 100.000], mean action: 19.122 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 36.018083, mean_absolute_error: 11.380395, mean_q: 18.252860
    3936/1500000: episode: 2, duration: 51.391s, episode steps: 1597, steps per second: 31, episode reward: 1800.000, mean reward: 1.127 [0.000, 100.000], mean action: 19.312 [0.000, 35.000], mean observation: 0.027 [0.000, 0.980], loss: 64.386497, mean_absolute_error: 54.420486, mean_q: 68.424599
    6253/1500000: episode: 3, duration: 75.020s, episode steps: 2317, steps per second: 31, episode reward: 3500.000, mean reward: 1.511 [0.000, 100.000], mean action: 16.931 [0.000, 35.000], mean observation: 0.029 [0.000, 0.980], loss: 177.966461, mean_absolute_error: 153.478119, mean_q: 177.061630




#(snip)





 1493035/1500000: episode: 525, duration: 95.634s, episode steps: 2823, steps per second: 30, episode reward: 5100.000, mean reward: 1.807 [0.000, 500.000], mean action: 19.664 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 26501204410368.000000, mean_absolute_error: 86211024.000000, mean_q: 90254256.000000
 1495350/1500000: episode: 526, duration: 78.401s, episode steps: 2315, steps per second: 30, episode reward: 2500.000, mean reward: 1.080 [0.000, 100.000], mean action: 18.652 [0.000, 34.000], mean observation: 0.029 [0.000, 0.980], loss: 23247718449152.000000, mean_absolute_error: 84441184.000000, mean_q: 88424568.000000
 1497839/1500000: episode: 527, duration: 84.667s, episode steps: 2489, steps per second: 29, episode reward: 3700.000, mean reward: 1.487 [0.000, 500.000], mean action: 21.676 [0.000, 35.000], mean observation: 0.034 [0.000, 0.980], loss: 23432217493504.000000, mean_absolute_error: 80286264.000000, mean_q: 83946064.000000
done, took 49517.509 seconds
end!

The program runs on my university's server, which I connect to over SSH.

The output of pip freeze is as follows:

absl-py==0.7.1
alembic==1.0.10
asn1crypto==0.24.0
astor==0.8.0
async-generator==1.10
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
certifi==2019.3.9
certipy==0.1.3
cffi==1.12.3
chardet==3.0.4
cloudpickle==1.2.1
cryptography==2.6.1
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
EasyProcess==0.2.7
entrypoints==0.3
future==0.17.1
gast==0.2.2
google-pasta==0.1.7
grpcio==1.21.1
gym==0.13.0
gym-retro==0.7.0
h5py==2.9.0
idna==2.8
ipykernel==5.1.0
ipython==7.5.0
ipython-genutils==0.2.0
jedi==0.13.3
Jinja2==2.10.1
jsonschema==3.0.1
jupyter-client==5.2.4
jupyter-core==4.4.0
jupyterhub==1.0.0
jupyterhub-ldapauthenticator==1.2.2
jupyterlab==0.35.6
jupyterlab-server==0.2.0
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
ldap3==2.6
Mako==1.0.10
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.0.3
mistune==0.8.4
nbconvert==5.5.0
nbformat==4.4.0
notebook==5.7.8
numpy==1.16.4
oauthlib==3.0.1
pamela==1.0.0
pandocfilters==1.4.2
parso==0.4.0
pexpect==4.7.0
pickleshare==0.7.5
pipenv==2018.11.26
prometheus-client==0.6.0
prompt-toolkit==2.0.9
protobuf==3.8.0
ptyprocess==0.6.0
pyasn1==0.4.5
pycparser==2.19
pycurl==7.43.0
pyglet==1.3.2
Pygments==2.4.0
pygobject==3.20.0
pyOpenSSL==19.0.0
pyparsing==2.4.0
pyrsistent==0.15.2
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.8.0
python-editor==1.0.4
PyVirtualDisplay==0.2.4
PyYAML==5.1.1
pyzmq==18.0.1
requests==2.21.0
scipy==1.3.0
Send2Trash==1.5.0
six==1.12.0
SQLAlchemy==1.3.3
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
tornado==6.0.2
traitlets==4.3.2
unattended-upgrades==0.1
urllib3==1.24.3
virtualenv==16.5.0
virtualenv-clone==0.5.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.4
wrapt==1.11.2
xvfbwrapper==0.2.9

I suspect there is something wrong with the first conv2d layer, probably related to the window_length of SequentialMemory: I think the first conv2d layer does not receive or convolve the input correctly. That is why I reordered the batch axes in process_state_batch of the CustomProcessor class, but it did not solve the problem.
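For concreteness, here are the shapes involved (a minimal standalone sketch, not part of the script; the batch size 32 is only an illustration):

import numpy as np

# What process_state_batch receives from SequentialMemory(window_length=4):
# grayscale frames stacked along axis 1.
dummy_batch = np.zeros((32, 4, 120, 112))  # (batch, window, height, width)

# With data_format='channels_first', the window axis sits where the channel
# axis is expected, so this already matches input_shape=(4, 120, 112).
assert dummy_batch.shape[1:] == (4, 120, 112)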

Everything I wrote is below:

# import everything I need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

# set the resize target; PIL's Image.resize takes (width, height)
win_size = (112,120)

#set log file
fo = open('log.txt', 'w')
sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        # resize, convert RGB to grayscale, and scale pixels to [0, 1]
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        return np.array(img) / 255


    #def process_state_batch(self, batch):
        #batch = batch.transpose(0,2,3,1)
        #print(batch.shape)
        #return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        self.actions = []
        for action in self._actions:
            # validate each combination (the return value is unused)
            env.get_action_meaning(action)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same",
    input_shape=(4,120,112), kernel_initializer=normal,
    activation="relu", data_format='channels_first'))
print("chack")
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same",
    kernel_initializer=normal,
    activation="relu"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(256, kernel_initializer=normal,
                         activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=1000,
               target_model_update=1e-2, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-3), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=1500000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()

PS:

I tried these fixes: 1. adding max-pooling layers and more dense layers; 2. using gradient clipping; 3. lowering Adam's lr (learning rate). It still does not work. The code is below.

# import everything I need

import retro
import keras as k
import numpy as np
import rl
import rl.memory
import rl.policy
import rl.agents.dqn
import rl.core
import sys
import gym
from PIL import Image

import tensorflow as tf
from keras.backend import tensorflow_backend

config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
session = tf.Session(config=config)
tensorflow_backend.set_session(session)

# set the resize target; PIL's Image.resize takes (width, height)
win_size = (224,240)

#set log file
#fo = open('log.txt', 'w')
#sys.stdout = fo

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

"""
keras add extra dimension for batch.
and add history dimention for SequentialMemory.
but conv2d isn't able to accept 5D input.
so i'd made my processor class.
CustomProcessor convert RGB input into Gray input.

and conv2d layer convolute data which's shape (win_len, win_hei, history).
so transpose batch.

for more information, ref url bellow.
https://github.com/keras-rl/keras-rl/issues/229
"""

class CustomProcessor(rl.core.Processor):

    def process_observation(self, observation):
        # resize, convert RGB to grayscale, and scale pixels to [0, 1]
        img = Image.fromarray(observation)
        img = img.resize(win_size).convert('L')
        return np.array(img) / 255


    #def process_state_batch(self, batch):
        #batch = batch.transpose(0,2,3,1)
        #print(batch.shape)
        #return batch


myprocessor = CustomProcessor()

"""
Gradius have action space which can take 9 action in same moment.
so i gotta discrete action space.
the way i'd taken is wrapping env class.
"""

class Discretizer(gym.ActionWrapper):

    def __init__(self, env):
        super(Discretizer, self).__init__(env)
        self._actions = [[0,0,0,0,0,0,0,0,0],
               [0,0,0,0,0,0,0,0,1],
               [1,0,0,0,0,0,0,0,0],
               [1,0,0,0,0,0,0,0,1],
               [0,0,0,0,1,0,0,0,0],
               [0,0,0,0,0,1,0,0,0],
               [0,0,0,0,0,0,1,0,0],
               [0,0,0,0,0,0,0,1,0],
               [0,0,0,0,1,0,1,0,0],
               [0,0,0,0,1,0,0,1,0],
               [0,0,0,0,0,1,0,1,0],
               [0,0,0,0,0,1,1,0,0],]
        for i in range(8):
            self._actions.append((np.array(self._actions[1]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[2]) + np.array(self._actions[i + 4])).tolist())
        for i in range(8):
            self._actions.append((np.array(self._actions[3]) + np.array(self._actions[i + 4])).tolist())
        self.actions = []
        for action in self._actions:
            # validate each combination (the return value is unused)
            env.get_action_meaning(action)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        return self._actions[a].copy()

env = retro.make(game="Gradius-Nes", record="./Record")
env = Discretizer(env)

nb_actions = env.action_space.n

normal = k.initializers.glorot_normal()
model = k.Sequential()
win_len = 4
model.add(k.layers.Conv2D(
    32, kernel_size=8, strides=4, padding="same",activation="relu", 
    input_shape=(win_len,240,224), kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=4, strides=2, padding="same",activation="relu", 
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Conv2D(
    64, kernel_size=3, strides=1, padding="same",activation="relu", 
    kernel_initializer=normal, data_format="channels_first"))
model.add(k.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='same', data_format="channels_first"))
model.add(k.layers.Flatten())
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(1024, kernel_initializer=normal, activation="relu"))
model.add(k.layers.Dense(nb_actions,
                         kernel_initializer=normal,
                         activation="linear"))

memory = rl.memory.SequentialMemory(limit=50000, window_length=win_len)
policy = rl.policy.EpsGreedyQPolicy()

"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
"""
dqn = rl.agents.DQNAgent(processor=myprocessor, model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=50000,
               target_model_update=1e-6, policy=policy)

dqn.compile(k.optimizers.Adam(lr=1e-7, clipnorm=1.), metrics=['mae'])
print(model.summary())
hist = dqn.fit(env, nb_steps=750000, visualize=False, verbose=2)
print("end!")
dqn.save_weights("test_model.h5f", overwrite=True)

env.close()
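For reference, one stabilizer from the original DQN recipe that I have not tried yet is reward clipping combined with TD-error clipping. A minimal sketch of what that might look like (process_reward and delta_clip are existing keras-rl hooks; the class name ClippedProcessor and the specific numbers are only illustrative):

class ClippedProcessor(CustomProcessor):
    def process_reward(self, reward):
        # squash the raw score deltas (up to 500 per step in the logs above)
        # into [-1, 1], as in the original DQN paper
        return np.clip(reward, -1.0, 1.0)

dqn = rl.agents.DQNAgent(
    processor=ClippedProcessor(), model=model, nb_actions=nb_actions,
    memory=memory, nb_steps_warmup=1000,
    target_model_update=10000,  # hard copy every 10000 steps
    delta_clip=1.0,             # Huber-style clipping of the TD error
    policy=policy)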