python - Why does the cross-entropy loss function become huge when using a network with many ReLUs?

Question

I have this loss function:

            loss_main = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(train_logits, train['labels']),
                name='loss_main',
            )
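
For reference, here is a minimal NumPy sketch of what a softmax cross-entropy over logits computes, assuming one-hot labels. The helper name softmax_cross_entropy and the toy values are hypothetical, and this is not TensorFlow's implementation. It also shows that the per-example loss is not bounded: it grows with the gap between the largest logit and the logit of the correct class.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between softmax(logits) and one-hot labels."""
    # Subtracting the row-wise max leaves the softmax unchanged but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

# Hypothetical toy batch: the second row has one very large logit on the wrong class.
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
loss_main = per_example.mean()  # mirrors the tf.reduce_mean over the batch
print(per_example, loss_main)
```

In this toy batch the first example contributes a loss below 1, while the second contributes a loss on the order of its largest logit.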

train_logits is defined from a pipeline built as follows:

    def build_logit_pipeline(data, include_dropout):
        # X --> *W1 --> +b1 --> relu --> *W2 --> +b2 ... --> softmax etc...
        pipeline = data

        for i in xrange(len(layer_sizes) - 1):
            last = i == len(layer_sizes) - 2
            with tf.name_scope("linear%d" % i):
                pipeline = tf.matmul(pipeline, weights[i])
                pipeline = tf.add(pipeline, biases[i])

            if not last:
                # insert relu after every one before the last
                with tf.name_scope("relu%d" % i):
                    pipeline = getattr(tf.nn, arg('act-func'))(pipeline)
                    if include_dropout and not arg('no-dropout'):
                        pipeline = tf.nn.dropout(pipeline, 0.5, name='dropout')

        return pipeline

layer_sizes, weights, and biases are constructed as follows:

    def make_weight(from_, to, name=None):
        return tf.Variable(tf.truncated_normal([from_, to], stddev=0.5), name=name)

    def make_bias(to, name=None):
        return tf.Variable(tf.truncated_normal([to], stddev=0.5), name=name)

    layer_sizes = [dataset.image_size**2] + arg('layers') + [dataset.num_classes]
    with tf.name_scope("parameters"):
        with tf.name_scope("weights"):
            weights = [make_weight(layer_sizes[i], layer_sizes[i+1], name="weights_%d" % i)
                       for i in xrange(len(layer_sizes) - 1)]

        with tf.name_scope("biases"):
            biases = [make_bias(layer_sizes[i + 1], name="biases_%d" % i)
                      for i in xrange(len(layer_sizes) - 1)]

When arg('act-func') is relu and I build a long chain of ReLUs (for example, with arg('layers') being [750, 750, 750, 750, 750, 750]), the loss function is huge:

    Global step: 0
    Batch loss function: 28593700.000000

With a shorter chain of ReLUs (that is, when arg('layers') is only [750]), the loss function is much smaller:

    Global step: 0
    Batch loss function: 96.377831

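My question is: why are the loss values so dramatically different? As I understand it, the output logits are softmaxed into a probability distribution, and the cross-entropy is then computed between that distribution and the one-hot labels. Why would changing the number of ReLUs change this function? I would expect every network to be about equally wrong at the start, roughly random, so the loss should never grow this large.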
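
One way to test that expectation is to look at how large the logits are at initialization, before any training. The following is a rough NumPy sketch, not the original TensorFlow pipeline. It assumes 784 inputs and 10 classes (the real dataset.image_size and dataset.num_classes are not given here), uses standard-normal random inputs in place of the real data, and uses a plain normal with stddev 0.5 in place of tf.truncated_normal. The helper name max_abs_logit is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_logit(hidden_layers, n_inputs=784, n_classes=10, batch=64):
    """Forward random inputs through an untrained ReLU net; return the largest |logit|."""
    sizes = [n_inputs] + hidden_layers + [n_classes]
    x = rng.normal(size=(batch, n_inputs))  # assumption: stand-in for real images
    for i in range(len(sizes) - 1):
        # Plain normal with stddev 0.5 approximates truncated_normal(stddev=0.5).
        w = rng.normal(scale=0.5, size=(sizes[i], sizes[i + 1]))
        b = rng.normal(scale=0.5, size=sizes[i + 1])
        x = x @ w + b
        if i < len(sizes) - 2:
            x = np.maximum(x, 0.0)  # ReLU after every layer except the last
    return np.abs(x).max()

shallow = max_abs_logit([750])
deep = max_abs_logit([750] * 6)
print("largest |logit|, 1 hidden layer: ", shallow)
print("largest |logit|, 6 hidden layers:", deep)
```

Under these assumptions the logits of the six-layer network come out several orders of magnitude larger than those of the one-layer network, so the two networks are not equally wrong at step 0 in the sense the loss measures.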

Note that this loss function does not include an L2 loss, so the increased number of weights and biases would not account for this.

If I use tanh as arg('act-func') instead, this increase in loss does not occur; it stays about the same, as expected.
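
The tanh observation can be probed with the same kind of rough NumPy sketch, again assuming 784 inputs, 10 classes, random inputs, and a plain normal with stddev 0.5 instead of tf.truncated_normal. The helper name logit_std is hypothetical. Since tanh outputs always lie in [-1, 1], the last hidden layer has a similar scale regardless of depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_std(depth, activation, width=750, n_inputs=784, n_classes=10, batch=64):
    """Standard deviation of the logits of an untrained net with `depth` hidden layers."""
    sizes = [n_inputs] + [width] * depth + [n_classes]
    x = rng.normal(size=(batch, n_inputs))  # assumption: stand-in for real images
    for i in range(len(sizes) - 1):
        w = rng.normal(scale=0.5, size=(sizes[i], sizes[i + 1]))
        b = rng.normal(scale=0.5, size=sizes[i + 1])
        x = x @ w + b
        if i < len(sizes) - 2:
            x = activation(x)  # applied after every layer except the last
    return x.std()

tanh_shallow = logit_std(1, np.tanh)
tanh_deep = logit_std(6, np.tanh)
print("tanh logit std, 1 hidden layer: ", tanh_shallow)
print("tanh logit std, 6 hidden layers:", tanh_deep)
```

In this sketch the logit scale is roughly the same for one and six tanh layers, which matches the loss staying about the same.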
