```
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.0374929e-06 0.0059775524 0.980205...]
step0, training accuracy 0.04
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [9.2028862e-10 1.4812358e-05 0.044873074...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [648.49146]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.024463326 1.4828938e-31 0...]
step1, training accuracy 0.2
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [2.4634053e-11 3.3087209e-34 0...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [nan]
step2, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [nan nan nan...]
W tensorflow/core/common_runtime/executor.cc:1027] 0x7ff51d92a940 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
```
When the learning rate is $1e-4$, the program no longer reports an error:
```
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.00056920078 8.4922984e-09 0.00033719366...]
step0, training accuracy 0.14
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [7.0613837e-10 9.28294e-09 0.00016230672...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [439.95135]
step1, training accuracy 0.16
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.031509314 3.6221365e-05 0.015359053...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [3.7112056e-07 1.8543299e-09 8.9234991e-06...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [436.37653]
step2, training accuracy 0.12
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [0.015578311 0.0026688741 0.44736364...]
I tensorflow/core/kernels/logging_ops.cc:64] y_conv: [6.0428465e-07 0.0001744287 0.026451336...]
I tensorflow/core/kernels/logging_ops.cc:64] cross_entropy: [385.33765]
```
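As an aside, log lines tagged `logging_ops.cc`, like the ones above, appear to be the output of `tf.Print`, which passes a tensor through unchanged and prints its value each time the graph runs. A minimal sketch of that logging setup, assuming the TensorFlow 1.x API and with illustrative tensor names:

```python
import tensorflow as tf

# Minimal sketch (TensorFlow 1.x API): tf.Print is an identity op that
# prints the given tensors to stderr as a side effect of each graph run.
# The tensor names y_conv / cross_entropy mirror the log and are assumptions.
logits = tf.random_normal([50, 10])
y_conv = tf.nn.softmax(logits)
y_conv = tf.Print(y_conv, [y_conv], message="y_conv: ")

cross_entropy = -tf.reduce_sum(tf.log(y_conv + 1e-10))
cross_entropy = tf.Print(cross_entropy, [cross_entropy], message="cross_entropy: ")

with tf.Session() as sess:
    sess.run(cross_entropy)  # both tensors are printed to stderr
```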
In practice, once the gradients grow extremely large, the values overflow to NaN, which then propagates through the network and is easy to spot at runtime, as in the log above; this is known as the exploding gradient problem.
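Because the NaN only shows up in the logs after it has already corrupted the loss, it can help to make the graph fail fast at the first non-finite value. A minimal sketch, assuming the TensorFlow 1.x API and illustrative tensor names, using `tf.check_numerics`:

```python
import numpy as np
import tensorflow as tf

# Minimal sketch (TensorFlow 1.x API): fail fast with a clear error as soon
# as a NaN/Inf appears, instead of letting it propagate through training.
# The placeholder and tensor names are illustrative, not from the original code.
probs = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(probs * tf.log(probs))  # 0 * log(0) -> NaN
checked = tf.check_numerics(cross_entropy, "cross_entropy is NaN or Inf")

with tf.Session() as sess:
    try:
        sess.run(checked, feed_dict={probs: np.zeros((1, 10), np.float32)})
    except tf.errors.InvalidArgumentError as e:
        print("caught:", e.message)  # reports the message passed above
```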
Solutions
Reduce the learning rate appropriately.
Apply gradient clipping. This simple heuristic was first introduced by Thomas Mikolov to address exploding gradients: whenever the gradient norm reaches a certain threshold, the gradients are scaled back to a smaller value, as shown in Algorithm 1 below (a TensorFlow sketch follows the algorithm). See the Stanford CS 224D lecture notes for details.
Algorithm 1: gradient norm clipping
$g \leftarrow \frac{\partial E}{\partial W}$
if $\Vert g \Vert \ge threshold$ then
$\quad g \leftarrow \frac{threshold}{\Vert g \Vert}\, g$
end if
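This rule is available directly in TensorFlow as `tf.clip_by_global_norm`, which rescales the gradients by $threshold/\Vert g\Vert$ whenever $\Vert g\Vert$ exceeds the threshold. A minimal sketch, assuming the TensorFlow 1.x API and an illustrative toy loss, learning rate, and threshold:

```python
import tensorflow as tf

# Minimal sketch of Algorithm 1 (TensorFlow 1.x API). The toy loss,
# learning rate, and threshold are illustrative assumptions.
w = tf.Variable([10.0, -10.0])
loss = tf.reduce_sum(tf.square(w))  # gradient is 2w, so ||g|| is about 28.3

optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-2)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)

threshold = 5.0
# Rescales every gradient by threshold / ||g|| when ||g|| > threshold.
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=threshold)
train_step = optimizer.apply_gradients(zip(clipped, variables))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, norm = sess.run([train_step, global_norm])
    print("pre-clip gradient norm:", norm)  # exceeds threshold, clipping fires
```

Note that `tf.clip_by_global_norm` computes one norm over all gradients jointly, which preserves their relative direction; clipping each tensor separately with `tf.clip_by_norm` is the other common variant.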