Alleviation of Vanishing Gradient Problem Using Parametric Activation Functions

  • Received : 2021.03.24
  • Accepted : 2021.05.11
  • Published : 2021.10.31

Abstract

Deep neural networks are widely used to solve a variety of problems. However, networks with many hidden layers frequently suffer from vanishing or exploding gradients, which are major obstacles to training. In this paper, we propose a parametric activation function to alleviate the vanishing gradient problem that can be caused by nonlinear activation functions. The proposed parametric activation function is obtained by introducing parameters that adjust the scale and location of the activation function according to the characteristics of the input data; these parameters are trained through backpropagation to minimize the loss function, so the derivative of the activation function is not bounded by a fixed limit. On an XOR problem with 10 hidden layers and an MNIST classification problem with 8 hidden layers, we compared the original nonlinear activation functions with the proposed parametric activation functions and confirmed that the parametric activation functions are superior at alleviating the vanishing gradient.
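
For intuition, the sketch below shows one way such a parametric activation function can be realized. It is a minimal PyTorch example assuming a sigmoid with a learnable output-scale parameter (alpha) and a learnable location parameter (beta) that are updated by backpropagation together with the network weights; the parameter names, the exact functional form, and the hidden-layer width are illustrative assumptions, not the paper's exact formulation. The 10-hidden-layer XOR setup mirrors the experiment described in the abstract.

```python
# Minimal sketch (assumed form, not the paper's exact parameterization):
# a sigmoid whose output scale (alpha) and input location (beta) are
# trainable and updated by backpropagation along with the weights.
import torch
import torch.nn as nn

class ParametricSigmoid(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialized so that the activation starts as the plain sigmoid.
        self.alpha = nn.Parameter(torch.tensor(1.0))  # output scale
        self.beta = nn.Parameter(torch.tensor(0.0))   # location shift

    def forward(self, x):
        # f(x) = alpha * sigmoid(x - beta); df/dx = alpha * s * (1 - s),
        # so the derivative is no longer capped at 0.25 once alpha is learned.
        return self.alpha * torch.sigmoid(x - self.beta)

def make_xor_net(hidden=8, depth=10):
    # Deep MLP matching the XOR experiment described above (10 hidden layers);
    # the width of 8 units is an illustrative assumption.
    layers, in_dim = [], 2
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), ParametricSigmoid()]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))  # final layer outputs a logit
    return nn.Sequential(*layers)

if __name__ == "__main__":
    x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])
    net = make_xor_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()  # gradients also flow into alpha and beta
        opt.step()
    print(f"final loss: {loss.item():.4f}")
```

Because alpha and beta are themselves trained to minimize the loss, the slope of the activation can grow beyond the fixed 0.25 bound of the ordinary sigmoid, which is the mechanism by which a parametric activation function of this kind can resist vanishing gradients in deep stacks of layers.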
