http://dx.doi.org/10.3745/KTSDE.2021.10.10.407

Alleviation of Vanishing Gradient Problem Using Parametric Activation Functions  

Ko, Young Min (Department of Artificial Intelligence, Jeonju University)
Ko, Sun Woo (Department of Artificial Intelligence, Jeonju University)
Publication Information
KIPS Transactions on Software and Data Engineering / v.10, no.10, 2021, pp.407-420
Abstract
Deep neural networks are widely used to solve a variety of problems. However, networks with many hidden layers frequently suffer from vanishing or exploding gradients, which are major obstacles to training them. In this paper, we propose a parametric activation function to alleviate the vanishing gradient problem caused by nonlinear activation functions. The proposed parametric activation function is obtained by introducing parameters that adapt the scale and location of the activation function to the characteristics of the input data, so that the loss function can be minimized through backpropagation without being limited by the derivative of the activation function. On an XOR problem with 10 hidden layers and an MNIST classification problem with 8 hidden layers, the original nonlinear activation functions and the parametric activation functions were compared, and the proposed parametric activation functions were confirmed to be superior in alleviating the vanishing gradient.
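The abstract does not give the exact functional form of the proposed activation, so the following is a minimal sketch, assuming the parametric function wraps a sigmoid with a trainable scale and a trainable location parameter that are updated by backpropagation together with the weights; the names ParametricSigmoid, init_scale, and init_loc are illustrative, not from the paper.

import torch
import torch.nn as nn

class ParametricSigmoid(nn.Module):
    # sigma(scale * x + loc): 'scale' rescales and 'loc' shifts the input before
    # the sigmoid; both are learned by gradient descent along with the weights.
    def __init__(self, init_scale=1.0, init_loc=0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))  # trainable scale parameter
        self.loc = nn.Parameter(torch.tensor(init_loc))      # trainable location parameter

    def forward(self, x):
        return torch.sigmoid(self.scale * x + self.loc)

# Usage: substitute for nn.Sigmoid() in a deep MLP, e.g. in XOR- or MNIST-style setups.
block = nn.Sequential(nn.Linear(64, 64), ParametricSigmoid())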
Keywords
Deep Neural Network; Vanishing Gradient Problem; Parametric Activation Function; Backpropagation; Learning;
Citations & Related Records
  • Reference
1 F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, Vol.12, No.10, pp.2451-2471, 2000.
2 J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv:1412.3555, 2014.
3 K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," arXiv:1502.01852, 2015.
4 G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, Vol.18, No.7, pp.1527-1554, 2006.
5 J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," The Journal of Machine Learning Research, Vol.12, No.61, pp.2121-2159, 2011.
6 M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv:1212.5701, 2012.
7 D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
8 S. Kong and M. Takatsuka, "Hexpo: A vanishing-proof activation function," International Joint Conference on Neural Networks, pp.2562-2567, 2017.
9 X. Wang, Y. Qin, Y. Wang, S. Xiang, and H. Chen, "ReLTanh: An activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis," Neurocomputing, Vol.363, pp.88-98, 2019.
10 R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," arXiv:1211.5063, 2012.
11 R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," arXiv:1211.5063, 2013.
12 B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," arXiv:1505.00853, 2015.
13 S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," IEEE, 2001.
14 Y. Bengio, I. Goodfellow, and A. Courville, "Deep learning," MIT Press, 2017.
15 H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, "Visualizing the loss landscape of neural nets," arXiv:1712.09913, 2018.
16 S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Diploma Thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
17 X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Artificial Intelligence and Statistics, Vol.9, 2010.
18 V. Nair and G. Hinton, "Rectified linear units improve restricted Boltzmann machines," International Conference on Machine Learning, pp.807-814, 2010.
19 S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, Vol.9, No.8, pp.1735-1780, 1997.
20 Y. Qin, X. Wang, and J. Zou, "The optimized deep belief networks with improved logistic sigmoid units and their application in fault diagnosis for planetary gearboxes of wind turbines," IEEE Transactions on Industrial Electronics, Vol.66, No.5, pp.3814-3824, 2018.
21 S. Basodi, C. Ji, H. Zhang, and Y. Pan, "Gradient amplification: An efficient way to train deep neural networks," Big Data Mining and Analytics, Vol.3, No.3, pp.196-207, 2020.
22 N. Y. Kong, Y. M. Ko, and S. W. Ko, "Performance improvement method of convolutional neural network using agile activation function," KIPS Transactions on Software and Data Engineering, Vol.9, No.7, pp.213-220, 2020.