http://dx.doi.org/10.5762/KAIS.2020.21.2.189

Comparison of Gradient Descent for Deep Learning  

Kang, Min-Jae (Department of Electronic Engineering, Jeju National University)
Publication Information
Journal of the Korea Academia-Industrial cooperation Society, vol. 21, no. 2, 2020, pp. 189-194
Abstract
This paper analyzes the gradient descent method, the method most widely used for training neural networks. Learning means updating the parameters so that the loss function, which quantifies the difference between actual and predicted values, reaches its minimum. The gradient descent method uses the slope of the loss function to update the parameters in the direction that reduces the error, and it underlies the optimizers provided by current deep learning libraries. However, these libraries expose the algorithms as black boxes, which makes it difficult to identify the advantages and disadvantages of the various gradient descent methods. This paper analyzes the characteristics of four methods in current use: stochastic gradient descent, the momentum method, AdaGrad, and Adadelta. The experiments use the Modified National Institute of Standards and Technology (MNIST) data set, which is widely used to verify neural networks. The network has two hidden layers: the first with 500 neurons and the second with 300. The activation function of the output layer is the softmax function, and the rectified linear unit (ReLU) function is used for the remaining input and hidden layers. Cross-entropy error is used as the loss function.
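For reference, the following is a minimal sketch of the parameter-update rules of the four methods compared in the paper (stochastic gradient descent, momentum, AdaGrad, and Adadelta). The function names, hyperparameter values, and the use of NumPy are illustrative assumptions and are not taken from the paper.

import numpy as np

def sgd_update(param, grad, lr=0.01):
    # Vanilla (stochastic) gradient descent: step against the gradient.
    return param - lr * grad

def momentum_update(param, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially decaying average of past gradients.
    velocity = beta * velocity - lr * grad
    return param + velocity, velocity

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    # AdaGrad: divide the step by the root of the accumulated squared gradients,
    # so parameters with a history of large gradients receive smaller steps.
    accum = accum + grad ** 2
    return param - lr * grad / (np.sqrt(accum) + eps), accum

def adadelta_update(param, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    # Adadelta: replace the global learning rate with a running average of the
    # magnitudes of past updates, so the step size adapts per parameter.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return param + update, avg_sq_grad, avg_sq_update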
Keywords
Gradient Descent Method; Deep Learning; Neural Networks; MNIST; Softmax; Cross-Entropy;
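As a usage example, the experimental setup described in the abstract (two hidden layers of 500 and 300 neurons, ReLU activations, a softmax output layer, cross-entropy loss, trained on MNIST) could be reproduced roughly as below. The choice of the Keras framework and the training hyperparameters (epochs, batch size) are assumptions; the paper does not state which framework or settings were used.

import tensorflow as tf

# Load MNIST and flatten the 28x28 images into 784-dimensional vectors in [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Two hidden layers (500 and 300 neurons, ReLU) and a softmax output over 10 classes.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Swap the optimizer string ("sgd", "adagrad", "adadelta", or an SGD instance
# with momentum) to mirror the kind of comparison the paper describes.
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))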