Area-wise relational knowledge distillation

Sungchul Cho; Sangje Park; Changwon Lim

doi:10.29220/CSAM.2023.30.5.501

Communications for Statistical Applications and Methods

Volume 30, Issue 5, Pages 501-516, 2023. pISSN 2287-7843, eISSN 2383-4757

The Korean Statistical Society (한국통계학회)

Sungchul Cho (Department of Applied Statistics, Chung-Ang University) ;
Sangje Park (Department of Applied Statistics, Chung-Ang University) ;
Changwon Lim (Department of Applied Statistics, Chung-Ang University)

Received : 2023.03.22
Accepted : 2023.04.10
Published : 2023.09.30

Abstract

Knowledge distillation (KD) extracts knowledge from a large, complex model (the teacher) and transfers it to a relatively small model (the student). Typically, the teacher model is trained first to obtain the activation values of its hidden or output layers, and the student model is then trained on the same training data together with the obtained values. Recently, relational KD (RKD) was proposed to extract knowledge about the relative differences among training examples, and it improved the performance of the student model compared with conventional KD methods. In this paper, we propose a new loss function for RKD. The proposed loss is defined using the area difference between the teacher model and the student model in a specific hidden layer. We show that it can be used both to compress a model successfully and to improve a model's generalization performance. In a model compression study on audio data, the model trained with the proposed method achieves up to 1.8% higher accuracy than the existing method. In a model generalization study on image data, introducing the RKD method into self-KD improves accuracy by up to 0.5%.
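To make the two baseline ideas in the abstract concrete, the following is a minimal sketch in plain Python of (a) a conventional soft-target KD loss on output logits and (b) a distance-wise relational loss on hidden-layer embeddings, in the spirit of earlier RKD work. It does not reproduce the area-based loss proposed in the paper, since the abstract does not define it. The function names, the temperature value, and the Huber threshold are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2 (assumed T)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def normalized_pairwise_distances(embeddings):
    """Euclidean distance for every pair i < j, divided by the mean distance."""
    n = len(embeddings)
    dists = [math.dist(embeddings[i], embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists] if mean > 0 else dists

def huber(x, y, delta=1.0):
    """Huber penalty: quadratic for small gaps, linear for large ones."""
    gap = abs(x - y)
    return 0.5 * gap ** 2 if gap <= delta else delta * (gap - 0.5 * delta)

def distance_rkd_loss(teacher_emb, student_emb):
    """Mean Huber penalty between teacher and student relative distances."""
    t = normalized_pairwise_distances(teacher_emb)
    s = normalized_pairwise_distances(student_emb)
    return sum(huber(a, b) for a, b in zip(t, s)) / len(t)

if __name__ == "__main__":
    # Toy batch of three examples with 2-D hidden-layer embeddings.
    teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    student = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.5]]
    print(soft_target_kd_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
    print(distance_rkd_loss(teacher, student))
```

Because the relational loss compares normalized pairwise distances rather than raw activations, the student is free to use an embedding of a different scale from the teacher's, as long as the relative geometry of the batch is preserved.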

Keywords

Acknowledgement

This research was supported by the Chung-Ang University research grant in 2020. It was also supported by the Next-Generation Information Computing Development Program through the National Research Foundation (NRF) of Korea and by an NRF grant funded by the Ministry of Science and ICT (NRF-2017M3C4A7083281, NRF-2021R1F1A1056516).

