http://dx.doi.org/10.9708/jksci.2022.27.10.029

Text Classification Using Heterogeneous Knowledge Distillation  

Yu, Yerin (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (Graduate School of Business IT, Kookmin University)
Abstract
Recently, with the development of deep learning technology, a variety of large models with excellent performance have been devised by pre-training on massive amounts of text data. However, for such models to be deployed in real-world services, inference must be fast and the computational cost low, so model compression techniques are attracting attention. Knowledge distillation, a representative model compression technique, transfers the knowledge already learned by a large teacher model to a relatively small student model and can therefore be applied in a variety of ways. However, conventional knowledge distillation has a limitation: because the teacher model learns only the knowledge needed to solve the given task and distills it to the student model from that same point of view, the student struggles with problems that have low similarity to the previously learned data. Therefore, we propose a heterogeneous knowledge distillation method in which the teacher model learns a higher-level concept, rather than the knowledge required for the task the student model must solve, and then distills this knowledge to the student model. Through classification experiments on about 18,000 documents, we confirmed that the heterogeneous knowledge distillation method outperformed traditional knowledge distillation in both learning efficiency and accuracy.
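The proposed method builds on conventional (homogeneous) knowledge distillation, in which a student is trained against both the ground-truth labels and the teacher's temperature-softened output distribution. The following Python/PyTorch sketch of that standard distillation loss (Hinton et al., 2015) is illustrative only and is not the authors' implementation; the function and parameter names (distillation_loss, temperature, alpha) are assumptions for illustration.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Illustrative sketch of the standard (non-heterogeneous) KD loss:
    # a weighted sum of hard-label cross-entropy and the KL divergence
    # between temperature-softened teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradient magnitude stays
    # comparable to the hard-label term, as in Hinton et al. (2015).
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

In this baseline, the teacher and student share the same label space. The heterogeneous approach summarized in the abstract instead trains the teacher on a coarser, higher-level concept before distillation; how the teacher's and student's label spaces are related in that setting is not detailed on this page.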
Keywords
Deep Learning; Knowledge Distillation; Text Classification; Model Compression