http://dx.doi.org/10.3837/tiis.2022.12.004

Malaysian Name-based Ethnicity Classification using LSTM  

Hur, Youngbum (Department of Industrial Engineering, Inha University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.16, no.12, 2022, pp. 3855-3867
Abstract
Name separation (splitting full names into surnames and given names) is not a trivial task in a multiethnic country because the procedure for splitting surnames and given names is ethnicity-specific. Malaysia has multiple main ethnic groups; therefore, separating Malaysian full names into surnames and given names is challenging. In this study, we develop a two-phase framework for Malaysian name separation using deep learning. In the first phase, we predict the ethnicity of each full name using a proposed recurrent neural network model based on long short-term memory (LSTM) with character embeddings. In the second phase, we use a rule-based algorithm to split full names into surnames and given names according to the predicted ethnicity. We evaluate the proposed model against various machine learning models and demonstrate that it outperforms them by an average of 9%. Moreover, transfer learning and fine-tuning of the proposed model with an additional dataset result in an improvement of up to 7% on average.
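The two-phase flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ethnicity labels ("Malay", "Chinese", "Indian"), the patronymic markers, the splitting rules, and all function names are hypothetical, and the phase-one classifier is passed in as a plain callable rather than a trained LSTM. The `encode_chars` helper only shows how a name might be turned into character indices for an embedding layer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical patronymic markers, for illustration only;
# the paper's actual rule set is not given in the abstract.
MALAY_MARKERS = {"bin", "binti"}
INDIAN_MARKERS = {"a/l", "a/p"}


def encode_chars(name: str, vocab: Dict[str, int], max_len: int = 40) -> List[int]:
    """Map a name to a fixed-length list of character indices.

    Index 0 is reserved for padding and 1 for unknown characters (assumed convention).
    """
    ids = [vocab.get(ch, 1) for ch in name.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def _split_on_marker(tokens: List[str], markers: set) -> Tuple[str, str]:
    """Split at the first patronymic marker: (part after marker, part before marker)."""
    for i, tok in enumerate(tokens):
        if tok.lower() in markers and 0 < i < len(tokens) - 1:
            return " ".join(tokens[i + 1:]), " ".join(tokens[:i])
    # No usable marker found: treat the whole name as the given name.
    return "", " ".join(tokens)


def split_name(full_name: str, ethnicity: str) -> Tuple[str, str]:
    """Phase 2 (assumed rules): return (surname_or_patronymic, given_name)."""
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if ethnicity == "Chinese":
        # Assumed rule: the family name comes first.
        return tokens[0], " ".join(tokens[1:])
    if ethnicity == "Malay":
        return _split_on_marker(tokens, MALAY_MARKERS)
    if ethnicity == "Indian":
        return _split_on_marker(tokens, INDIAN_MARKERS)
    # Unknown label: leave the name unsplit.
    return "", " ".join(tokens)


def separate(full_name: str, predict_ethnicity: Callable[[str], str]) -> Tuple[str, str]:
    """Phase 1 then phase 2: predict the ethnicity, then apply its splitting rule."""
    return split_name(full_name, predict_ethnicity(full_name))
```

In use, `predict_ethnicity` would be a wrapper around the trained character-level classifier; for a quick check, a stub such as `lambda name: "Chinese"` is enough to exercise the rule-based phase. Keeping the classifier behind a callable means the splitting rules can be tested independently of any model.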
Keywords
Deep Learning; Recurrent Neural Network; LSTM; Machine Learning; Ethnicity Classification; Malaysian Name Separation; Deep Learning-based Name Separation