A Study on Sex Classification of a Name using Naive Bayesian

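  • 임명재 (Department of Medical IT Marketing, Eulji University) ;
  • 정진표 (Department of Medical IT Marketing, Eulji University) ;
  • 김명관 (Department of Medical IT Marketing, Eulji University)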
  • Received : 2013.11.08
  • Accepted : 2013.12.13
  • Published : 2013.12.31

Abstract

This paper implements a system that uses a Naive Bayes classifier to infer the sex associated with a personal name. Unlike foreign names, Korean names often give no reliable cue to the gender of the pronoun used to refer to the person. Even so, Korean names have characteristic patterns, and the system uses them to separate names commonly given to men from names commonly given to women. Because the data also include names whose sex is ambiguous, such as proper nouns, accuracy is somewhat reduced. In the experiments, accuracy was 84% for Korean male names and 88% for Korean female names, or 86% overall. For foreign names, accuracy was 80% for male names and 84% for female names, or 83% overall.
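
As an illustration of the general technique named in the abstract, the following is a minimal sketch of a Naive Bayes name classifier with add-one (Laplace) smoothing. It is not the authors' implementation: the feature set (last one and last two characters of the name), the `NaiveBayes` class, the `accuracy_by_label` helper, and the eight-name training list are all hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict


def features(name):
    """Hypothetical features: the last character and the last two characters."""
    name = name.lower()
    return {"last1": name[-1:], "last2": name[-2:]}


class NaiveBayes:
    """Categorical Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training

    def train(self, examples):
        for name, label in examples:
            self.label_counts[label] += 1
            for feat, value in features(name).items():
                self.value_counts[(label, feat)][value] += 1
                self.vocab[feat].add(value)

    def log_scores(self, name):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior P(label)
            for feat, value in features(name).items():
                count = self.value_counts[(label, feat)][value]
                # One extra slot in the vocabulary size reserves mass for unseen values.
                denom = n + self.alpha * (len(self.vocab[feat]) + 1)
                score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def classify(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


def accuracy_by_label(clf, examples):
    """Per-label accuracy, mirroring the per-sex figures reported in the abstract."""
    correct, seen = Counter(), Counter()
    for name, label in examples:
        seen[label] += 1
        correct[label] += clf.classify(name) == label
    return {label: correct[label] / seen[label] for label in seen}


# Toy, made-up training data (not the paper's data set).
TRAIN = [
    ("Maria", "female"), ("Anna", "female"), ("Julia", "female"), ("Sophia", "female"),
    ("John", "male"), ("Robert", "male"), ("David", "male"), ("Kevin", "male"),
]

clf = NaiveBayes()
clf.train(TRAIN)
print(clf.classify("Olivia"))
print(accuracy_by_label(clf, [("Olivia", "female"), ("Martin", "male")]))
```

Scores are summed in log space to avoid underflow when many features are multiplied. A real system for Korean names would likely need different features, for example individual syllables of the given name, which is an assumption here rather than something the abstract specifies.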
