http://dx.doi.org/10.3743/KOSIM.2020.37.1.001

A Study on Book Categorization in Social Sciences Using kNN Classifiers and Table of Contents Text  

Lee, Yong-Gu (Department of Library and Information Science, Keimyung University)
Publication Information
Journal of the Korean Society for Information Management / v.37, no.1, 2020, pp. 1-21
Abstract
This study applied automatic classification, using table of contents (TOC) text, to 6,253 social science books drawn from a university library's new-arrivals list. The k-nearest neighbors (kNN) algorithm was used as the classifier, and the ten divisions at the second level of DDC main class 300, as assigned to the books by the library, served as the classes (labels). The features were keywords extracted from the titles and TOCs of the books. The TOCs were obtained through the OpenAPI of an Internet bookstore. The results showed that TOC features improved both classification recall and precision. With its rich features, the TOC was also shown to reduce the overfitting problem caused by imbalanced data. Because law and education have high topic specificity within the social sciences, title features alone can yield good classification performance in these fields.
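The classification setup described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the whitespace tokenizer, raw term counts, cosine similarity, majority voting, and the six toy records are all assumptions made for illustration, and the function names (`vectorize`, `cosine`, `knn_predict`) are hypothetical. The labels are DDC second-level divisions of class 300 (340 Law, 370 Education, 330 Economics).

```python
import math
from collections import Counter


def vectorize(title, toc=""):
    """Build a bag-of-words vector from title and TOC keywords.
    Assumption: whitespace tokenization and raw term counts."""
    return Counter((title + " " + toc).lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, training, k=3):
    """Return the majority label among the k most similar training books."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set (invented for illustration): (feature vector, DDC division).
training = [
    (vectorize("Contract law", "offer acceptance breach remedies"), "340"),
    (vectorize("Criminal procedure", "arrest trial evidence appeal"), "340"),
    (vectorize("Classroom teaching", "curriculum lesson assessment students"), "370"),
    (vectorize("Education policy", "schools teachers curriculum reform"), "370"),
    (vectorize("Principles of economics", "markets supply demand inflation"), "330"),
    (vectorize("Labor markets", "wages employment unions demand"), "330"),
]

# A vague title gives the classifier little to work with; the TOC terms
# are what connect this book to its nearest neighbors.
query = vectorize("Foundations", "curriculum design teachers assessment")
print(knn_predict(query, training, k=3))
```

In this toy data the title "Foundations" shares no terms with any training record, so the prediction rests entirely on the TOC keywords, which is the kind of effect the abstract attributes to the richer TOC features.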
Keywords
Table of contents; kNN classifier; book categorization; DDC (Dewey Decimal Classification)