http://dx.doi.org/10.3743/KOSIM.2018.35.2.063

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning  

Yuk, JeeHee (Department of Library and Information Science, Graduate School, Yonsei University)
Song, Min (Department of Library and Information Science, Yonsei University)
Publication Information
Journal of the Korean Society for Information Management / v.35, no.2, 2018, pp. 63-88
Abstract
This research evaluated how classification performance differs across feature selection methods based on the LDA topic model and on Doc2Vec, a word-embedding method grounded in deep learning, as well as across feature corpus sizes and classification algorithms. To identify the feature corpus that yields the best classification performance, experiments were conducted with feature corpora composed differently according to the location of terms within a document and with the size of the feature corpus varied. The deep learning experiments additionally evaluated training frequency and the information considered for context inference. For this purpose the study constructed Disease-35083, a biomedical document dataset consisting of scholarly documents provided by PMC and categorized by disease. The study verifies which type and size of feature corpus produces the highest performance, identifies feature corpora that are efficient at training time and can therefore be extended to specific features, compares deep learning with the existing methods, and suggests an appropriate method for each classification environment.
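The following is a minimal sketch, not the authors' implementation, of the two feature-construction routes the abstract compares: LDA topic proportions versus Doc2Vec paragraph vectors, each fed to the same conventional classifier. It assumes gensim (4.x) and scikit-learn; the toy documents, labels, and hyperparameter values are illustrative placeholders, whereas the actual experiments used the Disease-35083 PMC corpus and varied the composition and size of the feature corpus.

```python
# Sketch of the two document-representation routes compared in the paper.
# All data and parameters below are placeholders, not the paper's settings.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy stand-in for tokenized biomedical documents and disease-category labels.
docs = [["asthma", "airway", "inflammation"],
        ["tumor", "chemotherapy", "remission"],
        ["airway", "bronchial", "asthma"],
        ["tumor", "oncology", "metastasis"]]
labels = [0, 1, 0, 1]

# Route 1: LDA topic distributions as document features.
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, id2word=dictionary, num_topics=2, passes=10, random_state=1)
lda_features = [[p for _, p in lda.get_document_topics(b, minimum_probability=0.0)]
                for b in bow]

# Route 2: Doc2Vec paragraph vectors as document features.
tagged = [TaggedDocument(d, [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=50, seed=1)
d2v_features = [d2v.dv[i] for i in range(len(docs))]

# The same classifier is trained on each feature set; here we simply report
# training accuracy on the toy data to show the comparison pattern.
for name, feats in [("LDA", lda_features), ("Doc2Vec", d2v_features)]:
    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(name, clf.score(feats, labels))
```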
Keywords
document classification; feature selection; text categorization; topic model; deep learning; LDA; Doc2Vec; text mining;
Citations & Related Records
1 Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.   DOI
2 Kim, Dowoo, & Koo, Myoung-Wan (2017). Categorization of Korean news articles based on convolutional neural network using Doc2Vec and Word2Vec. Journal of KIISE, 44(7), 742-747.   DOI
3 Kim, Pan-Jun (2016). An analytical study on performance factors of automatic classification based on machine learning. Journal of Korean Society for Information Management, 33(2), 33-59. http://dx.doi.org/10.3743/KOSIM.2016.33.2.033   DOI
4 Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics (pp. 384-394). Association for Computational Linguistics.
5 Wadbude, R., Gupta, V., Mekala, D., Jindal, J., & Karnick, H. (2016). User bias removal in fine grained sentiment analysis. arXiv preprint arXiv:1612.06821.
6 Wang, P., Xu, B., Xu, J., Tian, G., Liu, C. L., & Hao, H. (2016). Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing, 174, 806-814. https://doi.org/10.1016/j.neucom.2015.09.096   DOI
7 Wang, S., & Manning, C. D. (2012, July). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2 (pp. 90-94). Association for Computational Linguistics.
8 Wang, Z., & Qian, X. (2008, December). Text categorization based on LDA and SVM. In 2008 International Conference on Computer Science and Software Engineering (Vol. 1, pp. 674-677). IEEE. https://doi.org/10.1109/csse.2008.571
9 Wei, X., & Croft, W. B. (2006, August). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 178-185). ACM. https://doi.org/10.1145/1148170.1148204
10 Xing, C., Wang, D., Zhang, X., & Liu, C. (2014, December). Document classification with distributions of word vectors. In 2014 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1-5). IEEE. https://doi.org/10.1109/apsipa.2014.7041633
11 Lilleberg, J., Zhu, Y., & Zhang, Y. (2015, July). Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC) (pp. 136-140). IEEE. https://doi.org/10.1109/icci-cc.2015.7259377
12 Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2), 69-90.   DOI
13 Torkkola, K. (2004). Discriminative features for text document classification. Pattern Analysis and Applications, 6(4), 301-308. https://doi.org/10.1007/s10044-003-0196-8   DOI
14 Li, C., Wang, H., Zhang, Z., Sun, A., & Ma, Z. (2016, July). Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (pp. 165-174). ACM. https://doi.org/10.1145/2911451.2911499
15 Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015, January). Topical word embeddings. In AAAI (pp. 2418-2424).
16 Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), 309-317. https://doi.org/10.1147/rd.14.0309   DOI
17 Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations (pp. 55-60). https://doi.org/10.3115/v1/p14-5010
18 PubMed Central (2017). Retrieved from https://www.ncbi.nlm.nih.gov/pmc/
19 Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
20 Mladenic, D., & Grobelnik, M. (1999). Predicting content from hyperlinks. In Proceedings of the ICML-99 Workshop on Machine Learning in Text Data Analysis. J. Stefan Institute.
21 Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill, 24-51.
22 Tang, D., Qin, B., & Liu, T. (2015). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 1422-1432). https://doi.org/10.18653/v1/d15-1167
23 Lewis, D. D. (1992, February). Feature selection and feature extraction for text categorization. In Proceedings of the workshop on Speech and Natural Language for Computational Linguistics. https://doi.org/10.3115/1075527.1075574
24 Harter, S. P. (1975). A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Journal of the American Society for Information Science, 26(5), 280-289. https://doi.org/10.1002/asi.4630260504   DOI
25 Hofmann, T. (2017, August). Probabilistic latent semantic indexing. In ACM SIGIR Forum (Vol. 51, No. 2, pp. 211-218). ACM.
26 Hughes, M., Li, I., Kotoulas, S., & Suzumura, T. (2017). Medical text classification using convolutional neural networks. Studies in Health Technology and Informatics, 235, 246-250.
27 Jiang, S., Lewris, J., Voltmer, M., & Wang, H. (2016, April). Integrating rich document representations for text classification. In 2016 IEEE Systems and Information Engineering Design Symposium (SIEDS) (pp. 303-308). IEEE. https://doi.org/10.1109/sieds.2016.7489319
28 Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (pp. 957-966).
29 John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 121-129). https://doi.org/10.1016/b978-1-55860-335-6.50023-4
30 Koller, D., & Sahami, M. (1996). Toward optimal feature selection. Stanford InfoLab.
31 Lau, J. H., & Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
32 Le, D. T., & Bernardi, R. (2012, July). Query classification using topic models and support vector machine. In Proceedings of ACL 2012 Student Research Workshop (pp. 19-24). Association for Computational Linguistics.
33 Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(Mar), 1289-1305.
34 Fuhr, N., & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems (TOIS), 9(3), 223-248.   DOI
35 Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (pp. 1188-1196).
36 Lee, Jae-Yun (2005). An empirical study on improving the performance of text categorization considering the relationships between feature selection criteria and weighting methods. Journal of the Korean Library and Information Science Society, 39(2), 123-146.   DOI
37 Atlig, C., Koc, R., & Taka, Y. (2017). Learning-based classification of natural science articles. International Journal of Scientific Research in Information Systems and Engineering (IJSRISE), 2(3), 20-26. http://www.ijsrise.com/index.php/IJSRISE/article/view/52
38 Chung, Yung-Mee (2012). Research in information retrieval (Rev. ed.). Seoul: Yonsei University Press.
39 Jin, Seol A, & Song, Min (2016). Topic modeling based interdisciplinarity measurement in the informatics related journals. Journal of Korean Society for Information Management, 33(1), 7-32. http://doi.org/10.3743/KOSIM.2016.33.1.007   DOI
40 Choi, Sanghee, & Lee, Jae-Yun (2012). Usability analysis of structured abstracts in journal articles for document clustering. Journal of Korean Society for Information Management, 29(1), 331-349. http://dx.doi.org/10.3743/KOSIM.2012.29.1.331   DOI
41 Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137-1155.
42 Bhushan, S. B., Danti, A., & Fernandes, S. L. (2017). A novel integer representation based approach for classification of text documents. In Proceedings of the International Conference on Data Engineering and Communication Technology (pp. 557-564). Springer, Singapore.
43 Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. http://dx.doi.org/10.1145/2133806.2133826   DOI
44 Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
45 Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
46 Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (pp. 160-167). ACM. https://doi.org/10.1145/1390156.1390177