[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.13088/jiis.2019.25.3.019

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents

Park, Jongin (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (School of Management Information Systems, Kookmin University)

Publication Information

Journal of Intelligence and Information Systems / v.25, no.3, 2019, pp. 19-41

Abstract

With the rapidly increasing demand for text data analysis, research and investment in text mining are being actively pursued not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of each collected document is tokenized and structured to convert the original document into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining studies focused mainly on the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data are represented as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form that the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as the demand for document embedding increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into a single vector, is the most widely used.

However, traditional document embedding methods, represented by doc2Vec, generate the vector for each document using all the words included in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single vector. It is therefore difficult for the traditional approach to accurately represent a complex document with multiple subjects as a single vector. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying the method to documents with predefined keywords.

The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. Next, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Then, clustering is conducted on the set of keyword vectors of each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects within each vector. With the proposed multi-vector method, we ascertained that complex documents can be vectorized more accurately by eliminating this interference.
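
The five steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D word vectors, the whitespace tokenizer, the k-means clustering with a fixed k, and the use of the cluster mean as each document vector are all assumptions, because the abstract does not name a specific tokenizer, clustering algorithm, or aggregation rule. All function names are hypothetical.

```python
import numpy as np

# Toy 2-D word vectors: an assumption standing in for a trained
# word-embedding model. The values are illustrative only.
WORD_VECTORS = {
    "embedding": np.array([1.0, 0.1]),
    "word2vec": np.array([0.9, 0.0]),
    "stock": np.array([0.0, 1.0]),
    "portfolio": np.array([0.1, 0.9]),
    "study": np.array([0.5, 0.5]),      # miscellaneous word
    "results": np.array([0.4, 0.6]),    # miscellaneous word
}


def parse(text):
    """Step 1 (Parsing): lowercase and split on whitespace (assumed tokenizer)."""
    return text.lower().split()


def embed(tokens):
    """Step 2 (Word Embedding): map each known token to its vector."""
    return {t: WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS}


def cluster_keywords(vectors, k, n_iter=10):
    """Step 4 (Keyword Clustering): minimal k-means (assumed algorithm)
    with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        dists = np.min(
            [np.linalg.norm(vectors - c, axis=1) for c in centers], axis=0
        )
        centers.append(vectors[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every keyword vector to its nearest center.
        pairwise = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(pairwise, axis=1)
        # Move each center to the mean of its members (keep it if empty).
        centers = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels


def multi_vector_embedding(text, keywords, k=2):
    """Return k vectors for one document, one per keyword cluster."""
    token_vectors = embed(parse(text))                          # Steps 1-2
    kws = [w.lower() for w in keywords if w.lower() in token_vectors]
    kw_vectors = np.array([token_vectors[w] for w in kws])      # Step 3
    labels = cluster_keywords(kw_vectors, k)                    # Step 4
    # Step 5 (Multiple-Vector Generation): cluster mean (assumed rule).
    return [kw_vectors[labels == j].mean(axis=0) for j in range(k)]


def single_vector_embedding(text):
    """Single-vector baseline: mean of all distinct token vectors.
    This is a simple stand-in, not doc2Vec itself."""
    return np.mean(list(embed(parse(text)).values()), axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    doc = "study embedding word2vec stock portfolio results"
    keywords = ["embedding", "word2vec", "stock", "portfolio"]
    query = np.array([1.0, 0.0])  # hypothetical query about the first subject
    print("single:", cosine(single_vector_embedding(doc), query))
    for vec in multi_vector_embedding(doc, keywords, k=2):
        print("multi: ", cosine(vec, query))
```

In this toy document the single averaged vector lies between the two subjects, so it matches a query about either subject only moderately, while one of the two cluster vectors aligns closely with the query. That is the kind of interference among subjects the abstract refers to, shown here only on made-up vectors.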

Keywords

Document Embedding; Multi-Vector Document Embedding; Word Embedding; Text Mining

