[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.7838/jsebs.2014.19.3.107

A Semantic Text Model with Wikipedia-based Concept Space

Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
Chang, Jae-Young (Department of Computer Engineering, Hansung University)

Publication Information

The Journal of Society for e-Business Studies / v.19, no.3, 2014, pp. 107-123

Abstract

Current text mining techniques are limited because conventional text representation models cannot express the semantic or conceptual information in documents written in natural language. Conventional models, which include the vector space model, Boolean model, statistical model, and tensor space model, represent documents as a bag of words. They express a document only through its literal index terms and the frequency-based weights of those terms; that is, they ignore the semantic, sequential-order, and structural information of terms. Most text mining techniques have been developed on the assumption that documents are represented with such bag-of-words models. In the big data era, however, a new text representation paradigm is required that can analyse huge volumes of textual documents more precisely. Our text model treats the 'concept' as an independent space on an equal footing with the 'term' and 'document' spaces of the vector space model, and it expresses the relatedness among the three spaces. To build the concept space, we use Wikipedia articles, each of which defines a single concept. Consequently, a document collection is represented as a 3-order tensor carrying semantic information; we call the proposed model the text cuboid model. Through experiments on the popular 20NewsGroup document corpus, we demonstrate the superiority of the proposed text model in document clustering and concept clustering.
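
The 3-order tensor described in the abstract can be illustrated with a minimal sketch. The abstract does not state how cell weights are computed, so the weighting below (term frequency in the document multiplied by term frequency in the concept's article) is an assumption made purely for illustration, as are the toy documents, the toy concept texts, and the function name `build_cuboid`.

```python
import numpy as np

# Toy inputs; every name and value here is hypothetical.
documents = {
    "d1": "apple releases new phone",
    "d2": "apple pie recipe with fruit",
}
# Each toy "Wikipedia article" stands for exactly one concept.
concepts = {
    "Apple_Inc": "apple company phone computer",
    "Fruit": "apple fruit pie tree",
}


def build_cuboid(documents, concepts):
    """Return a document x term x concept array plus its axis labels.

    Assumed weighting (not taken from the paper):
    cell = tf(term, document) * tf(term, concept article).
    """
    doc_ids = sorted(documents)
    concept_ids = sorted(concepts)
    vocab = sorted({w for text in documents.values() for w in text.split()})
    cuboid = np.zeros((len(doc_ids), len(vocab), len(concept_ids)))
    for i, d in enumerate(doc_ids):
        doc_words = documents[d].split()
        for j, term in enumerate(vocab):
            tf_doc = doc_words.count(term)
            if tf_doc == 0:
                continue  # term absent from this document
            for k, c in enumerate(concept_ids):
                tf_concept = concepts[c].split().count(term)
                cuboid[i, j, k] = tf_doc * tf_concept
    return cuboid, doc_ids, vocab, concept_ids


cuboid, doc_ids, vocab, concept_ids = build_cuboid(documents, concepts)

# Summing over the term axis gives a document-by-concept view, which is the
# kind of matrix a document clustering step could operate on.
doc_concept = cuboid.sum(axis=1)
print(cuboid.shape)
print(dict(zip(concept_ids, doc_concept[0])))
```

In this toy setting the word "apple" contributes to both concepts, while "phone" and "pie" pull each document toward a different one, which a flat term-document matrix cannot express on its own.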

Keywords

Text Representation Model; Text Mining; Wikipedia; Text Cuboid; Concept Space; Vector Space; Tensor Space;

Citations & Related Records

Times Cited By KSCI: 1

Cited By KSCI

Hong, Kee-Joo, A Tensor Space Model based Semantic Search Technique, The Journal of Society for e-Business Studies, Vol. 21, No. 4, p. 1, 2016.
Lee, Ga-hee, Automated Development of Rank-Based Concept Hierarchical Structures using Wikipedia Links, The Journal of Society for e-Business Studies, Vol. 20, No. 4, p. 61, 2015.
Multidimensional Text Warehousing for Automated Text Classification, Journal of Information Technology Research, Vol. 11, No. 2, p. 168, 2018.
A Study on an Automatic Document Classification Model Based on the Korean Standard Industrial Classification, Journal of Intelligence and Information Systems (지능정보연구), Vol. 24, No. 3, p. 221, 2014.
