Representative Labels Selection Technique for Document Cluster using WordNet

Kim, Tae-Hoon; Sohn, Mye

doi:10.7472/jksii.2017.18.2.61

Journal of Internet Computing and Services

Vol. 18, No. 2 / pp. 61-73 / 2017 / 1598-0170 (pISSN) / 2287-1136 (eISSN)

Korean Society for Internet Information


Kim, Tae-Hoon (김태훈), Department of Industrial Engineering, Sungkyunkwan University
Sohn, Mye (손미애), Department of Industrial Engineering, Sungkyunkwan University

Received: 2016.10.10
Reviewed: 2017.02.28
Published: 2017.04.30

https://doi.org/10.7472/jksii.2017.18.2.61

Abstract

In this paper, we propose a document cluster labeling method that uses the information content of the words in each cluster to capture what the cluster means. First, to determine how important each word is within its cluster, we compute a weight for the word from its occurrence frequency and its information content. Next, we use WordNet to identify candidate labels: a candidate label is the lowest (nearest) common hypernym of words in the cluster. Finally, we determine the representative label, which should comprehensively express the meaning and characteristics of the cluster, from the information content of the candidate labels and their importance weights within the cluster.

To evaluate the method, we projected the selected labels and the candidate labels onto WordNet and checked their positions (depths) in the hierarchy. We also counted the words in each cluster that have a given candidate label as a hypernym, and compared the labels selected by this heuristic with representative labels identified by experts. Two measures were used: the suitability of the candidate labels ($Suitability_{cl}$) and the appropriacy of the representative label ($Appropriacy_{rl}$). With the proposed method, the suitability of the candidate labels decreases slightly compared with existing methods, but the computational cost falls to about 20% of that of the existing methods, and the appropriacy of the representative label is better than with existing methods. We therefore expect the method to help data analysts interpret document clusters more easily.
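The pipeline described in the abstract (word weights, candidate labels from common hypernyms, then a representative label) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' formulation: the tiny `PARENT` hierarchy stands in for WordNet, the `IC` values are made-up information-content numbers, the word weight is taken as count times information content, and a candidate is scored as its weighted coverage times its own information content. The abstract does not state the actual formulas.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy hypernym hierarchy (child -> parent), standing in for WordNet.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
    "sparrow": "bird", "bird": "animal", "animal": "entity",
}

# Hypothetical information-content values: higher means more specific.
IC = {
    "entity": 0.0, "animal": 1.0, "bird": 2.0, "carnivore": 2.0,
    "canine": 3.0, "feline": 3.0,
    "dog": 4.0, "wolf": 4.0, "cat": 4.0, "sparrow": 4.0,
}


def path_to_root(word):
    """Return [word, parent, grandparent, ...] following hypernym links."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Nearest node that lies on the hypernym paths of both a and b."""
    on_a_path = set(path_to_root(a))
    # b's path is ordered nearest-first, so the first hit is the lowest one.
    return next((n for n in path_to_root(b) if n in on_a_path), None)


def label_cluster(words):
    """Return (representative label, scores of all candidate labels)."""
    freq = Counter(words)
    # Assumed word weight: occurrence count times information content.
    weight = {w: n * IC[w] for w, n in freq.items()}
    total = sum(weight.values())
    # Candidate labels: lowest common hypernyms of each pair of distinct words.
    candidates = {
        lowest_common_hypernym(a, b) for a, b in combinations(sorted(freq), 2)
    }
    candidates.discard(None)
    scores = {}
    for c in candidates:
        # Weight of the cluster words that have c on their hypernym path.
        covered = sum(wt for w, wt in weight.items() if c in path_to_root(w))
        # Assumed score: weighted coverage times the candidate's own IC.
        scores[c] = covered / total * IC[c]
    return max(scores, key=scores.get), scores


label, scores = label_cluster(["dog", "wolf", "cat", "cat", "sparrow"])
print(label, scores)  # the label chosen for this toy cluster is "carnivore"
```

In this toy cluster, "animal" covers every word but is too general, while "canine" is specific but covers only two of the four distinct words, so the score settles on "carnivore". With real data, the hierarchy and information-content values would come from WordNet and a corpus rather than from hand-written tables.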


Cited by

Body keypoint detection method based on object detection using RGB-D information, vol. 18, no. 6, 2017, https://doi.org/10.7472/jksii.2017.18.6.85
