http://dx.doi.org/10.13088/jiis.2011.17.1.111

Hierarchical Overlapping Clustering to Detect Complex Concepts  

Hong, Su-Jeong (LG U+)
Choi, Joong-Min (Department of Computer Science and Engineering, Hanyang University)
Publication Information
Journal of Intelligence and Information Systems / v.17, no.1, 2011, pp. 111-125
Abstract
Clustering is the process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to the cluster. Clustering thereby facilitates fast and accurate search for relevant documents by narrowing the search range to the documents belonging to related clusters. Effective clustering requires techniques for identifying similar documents and grouping them into a cluster, and for discovering the concept that is most relevant to the cluster. One problem that often arises in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and could not validate the semantic hierarchical relationship between a complex concept and each of the simple concepts. To solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm, which modifies the traditional Agglomerative Hierarchical Clustering algorithm to allow overlapping clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not as a tree but as a lattice in order to detect complex concepts. We developed a system that employs the HOC algorithm for complex concept detection. This system operates in three phases: 1) the preprocessing of documents, 2) clustering using the HOC algorithm, and 3) the validation of semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of terms appearing in the documents.
First, it refines the documents by applying stopword removal and stemming to extract index terms. Then, each index term is assigned a TF-IDF weight, and the x-y coordinate value for each document is determined by combining the TF-IDF values of its terms. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated using the Euclidean distance. Initially, a cluster is generated for each document by grouping the documents that are closest to it. Then, the distance between every pair of clusters is measured, and the closest clusters are grouped into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to validate the appropriateness of the cluster concepts built by the HOC algorithm, checking whether they have meaningful hierarchical relationships. Feature selection is a method of extracting key features from a document by identifying and assigning weight values to important and representative terms in the document. To select key features correctly, a method is needed to determine how each term contributes to the class of the document. Among several methods for this purpose, this paper adopts the $\chi^2$ statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c as a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm greatly contributes to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.
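The term-class dependency score used in the validation phase can be illustrated with a minimal sketch. It assumes the standard contingency-table form of the statistic common in text categorization, $\chi^2(t, c) = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))$; the abstract does not state which variant the paper uses. The function names, the `Document` layout, and the toy documents are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set, Tuple

# Hypothetical document layout: (set of index terms, class label).
Document = Tuple[Set[str], str]


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A row or column of the table is empty, so no dependency can be measured.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def term_class_score(docs: Iterable[Document], term: str, label: str) -> float:
    """Count the four contingency cells over the documents, then score them."""
    a = b = c = d = 0
    for terms, doc_label in docs:
        has_term = term in terms
        in_class = doc_label == label
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


# Toy data, invented for illustration only.
toy_docs = [
    ({"oil", "price"}, "crude"),
    ({"oil", "export"}, "crude"),
    ({"wheat", "price"}, "grain"),
    ({"wheat", "export"}, "grain"),
]
oil_score = term_class_score(toy_docs, "oil", "crude")      # 4.0
price_score = term_class_score(toy_docs, "price", "crude")  # 0.0
print(oil_score, price_score)
```

In this toy data, "oil" appears in every "crude" document and nowhere else, so it receives the highest possible score for four documents, while "price" is spread evenly across both classes and scores zero. A higher score indicates a stronger dependency between the term and the class.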
Keywords
Hierarchical Overlapping Clustering; Complex Concept Detection; Feature Selection; Concept Labeling;
Citations & Related Records
References
1 Liu, B., Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, 2007.
2 Yeh, J. and S. Sie, "Towards Automatic Concept Hierarchy Generation for Specific Knowledge Network", Lecture Notes in Artificial Intelligence(LNAI), Vol.4031 (2006), 982-989.
3 Cleuziou, G., "An Extended Version of the Kmeans Method for Overlapping Clustering", Proc. 19th Intl. Conf. on Pattern Recognition (ICPR 2008), (2008), 1-4.
4 Gath, I. and A. B. Geva, "Unsupervised Optimal Fuzzy Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.11, No.7(1989), 773-780.
5 Jain, A. K., M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol.31, No.3(1999), 264-323.
6 Jonyer, I., D. J. Cook and L. B. Holder, "Graph-Based Hierarchical Conceptual Clustering", Journal of Machine Learning Research, Vol.2(2002), 19-43.
7 Lavine, B. K., "Clustering and Classification of Analytical Data", in R. A. Meyers (ed.), Encyclopedia of Analytical Chemistry, (2000), 1-21.
8 Lewis, D., "Reuters-21578 Text Categorization Test Collection", 2004.
9 Likas, A., N. Vlassis and J. J. Verbeek, "The Global K-means Clustering Algorithm", Pattern Recognition, Vol.36, No.2(2003), 451-461.
10 http://www.daviddlewis.com/resources/testcollecions/reuters21578/.
11 Chuang, S. and L. Chien, "A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments", Proc. 13th ACM Intl. Conf. on Information and Knowledge Management (CIKM'04) (2004), 127-136.
12 Baeza-Yates, R. and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.