• Title/Abstract/Keywords: Data Clustering

A Determination of an Optimal Clustering Method Based on Data Characteristics

  • Kim, Jeong-Hun;Yoo, Kwan-Hee;Nasridinov, Aziz
    • 예술인문사회 융합 멀티미디어 논문지 / Vol. 7, No. 8 / pp.305-314 / 2017
  • Clustering is a method that collects data objects into groups based on their similarity. The performance of state-of-the-art clustering methods differs according to the characteristics of the data. Numerous studies have compared the accuracy of state-of-the-art clustering methods experimentally by applying them to various kinds of datasets. A common problem of these studies is that they consider only which clustering algorithm yields the most accurate results for a particular dataset. They do not consider which factors affect the execution time of each clustering method, or how. Nevertheless, execution time is an important factor in clustering performance when there is no significant difference in accuracy. To address these shortcomings of existing research, we compare the accuracy of four representative clustering methods through a series of experiments on various types of datasets. In addition, we compare practical clustering performance by deriving the time complexity of each method and identifying the factors that influence its performance.

경영사례를 이용한 군집화 유효성 지수의 성능비교 (Performance Comparison of Clustering Validity Indices with Business Applications)

  • 이수현;정영선;김재윤
    • 한국경영과학회지 / Vol. 41, No. 2 / pp.17-33 / 2016
  • Clustering is one of the leading methods for analyzing big data and is used in many different fields. This study deals with clustering validity indices (CVIs), which verify the effectiveness of clustering results. We compare the performance of CVIs on business applications from various fields. The CVIs compared are DU, CH, DB, SVDU, SVCH, and SVDB. The first three are well known from existing research, and the last three are based on support vector data description. The results verify the outstanding performance and the applicability of the CVIs based on support vector data description.
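
As an illustration of what a validity index measures, the following is a minimal sketch of the Dunn index. It assumes that "DU" in the abstract above refers to the Dunn index, which the abstract itself does not spell out. The function name `dunn_index` and the toy clusters are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index: smallest between-cluster distance divided by the
    largest within-cluster diameter. Higher values indicate compact,
    well-separated clusters."""
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("need at least one cluster with two distinct points")
    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


# Hypothetical toy data: the same two groups, once tight and once loose.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 6)], [(8, 0), (8, 6)]]
print(dunn_index(tight) > dunn_index(loose))  # True
```

The tight partition scores higher because its clusters are both smaller and farther apart, which is the behavior a validity index is expected to reward.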

Clustering Algorithm by Grid-based Sampling

  • 박희창;유지현
    • 한국데이터정보과학회:학술대회논문집 / 한국데이터정보과학회 2003 Spring Conference / pp.97-108 / 2003
  • Cluster analysis has been widely used in many applications, such as pattern analysis and recognition, data analysis, image processing, and online and offline market research. Clustering can identify dense and sparse regions among data attributes or object attributes. However, it can take many hours to obtain the desired clusters, because clustering is primitive and exploratory and because large amounts of data are subjected to cluster analysis. In this paper we propose a new clustering method that uses grid-based sampling. It is faster than any traditional clustering method and maintains its accuracy, reducing running time by working on a grid-based sample. Other clustering applications can also become more effective by combining this method with their original methods.
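
The abstract does not describe the sampling scheme in detail, so the following is only a rough sketch of one way grid-based sampling could work. It assumes square cells and an equal sampling fraction per non-empty cell. The function `grid_sample` and its parameters `cell_size`, `fraction`, and `seed` are hypothetical and are not the authors' method.

```python
import random
from collections import defaultdict


def grid_sample(points, cell_size, fraction, seed=0):
    """Bucket points into square grid cells, then draw the same fraction
    (at least one point) from every non-empty cell."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    cells = defaultdict(list)
    for p in points:
        # Cell key: integer grid coordinates of the point.
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)
    sample = []
    for members in cells.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: two dense square regions of 200 points each.
gen = random.Random(1)
points = [(gen.uniform(0, 2), gen.uniform(0, 2)) for _ in range(200)]
points += [(gen.uniform(10, 12), gen.uniform(10, 12)) for _ in range(200)]
sample = grid_sample(points, cell_size=1.0, fraction=0.1)
print(len(points), len(sample))
```

Any clustering method can then be run on `sample` instead of `points`. Because every occupied cell contributes at least one point, sparse regions stay represented while the input shrinks.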

A Clustered Dwarf Structure to Speed up Queries on Data Cubes

  • Bao, Yubin;Leng, Fangling;Wang, Daling;Yu, Ge
    • Journal of Computing Science and Engineering / Vol. 1, No. 2 / pp.195-210 / 2007
  • Dwarf is a highly compressed structure that compresses the cube by eliminating semantic redundancies while computing a data cube. Although it has a high compression ratio, Dwarf is slower to query and more difficult to update because of its structural characteristics. Since the original purpose of a data cube is to speed up queries, we propose two novel clustering methods for query optimization: the recursion clustering method, which clusters the nodes in a recursive manner to speed up point queries, and the hierarchical clustering method, which clusters the nodes of the same dimension to speed up range queries. To facilitate the implementation, we design a partition strategy and a logical clustering mechanism. Experimental results show that our methods can effectively improve query performance on data cubes, and that the recursion clustering method is suitable for both point queries and range queries.

A Study on K-Means Clustering

  • Bae, Wha-Soo;Roh, Se-Won
    • Communications for Statistical Applications and Methods / Vol. 12, No. 2 / pp.497-508 / 2005
  • This paper studies K-means clustering, focusing on initialization, which affects the clustering results in K-means cluster analysis. Four methods (the MA method, the KA method, the Max-Min method, and the Space Partition method) were compared. The clustering results show some differences among these methods: in particular, the MA method sometimes leads to incorrect clustering because of inappropriate initialization, depending on the type of data, and the Max-Min method is shown to be more effective than the other methods, especially when the data size is large.
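
To make the idea of initialization concrete, here is a minimal sketch of a common maximin (farthest-first) seeding rule. It is assumed to be close in spirit to the Max-Min method named above, but the paper's exact procedure may differ. The function `maxmin_init` and the sample points are hypothetical.

```python
from math import dist


def maxmin_init(points, k):
    """Pick k initial centers: start from the first point, then repeatedly
    add the point whose distance to its nearest chosen center is largest."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen center.
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        # Adding a center can only shrink each point's nearest distance.
        nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
    return centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
print(maxmin_init(points, 3))  # [(0, 0), (20, 0), (10, 11)]
```

The seeds end up spread across the three visible groups, which is the property that makes this kind of rule less sensitive to poor starting positions than picking arbitrary points.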

정량적 자료에 대한 효과적인 군집화 과정 및 사용 후 핵연료의 분류에의 적용 (An Effective Clustering Procedure for Quantitative Data and Its Application for the Grouping of the Reusable Nuclear Fuel)

  • 강금석;윤복식;이용주
    • 산업공학 / Vol. 15, No. 2 / pp.182-188 / 2002
  • Clustering is widely used in various fields in order to investigate structural characteristics of the given data. One of the main tasks of clustering is to partition a set of objects into homogeneous groups for the purpose of data reduction. In this paper a simple but computationally efficient clustering procedure is devised and some statistical techniques to validate its clustered results are discussed. In the given procedure, the proper number of clusters and the clustered groups can be determined simultaneously. The whole procedure is applied to a practical clustering problem for the classification of reusable fuels in nuclear power plants.

A Task Scheduling Method after Clustering for Data Intensive Jobs in Heterogeneous Distributed Systems

  • Hajikano, Kazuo;Kanemitsu, Hidehiro;Kim, Moo Wan;Kim, Hee-Dong
    • Journal of Computing Science and Engineering / Vol. 10, No. 1 / pp.9-20 / 2016
  • Several task clustering heuristics have been proposed for allocating tasks in heterogeneous systems to achieve a good response time for data-intensive jobs. However, one challenging problem is how to schedule tasks after they have been allocated by task clustering. We propose a task scheduling method applied after task clustering that leverages the worst schedule length (WSL) as an upper bound on the schedule length. In our proposed method, a task in a WSL sequence is scheduled preferentially to make the WSL smaller. Simulation results show that the response time is improved for several task clustering heuristics. In particular, our proposed scheduling method combined with task clustering outperforms conventional list-based task scheduling methods.

Comprehensive review on Clustering Techniques and its application on High Dimensional Data

  • Alam, Afroj;Muqeem, Mohd;Ahmad, Sultan
    • International Journal of Computer Science & Network Security / Vol. 21, No. 6 / pp.237-244 / 2021
  • Clustering is one of the most powerful unsupervised machine learning techniques for dividing instances into homogeneous groups, which are called clusters. Clustering is mainly used to generate good-quality clusters through which hidden patterns and knowledge can be discovered in large datasets. It has wide application in fields such as medicine, healthcare, gene expression, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering and to understand their problems, along with various approaches to solving them. Among the clustering methods, K-means is better than the others because of its linear time complexity. The paper also focuses on data mining for high-dimensional datasets, its problems, and the existing approaches and their relevance.
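
The linear-time claim can be seen in a minimal sketch of the standard Lloyd iteration for K-means, shown below. This is a generic textbook formulation and not code from the paper. The function `kmeans`, its `iterations` parameter, and the sample points are hypothetical.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm. Each pass computes n * k distances, so for a
    fixed k the cost per iteration grows linearly with the number of points."""
    centers = list(centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest center for every point (n * k distances).
        assignment = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, assignment = kmeans(points, centers=[(0, 0), (10, 0)])
print(centers)     # [(0.0, 1.0), (10.0, 1.0)]
print(assignment)  # [0, 0, 1, 1]
```

By contrast, agglomerative hierarchical clustering typically needs all pairwise distances, which grows quadratically with the number of points.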

항목 유사도를 고려한 트랜잭션 클러스터링 (Transactions Clustering based on Item Similarity)

  • 이상욱;김재련
    • 지능정보연구 / Vol. 9, No. 1 / pp.179-193 / 2003
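  • Clustering groups similar objects from a given set into several clusters in order to understand the character of each cluster, which in practice requires a tool that can measure whether objects are similar. In conventional clustering, similarity has meant that the more attribute values the objects within a cluster share, the higher their similarity, so that highly similar objects form a cluster. In particular, clustering of categorical attributes measures similarity by scoring 1 when attribute values are the same and 0 when they differ. The proposed algorithm points out the problem with representing attribute values only as 0 and 1 and, by introducing the notion that even different attributes can be closely related, shows how similar they are. An algorithm is built in which objects that share no attribute value at all can still form one cluster if the computed similarity shows them to be similar, after which appropriate target marketing can be carried out according to the needs and purchase preferences of the customers in each cluster.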
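
The contrast between 0/1 matching and a graded item similarity can be sketched as follows. The abstract does not give the paper's similarity formula, so the best-match averaging in `soft_similarity` is an assumption made purely for illustration. The item names and the `item_sim` table are hypothetical.

```python
def exact_similarity(t1, t2):
    """0/1 matching: share of items common to both transactions (Jaccard)."""
    a, b = set(t1), set(t2)
    return len(a & b) / len(a | b)


def soft_similarity(t1, t2, item_sim):
    """Average, in both directions, of each item's best match in the
    other transaction. Identical items score 1.0."""
    def sim(x, y):
        if x == y:
            return 1.0
        # Unlisted pairs are treated as unrelated.
        return item_sim.get(frozenset((x, y)), 0.0)

    def best(src, dst):
        return sum(max(sim(x, y) for y in dst) for x in src) / len(src)

    return (best(t1, t2) + best(t2, t1)) / 2


# Hypothetical affinities between different items.
item_sim = {
    frozenset(("cola", "cider")): 0.8,
    frozenset(("chips", "crackers")): 0.7,
}
t1 = ["cola", "chips"]
t2 = ["cider", "crackers"]
print(exact_similarity(t1, t2))  # 0.0
print(soft_similarity(t1, t2, item_sim))
```

The two transactions share no item, so exact matching scores them 0.0, while the graded measure still recognizes them as related and could place them in the same cluster.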

주변 차량 위치 좌표의 고속 클러스터링을 위한 휴리스틱 알고리즘 (Heuristic Algorithm for High-Speed Clustering of Neighbor Vehicular Position Coordinate)

  • 최윤호;유승호;서승우
    • 한국통신학회논문지 / Vol. 39C, No. 4 / pp.343-350 / 2014
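  • Divisive hierarchical clustering starts from a single cluster and repeatedly splits each cluster and moves data between the split clusters until every data item belongs to its own cluster. However, if the input data are selected arbitrarily during this series of recursive calls, many data movements within clusters can result. This makes the method difficult to use in local map generation, where the position coordinates collected by estimating the positions of neighboring vehicles must be clustered at high speed. This paper proposes a new high-speed divisive hierarchical clustering method that uses the vehicle's driving direction information during neighbor vehicle position estimation to remove the arbitrariness of the data that make up the split clusters, improving clustering speed by about 40% on average.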