• Title/Summary/Keyword: Determining the Number of Clusters

Determining on Model-based Clusters of Time Series Data (시계열데이터의 모델기반 클러스터 결정)

  • Jeon, Jin-Ho;Lee, Gye-Sung
    • The Journal of the Korea Contents Association
    • /
    • v.7 no.6
    • /
    • pp.22-30
    • /
    • 2007
  • Most real-world systems, such as the world economy, stock markets, and medical applications, contain a series of dynamic and complex phenomena. One of the common ways to understand these systems is to build a model and analyze the system's behavior. In this paper, we investigate methods for best clustering over time series data. As a first step for clustering, a BIC (Bayesian Information Criterion) approximation is used to determine the number of clusters. A search technique to improve clustering efficiency is also suggested by analyzing the relationship between data size and BIC values. For clustering, two methods, model-based and similarity-based, are analyzed and compared. A number of experiments have been performed to check validity using real data (stock prices). The experiments confirmed that the BIC approximation measure suggests the best number of clusters, provided that the amount of data is relatively large. It is also confirmed that model-based clustering produces more reliable clusters than similarity-based clustering.
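
A minimal sketch of the BIC-based selection step described above, assuming scikit-learn's GaussianMixture as a stand-in for the paper's model-based clustering; the synthetic feature matrix is only a placeholder for the stock-price data used in the experiments.

```python
# Pick the number of clusters by minimizing BIC over candidate k values.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder feature vectors (e.g., per-series summary statistics).
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 4)) for m in (0.0, 3.0, 6.0)])

bic_by_k = {}
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_by_k[k] = gmm.bic(X)  # lower BIC indicates a better fit/complexity trade-off

best_k = min(bic_by_k, key=bic_by_k.get)
print("BIC per k:", bic_by_k)
print("Selected number of clusters:", best_k)
```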

A Method for Determining the Number of Clusters in Data Clustering (데이터 클러스터링에서 클러스터 수 결정방안)

  • Lee, Byung-Soo;Hong, Jiwon;Kim, Sang-Wook
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2011.11a
    • /
    • pp.1268-1269
    • /
    • 2011
  • In the field of data mining, various clustering algorithms exist for analyzing data distributed over a given space. Most of these algorithms, however, require the total number of clusters to be specified in advance, so determining this number beforehand is very important. In this paper, we propose a method that finds the number of clusters present in the data through analysis based on a graph model of the data.
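
The abstract does not describe the specific graph model, so the following is only an illustrative sketch of one graph-based way to estimate a cluster count: build a k-nearest-neighbor graph over the data and count its connected components.

```python
# Estimate the number of clusters as the number of connected components
# of a symmetrized k-nearest-neighbor graph (illustrative only).
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in ((0, 0), (4, 0), (2, 4))])

A = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
A = A.maximum(A.T)  # symmetrize so the graph is undirected
n_clusters, labels = connected_components(A, directed=False)
print("Estimated number of clusters:", n_clusters)
```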

Neighborhood Selection with Intrinsic Partitions (데이터 분포에 기반한 유사 군집 선택법)

  • Kim, Kye-Hyeon;Choi, Seung-Jin
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2007.10c
    • /
    • pp.428-432
    • /
    • 2007
  • We present a novel method for determining k nearest neighbors, which accurately recognizes the underlying clusters in a data set. To this end, we introduce the "tiling neighborhood", which is constructed by tiling a number of small local circles rather than a single circle as existing neighborhood schemes do. We then formulate the problem of determining the tiling neighborhood as a minimax optimization, leading to an efficient message passing algorithm. For several real data sets, our method outperformed the k-nearest neighbor method. The results suggest that our method can be an alternative to existing methods for general classification tasks, especially for data sets with many missing values.

Spectral clustering: summary and recent research issues (스펙트럴 클러스터링 - 요약 및 최근 연구동향)

  • Jeong, Sanghun;Bae, Suhyeon;Kim, Choongrak
    • The Korean Journal of Applied Statistics
    • /
    • v.33 no.2
    • /
    • pp.115-122
    • /
    • 2020
  • K-means clustering uses a spherical or elliptical metric to group data points; however, it does not work well for non-convex data such as concentric circles. Spectral clustering, based on graph theory, is a generalized and robust technique for dealing with non-standard types of data such as non-convex data. Results obtained by spectral clustering often outperform traditional methods such as K-means. In this paper, we review spectral clustering and discuss important issues such as determining the number of clusters K, estimating the scale parameter in the adjacency of two points, and dimension reduction when clustering high-dimensional data.
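
A minimal sketch of the eigengap heuristic, one common way to choose K in spectral clustering as discussed in the review; the Gaussian affinity and its scale parameter below are illustrative assumptions, not the paper's settings.

```python
# Choose K from the largest gap among the leading eigenvalues of the
# normalized graph Laplacian, then run spectral clustering with that K.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

sigma = 0.2  # assumed scale parameter for the Gaussian (RBF) affinity
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists / (2 * sigma ** 2))

# Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

eigvals = np.sort(np.linalg.eigvalsh(L))
gaps = np.diff(eigvals[:10])
K = int(np.argmax(gaps)) + 1  # eigengap heuristic
print("Estimated K:", K)

labels = SpectralClustering(n_clusters=K, affinity="rbf",
                            gamma=1.0 / (2 * sigma ** 2),
                            random_state=0).fit_predict(X)
```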

Market Segmentation Based on Types of Motivations to Visit Coffee Shops (커피전문점 방문동기유형에 따른 시장세분화)

  • Lee, Yong-Sook;Kim, Eun-Jung;Park, Heung-Jin
    • The Korean Journal of Franchise Management
    • /
    • v.7 no.1
    • /
    • pp.21-29
    • /
    • 2016
  • Purpose - The primary purpose of this study is to support effective marketing through market segmentation of coffee shops, by determining how motivations to visit coffee shops relate to the demographic profile of visitors and the characteristics of coffee shop visits, so as to better understand customers in the coffee market. Research design, data, and methodology - Data were collected through self-administered questionnaires given to coffee shop users in Daejeon, Korea. The number of samples used in the data analysis was 253, excluding unusable responses. The data were analyzed through frequency, reliability, and factor analysis using SPSS 20.0. Factor analysis was conducted with principal component analysis and varimax rotation to derive factors with eigenvalues of one or more. In addition, cluster analysis, multivariate ANOVA, and cross-tab analysis were used for market segmentation based on the types of motivation for coffee shop visits. The cluster analysis proceeded as follows: four clusters were derived through hierarchical clustering, and k-means cluster analysis was then carried out using the mean values of the four clusters as the initial seed values. Result - The factor analysis delineated four dimensions of motivation to visit coffee shops: ostentation, hedonic, esthetic, and utility motivation. The cluster analysis yielded four clusters: utility and esthetic seekers, hedonic seekers, utility seekers, and ostentation seekers. To further specify the profile of the four clusters, each cluster was cross-tabulated with socio-demographics and characteristics of coffee shop visits. The four clusters differ significantly from each other across the four types of motivation for coffee shop visits. Conclusions - This study has empirically examined differences in the demographic profile of visitors and the characteristics of coffee shop visits by motivation to visit coffee shops. There are significant differences according to age, educational background, marital status, occupation, and monthly income. In addition, coffee shop use patterns, including frequency of visits, relationship with companions, purpose of visit, information sources, brand type, average expense per visit, and important elements of selection attributes, differed significantly depending on the motivation for coffee shop visits.
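
A minimal sketch of the two-stage clustering procedure described in the methodology (hierarchical clustering to derive four clusters, then k-means seeded with those clusters' means), assuming scikit-learn; the motivation-factor scores are random placeholders standing in for the survey data.

```python
# Two-stage clustering: hierarchical clusters provide the initial k-means seeds.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(2)
scores = rng.normal(size=(253, 4))  # 253 respondents x 4 motivation factors (placeholder)

hier = AgglomerativeClustering(n_clusters=4).fit(scores)
seeds = np.vstack([scores[hier.labels_ == c].mean(axis=0) for c in range(4)])

km = KMeans(n_clusters=4, init=seeds, n_init=1, random_state=0).fit(scores)
print("Cluster sizes:", np.bincount(km.labels_))
```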

Development of an unsupervised learning-based ESG evaluation process for Korean public institutions without label annotation

  • Do Hyeok Yoo;SuJin Bak
    • Journal of the Korea Society of Computer and Information
    • /
    • v.29 no.5
    • /
    • pp.155-164
    • /
    • 2024
  • This study proposes an unsupervised learning-based clustering model to estimate the ESG ratings of domestic public institutions. To achieve this, the optimal number of clusters was determined by comparing spectral clustering and k-means clustering. The results were validated by calculating the Davies-Bouldin Index (DBI), a model performance index for which lower values indicate better performance. The DBI values were 0.734 for spectral clustering and 1.715 for k-means clustering, confirming the superiority of spectral clustering. Furthermore, t-tests and ANOVA were used to reveal statistically significant differences in the ESG non-financial data, and correlation coefficients were used to confirm the relationships between ESG indicators. Based on these results, this study suggests the possibility of estimating the ESG performance ranking of each public institution without existing ESG ratings, by calculating the optimal number of clusters and then determining the sum of averages of the ESG data within each cluster. The proposed model can therefore be employed to evaluate the ESG ratings of various domestic public institutions and is expected to be useful in domestic sustainable-management practice and performance management.
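
A minimal sketch of the comparison described above, assuming scikit-learn: cluster the same feature matrix with spectral clustering and k-means and compare Davies-Bouldin Index values (lower is better). The data matrix and the number of clusters here are placeholders, not the paper's ESG indicators or selected cluster count.

```python
# Compare two clusterings of the same data via the Davies-Bouldin Index.
import numpy as np
from sklearn.cluster import SpectralClustering, KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))  # placeholder for the ESG non-financial indicators

k = 4  # assumed number of clusters; the paper selects this separately
spectral_labels = SpectralClustering(n_clusters=k, random_state=0).fit_predict(X)
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

print("DBI (spectral):", davies_bouldin_score(X, spectral_labels))
print("DBI (k-means): ", davies_bouldin_score(X, kmeans_labels))
```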

Statistical methods for testing tumor heterogeneity (종양 이질성을 검정을 위한 통계적 방법론 연구)

  • Lee, Dong Neuck;Lim, Changwon
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.3
    • /
    • pp.331-348
    • /
    • 2019
  • Understanding tumor heterogeneity arising from differences in the growth patterns and rates of change of metastatic tumors is important for understanding the sensitivity of tumor cells to drugs and for finding appropriate therapies. It is often possible to test for differences in population means using a t-test or ANOVA when the groups of N samples are distinct. However, these statistical methods cannot be used when the groups are not distinguished, as in the data covered in this paper. Statistical methods have been studied to test heterogeneity between samples; the minimum combination t-test is one of them. In this paper, we propose a maximum combination t-test that takes into account combinations that bisect the data at different ratios. We also propose a method based on the idea that examining the heterogeneity of a sample is equivalent to testing whether the optimal number of clusters is one in a cluster analysis. Through a simulation study, we verified that the proposed methods, the maximum combination t-test and the gap statistic, have better type-I error and power than the previously proposed method, and we obtained corresponding results through real data analysis.
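
A minimal sketch of the gap-statistic idea invoked above: compare the within-cluster dispersion of the observed data against reference data drawn uniformly over the same range, and check whether k = 1 is preferred. This follows the standard Tibshirani-style construction and is not taken from the paper.

```python
# Gap statistic: mean log within-cluster dispersion of uniform reference data
# minus that of the observed data; a markedly larger gap at k > 1 argues
# against homogeneity (i.e., against the optimal number of clusters being one).
import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k, n_ref=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [log_wk(rng.uniform(lo, hi, size=X.shape), k, seed) for _ in range(n_ref)]
    return np.mean(ref) - log_wk(X, k, seed)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # heterogeneous sample
gaps = {k: gap_statistic(X, k) for k in (1, 2, 3)}
print(gaps)
```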

Surface Extraction from Point-Sampled Data through Region Growing

  • Vieira, Miguel;Shimada, Kenji
    • International Journal of CAD/CAM
    • /
    • v.5 no.1
    • /
    • pp.19-27
    • /
    • 2005
  • As three-dimensional range scanners make large point clouds a more common initial representation of real world objects, a need arises for algorithms that can efficiently process point sets. In this paper, we present a method for extracting smooth surfaces from dense point clouds. Given an unorganized set of points in space as input, our algorithm first uses principal component analysis to estimate the surface variation at each point. After defining conditions for determining the geometric compatibility of a point and a surface, we examine the points in order of increasing surface variation to find points whose neighborhoods can be closely approximated by a single surface. These neighborhoods become seed regions for region growing. The region growing step clusters points that are geometrically compatible with the approximating surface and refines the surface as the region grows to obtain the best approximation of the largest number of points. When no more points can be added to a region, the algorithm stores the extracted surface. Our algorithm works quickly with little user interaction and requires a fraction of the memory needed for a standard mesh data structure. To demonstrate its usefulness, we show results on large point clouds acquired from real-world objects.
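
A minimal sketch of the first step described above: estimating the surface variation at each point from the eigenvalues of the local covariance, i.e., principal component analysis over each point's k nearest neighbors. The noisy plane is a placeholder for a real range scan, and the neighborhood size is an assumed parameter.

```python
# Surface variation per point: smallest eigenvalue of the local covariance
# divided by the eigenvalue sum; low values indicate locally flat regions,
# which make good seed regions for region growing.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def surface_variation(points, k=15):
    nbrs = NearestNeighbors(n_neighbors=k).fit(points)
    _, idx = nbrs.kneighbors(points)
    variation = np.empty(len(points))
    for i, neighborhood in enumerate(idx):
        cov = np.cov(points[neighborhood].T)
        evals = np.sort(np.linalg.eigvalsh(cov))  # ascending
        variation[i] = evals[0] / evals.sum()
    return variation

rng = np.random.default_rng(5)
pts = np.column_stack([rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500),
                       rng.normal(0, 0.01, 500)])  # noisy planar patch (placeholder)
print(surface_variation(pts)[:5])
```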

Area-constrained NTC Manycore Architecture Design Methodology (면적 제약 조건을 고려한 NTC 매니코어 설계 방법론)

  • Chang, Jin Kyu;Han, Tae Hee
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.866-869
    • /
    • 2015
  • With advances in semiconductor technology, the number of elements that can be integrated in a system-on-chip (SoC) increases exponentially, and voltage scaling is thus indispensable for enhancing energy efficiency. Near-threshold voltage computing (NTC) improves energy efficiency by an order of magnitude, and so it is able to overcome the limitations of conventional super-threshold voltage computing (STC). Although an NTC-based, low-performance manycore system can be used to maximize energy efficiency, it demands a larger number of cores to sustain performance, which results in a considerable increase in area. In this paper, we analyze NTC manycore architecture considering the trade-offs between performance, power, and area. We then propose an algorithmic methodology that optimizes power consumption and area while satisfying the required performance, by determining the constrained number of cores and the sizes of caches and clusters in an NTC environment. Experimental results show that the proposed NTC architecture can reduce power consumption by approximately 16.5% while maintaining the performance of an STC core under the area constraint.
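
Purely as an illustration of the kind of constrained design-space search the abstract describes, the sketch below enumerates hypothetical (core count, cache size) configurations under an area budget and picks the lowest-power one that meets a performance target. The analytical models and constants are made-up placeholders, not the paper's methodology.

```python
# Brute-force design-space exploration under an area constraint (illustrative).
from itertools import product

AREA_BUDGET = 400.0   # hypothetical area budget
PERF_TARGET = 100.0   # hypothetical required performance

def perf(cores, cache_mb):   # placeholder performance model
    return cores * 1.2 * (1 + 0.1 * cache_mb)

def power(cores, cache_mb):  # placeholder power model
    return cores * 0.35 + cache_mb * 0.5

def area(cores, cache_mb):   # placeholder area model
    return cores * 2.0 + cache_mb * 4.0

best = None
for cores, cache_mb in product(range(16, 257, 16), range(1, 33)):
    if area(cores, cache_mb) > AREA_BUDGET or perf(cores, cache_mb) < PERF_TARGET:
        continue
    candidate = (power(cores, cache_mb), cores, cache_mb)
    if best is None or candidate < best:
        best = candidate

print("Selected configuration (power, cores, cache MB):", best)
```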

Simulation Analysis of User Grouping Algorithms for Massive Smart TV Services (시뮬레이션을 이용한 대규모 스마트 TV 서비스 제공을 위한 사용자 그룹핑 알고리즘 성능 분석)

  • Jeon, Cheol;Lee, Kwan-Seob;Jou, Wou-Seok;Jeong, Tai-Kyeong Ted.;Han, Seung-Chul
    • Journal of the Korea Society for Simulation
    • /
    • v.20 no.1
    • /
    • pp.61-67
    • /
    • 2011
  • As one of the emerging next-generation network services, the Smart TV system will lead to drastic changes in the communication and media industries. However, when the number of concurrent users increases rapidly, service quality degrades because providing services to many users simultaneously stresses both the server and the network. The server limitation can be circumvented by deploying server clusters, but the network limitation is far less easy to cope with, due to the difficulty of determining the cause and location of congestion and of provisioning extra resources. To alleviate these problems, a number of schemes have been developed. Prior works mostly focus on reducing user-centric performance metrics of individual connections, such as the round-trip time (RTT), downloading time, or packet loss rate, but tend to ignore the network load caused by concurrent connections and the global network load balance. In this work, we make an in-depth investigation of the issue of user grouping for massive Smart TV services through simulations on an actual Internet test-bed, PlanetLab.