• Title/Summary/Keyword: Hierarchical K-means clustering

Design of Multiple Model Fuzzy Prediction Systems Based on HCKA (HCKA 기반 다중 모델 퍼지 예측 시스템의 구현)

  • Bang, Young-Keun;Shim, Jae-Son;Park, Ha-Yong;Lee, Chul-Heui
    • Proceedings of the KIEE Conference / 2009.07a / pp.1642-1643 / 2009
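  • In general, the performance of a fuzzy prediction system depends heavily on the characteristics of the data and on the clustering technique used to generate the fuzzy sets. However, time series data used for prediction exhibit strong nonlinear characteristics arising from natural phenomena, which places many constraints on implementing a suitable system. In this paper, to handle the nonlinear characteristics of time series appropriately, we implement a multiple model fuzzy prediction system that uses only the effective difference data among the difference data that can be generated from the series, thereby enabling better prediction. For modeling the fuzzy system, we apply a hierarchically structured clustering technique based on cross-correlation analysis (Hierarchical Cross-correlation and K-means Clustering Algorithms: HCKA) to improve the suitability of the rule base for the system.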

A Comparative Study on the Agglomerative and Divisive Methods for Hierarchical Document Clustering (계층적 문서 클러스터링을 위한 응집식 기법과 분할식 기법의 비교 연구)

  • Lee, Jae-Yun;Jeong, Jin-Ah
    • Proceedings of the Korean Society for Information Management Conference / 2005.08a / pp.65-70 / 2005
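  • For hierarchical document clustering, we hypothesized that the relative performance of agglomerative and divisive methods differs by test collection, and that the deciding factor is the depth of classification, that is, the classification level. The reasoning was that divisive methods would have the advantage for broad classification, where only a few splits are needed, while agglomerative methods would have the advantage for fine classification, where only a few merges are needed. Accordingly, we compared the bisecting K-means method, a divisive clustering technique, with the complete-linkage, average-linkage, and Ward methods, which are agglomerative techniques, applying similarity coefficients to test collections with broad and with fine classifications, in order to find the clustering method best suited to the characteristics of each test collection. The experiments showed that what determines the relative performance of agglomerative and divisive methods is not the classification level but the relative variation in cluster size, as measured by the coefficient of variation.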
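
The coefficient of variation of cluster sizes mentioned in the abstract above can be illustrated with a minimal sketch. The function name `cluster_size_cv` and the use of the population standard deviation are assumptions made for illustration; the abstract does not say which variant the authors computed.

```python
from statistics import mean, pstdev

def cluster_size_cv(labels):
    """Coefficient of variation (std / mean) of the cluster sizes.

    Assumption: population standard deviation; the paper's exact
    definition is not given in the abstract.
    """
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    counts = list(sizes.values())
    return pstdev(counts) / mean(counts)

# Hypothetical label assignments for nine documents.
balanced = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # three clusters of equal size
skewed = [0, 0, 0, 0, 0, 0, 0, 1, 2]     # one large cluster, two singletons

print(cluster_size_cv(balanced))
print(cluster_size_cv(skewed))
```

A value of zero means all clusters have the same size; larger values indicate a more uneven size distribution.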

On Constructing NURBS Surface Model from Scattered and Unorganized 3-D Range Data (정렬되지 않은 3차원 거리 데이터로부터의 NURBS 곡면 모델 생성 기법)

  • Park, In-Kyu;Yun, Il-Dong;Lee, Sang-Uk
    • Journal of the Institute of Electronics Engineers of Korea SP / v.37 no.3 / pp.17-30 / 2000
  • In this paper, we propose an efficient algorithm to produce a 3-D surface model from a set of range data, based on the NURBS (Non-Uniform Rational B-Splines) surface fitting technique. It is assumed that the range data are initially unorganized, scattered 3-D points whose connectivity is unknown. The proposed algorithm consists of three steps: initial model approximation, hierarchical representation, and construction of the NURBS patch network. The initial model is approximated by a polyhedral and triangular model using the K-means clustering technique. Then, the initial model is represented by a hierarchically decomposed tree structure. Based on this, a $G^1$ continuous NURBS patch network is constructed efficiently. The computational complexity as well as the modeling error is much reduced by means of hierarchical decomposition and precise approximation of the NURBS control mesh. Experimental results show that the initial model as well as the NURBS patch network are constructed automatically, while the modeling error is observed to be negligible.

Country Clustering Based on Environmental Factors Influencing on Software Piracy (소프트웨어 불법복제에 영향을 미치는 환경 요인에 기반한 국가 분류)

  • Suh, Bomil;Shim, Junho
    • The Journal of Information Systems / v.26 no.4 / pp.227-246 / 2017
  • Purpose: As the importance of software has been emphasized recently, the size of the software market is continuously expanding. The development of the software market is being adversely affected by software piracy. In this study, we classify countries around the world based on the macro-environmental factors that influence software piracy. We also identify the differences in software piracy for each classified type. Design/methodology/approach: A data-driven approach is used in this study. From the BSA, the World Bank, and the OECD, we collect data from 1990 to 2015 for 127 environmental variables of 225 countries. Cronbach's $\alpha$ analysis, item-to-total correlation analysis, and exploratory factor analysis derive 15 constructs from the data. We apply a two-step approach to cluster analysis. In the first step, the number of clusters is determined to be 5 by hierarchical cluster analysis; in the second step, the countries are classified by K-means clustering. We conduct ANOVA and MANOVA to verify the differences in the environmental factors and software piracy among the derived clusters. Findings: The five clusters are identified as underdeveloped countries, developing countries, developed countries, world powers, and developing countries with large markets. There are statistically significant differences in the environmental factors among the clusters. In addition, there are statistically significant differences in software piracy rate, pirated value, and legal software sales among the clusters.
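
The two-step approach described in the abstract above (hierarchical analysis to choose the number of clusters, then K-means to assign members) can be sketched as follows. This is only an illustration under stated assumptions: centroid linkage, the "largest jump in merge distance" rule for choosing k, farthest-first K-means initialization, and the toy data are all hypothetical choices, since the abstract does not specify the linkage or the stopping rule the authors used.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def merge_heights(points):
    """Step 1a: agglomerate with centroid linkage (assumed) and
    return the distance at which each successive merge happens."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return heights

def choose_k(heights):
    """Step 1b: stop merging just before the largest jump in merge
    distance (a common heuristic, assumed here)."""
    n = len(heights) + 1
    jumps = [heights[t + 1] - heights[t] for t in range(len(heights) - 1)]
    t = max(range(len(jumps)), key=jumps.__getitem__)
    return n - (t + 1)  # clusters remaining after merge number t

def kmeans(points, k, max_iter=100):
    """Step 2: plain K-means with farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Hypothetical 2-D factor scores for nine countries in three loose groups.
countries = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
             (20, 0), (20, 1), (21, 0)]
k = choose_k(merge_heights(countries))
labels = kmeans(countries, k)
print(k, labels)
```

Splitting the work this way lets the hierarchy suggest how many groups the data support, while K-means refines the final assignment for that fixed k.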

A Characteristic Analysis and Countermeasure Study of the Hedging of Listed Companies in China Stock Markets

  • WU, Guo-Hua;JIANG, Xiao-Ling;DENG, Su-Ya
    • The Journal of Asian Finance, Economics and Business / v.8 no.10 / pp.147-158 / 2021
  • Due to COVID-19, the risk of price volatility in commodity and equity markets increases. The research and application of hedging is the most effective way to reduce the market risk. Hedging is a risk management strategy employed to offset losses in investments by taking an opposite position in a related asset. We use K-means and hierarchical clustering methods to cluster companies and futures products respectively, and analyze the relationship between the number of hedging firms, regional distribution, nature of firms, capital distribution, company size, profitability, number of local Futures Commission Merchants (FCMs), regional location, and listing time. The study shows that listed companies with large scale and good profitability invest more money in hedging, while state-owned enterprises' participation in hedging is more likely to be affected by the company size and the number of local futures commission merchants, and private enterprises are more likely to be affected by the company profitability and the regional location. Listed companies are more willing to choose long-listed and mature futures products for hedging. We also provide policy advice based on our conclusion. So far, there is no study on the characteristics of hedging. This paper fills the gap. The results provide a basis and guidance for people's investment and risk management. Using clustering analysis in hedging study is another innovation of this paper.

Impurity profiling and chemometric analysis of methamphetamine seizures in Korea

  • Shin, Dong Won;Ko, Beom Jun;Cheong, Jae Chul;Lee, Wonho;Kim, Suhkmann;Kim, Jin Young
    • Analytical Science and Technology / v.33 no.2 / pp.98-107 / 2020
  • Methamphetamine (MA) is currently the most abused illicit drug in Korea. MA is produced by chemical synthesis, and the final target drug that is produced contains small amounts of the precursor chemicals, intermediates, and by-products. To identify and quantify these trace compounds in MA seizures, a practical and feasible approach for conducting chromatographic fingerprinting with a suite of traditional chemometric methods and recently introduced machine learning approaches was examined. This was achieved using gas chromatography (GC) coupled with a flame ionization detector (FID) and mass spectrometry (MS). Following appropriate examination of all the peaks in 71 samples, 166 impurities were selected as the characteristic components. Unsupervised (principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-means clustering) and supervised (partial least squares-discriminant analysis (PLS-DA), orthogonal partial least squares-discriminant analysis (OPLS-DA), support vector machines (SVM), and deep neural network (DNN) with Keras) chemometric techniques were employed for classifying the 71 MA seizures. The results of the PCA, HCA, K-means clustering, PLS-DA, OPLS-DA, SVM, and DNN methods for quality evaluation were in good agreement. However, the tested MA seizures possessed distinct features, such as chirality, cutting agents, and boiling points. The study indicated that the established qualitative and semi-quantitative methods will be practical and useful analytical tools for characterizing trace compounds in illicit MA seizures. Moreover, they will provide a statistical basis for identifying the synthesis route, sources of supply, trafficking routes, and connections between seizures, which will support drug law enforcement agencies in their effort to eliminate organized MA crime.

Effective Classification Method of Hierarchical CNN for Multi-Class Outlier Detection (다중 클래스 이상치 탐지를 위한 계층 CNN의 효과적인 클래스 분할 방법)

  • Kim, Jee-Hyun;Lee, Seyoung;Kim, Yerim;Ahn, Seo-Yeong;Park, Saerom
    • Proceedings of the Korean Society of Computer Information Conference / 2022.07a / pp.81-84 / 2022
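  • Outlier detection in the manufacturing industry is an important factor in product quality and in reducing operating costs, and it has recently been automated using deep learning. CNNs are among the deep learning techniques used for outlier detection, and many prior studies have shown that arranging CNNs hierarchically can improve performance relative to a single CNN model. We therefore used the MVTec-AD dataset to investigate whether a hierarchical CNN is effective for the multi-class outlier classification problem. In the experiments, the single CNN achieved an accuracy of 0.7715 and the hierarchical CNN 0.7838, confirming that the hierarchical CNN approach can improve performance on the multi-class outlier detection problem. A drawback of the hierarchical CNN is that the number of models and parameters and the resource usage increase exponentially compared with a single CNN. To retain the advantages of the hierarchical CNN while saving resources, we built hierarchical CNNs from new classes created with the K-means, GMM, and hierarchical clustering algorithms, obtaining accuracies of 0.7930, 0.7891, and 0.7936, respectively. This confirms that, when objects are grouped appropriately with a clustering algorithm, performance similar to or better than building a separate state-judgment model for each object can be achieved while reducing resource usage.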

FCAnalyzer: A Functional Clustering Analysis Tool for Predicted Transcription Regulatory Elements and Gene Ontology Terms

  • Kim, Sang-Bae;Ryu, Gil-Mi;Kim, Young-Jin;Heo, Jee-Yeon;Park, Chan;Oh, Berm-Seok;Kim, Hyung-Lae;Kimm, Ku-Chan;Kim, Kyu-Won;Kim, Young-Youl
    • Genomics & Informatics / v.5 no.1 / pp.10-18 / 2007
  • Numerous studies have reported that genes with similar expression patterns are co-regulated. From gene expression data, we have assumed that genes having similar expression patterns would share similar transcription factor binding sites (TFBSs). These function as the binding regions for transcription factors (TFs) and thereby regulate gene expression. In this context, various analysis tools have been developed. However, they have shortcomings in the combined analysis of expression patterns and significant TFBSs and in the functional analysis of target genes of significantly overrepresented putative regulators. In this study, we present a web-based Functional Clustering Analysis Tool for Predicted Transcription Regulatory Elements and Gene Ontology Terms (FCAnalyzer). This system integrates microarray clustering data with similar expression patterns, and TFBS data in each cluster. FCAnalyzer is designed to perform two independent clustering procedures. The first process clusters gene expression profiles using the K-means clustering method, and the second process clusters predicted TFBSs in the upstream region of previously clustered genes using the hierarchical biclustering method for simultaneous grouping of genes and samples. This system offers retrieved information for predicted TFBSs in each cluster using Match™ in the TRANSFAC database. We used gene ontology term analysis for functional annotation of genes in the same cluster. We also provide the user with a combinatorial TFBS analysis of TFBS pairs. The enrichment in the TFBS analysis and GO term analysis is assessed statistically by calculating P values based on Fisher's exact test, the hypergeometric distribution, and Bonferroni correction. FCAnalyzer is a web-based, user-friendly functional clustering analysis system that facilitates the transcriptional regulatory analysis of co-expressed genes. This system presents the analyses of clustered genes, significant TFBSs, significantly enriched TFBS combinations, their target genes and TFBS-TF pairs.
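
The enrichment statistics named in the abstract above can be sketched in a few lines. This is not FCAnalyzer's code: the function names, the counts, and the example P values are hypothetical, and the sketch only shows the standard upper-tail hypergeometric probability (which equals the one-sided Fisher's exact test P value) followed by a Bonferroni adjustment.

```python
from math import comb

def enrichment_p(total, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from `total` genes, of which `annotated` carry the TFBS or GO term."""
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, x) * comb(total - annotated, cluster_size - x)
        for x in range(hits, upper + 1)
    )
    return favourable / comb(total, cluster_size)

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts: 20 genes overall, 5 carry the site,
# and a cluster of 5 genes contains 3 of them.
p = enrichment_p(total=20, annotated=5, cluster_size=5, hits=3)
adjusted = bonferroni([p, 0.2, 0.5])  # the other two P values are made up
print(p, adjusted)
```

The Bonferroni step is deliberately conservative: it controls the chance of any false positive across all tested sites or terms at the cost of statistical power.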

Design of HCBKA-Based TSK Fuzzy Prediction System with Error Compensation (HCBKA 기반 오차 보정형 TSK 퍼지 예측시스템 설계)

  • Bang, Young-Keun;Lee, Chul-Heui
    • The Transactions of The Korean Institute of Electrical Engineers / v.59 no.6 / pp.1159-1166 / 2010
  • To improve the prediction quality of a nonlinear prediction system, the system must handle the uncertainty of nonlinear data adequately. This paper presents a TSK fuzzy prediction system that can sufficiently account for and deal with the uncertainty of nonlinear data. In the design of the proposed system, HCBKA (Hierarchical Correlationship-Based K-means clustering Algorithm) was used to generate an accurate fuzzy rule base that efficiently controls the output according to the input, and the first-order difference method was applied to reflect various characteristics of the nonlinear data. Also, multiple prediction systems were designed to analyze the prediction tendencies of each difference series generated by the difference method. In addition, to enhance the prediction quality of the proposed system, an error compensation method was proposed that suitably compensates the prediction error of the systems. Finally, the prediction performance of the proposed system was verified by simulating two typical time series examples.
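
The first-order difference method mentioned in the abstract above can be illustrated with a minimal sketch. The helper names and the sample values are hypothetical, and the predicted difference of 1.5 merely stands in for the output of a prediction model; the abstract does not describe how the authors map predictions back to the original scale.

```python
def first_difference(series):
    """Return x[t] - x[t-1] for every consecutive pair in the series."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def undifference(last_observed, predicted_difference):
    """Map a predicted difference back to the original scale."""
    return last_observed + predicted_difference

series = [3.0, 5.0, 4.0, 8.0]                # hypothetical time series
diffs = first_difference(series)             # series the model would be trained on
next_value = undifference(series[-1], 1.5)   # 1.5 is a placeholder prediction
print(diffs, next_value)
```

Differencing removes slow trends from the series, which is one common reason to model the differences rather than the raw values.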

Classification of Daily Precipitation Patterns in South Korea using Multivariate Statistical Methods

  • Mika, Janos;Kim, Baek-Jo;Park, Jong-Kil
    • Journal of Environmental Science International / v.15 no.12 / pp.1125-1139 / 2006
  • The cluster analysis of diurnal precipitation patterns is performed using daily precipitation at 59 stations in South Korea from 1973 to 1996 in the four seasons of each year. The four seasons are shifted forward by 15 days compared to the general ones. The number of clusters is 15 in winter, 16 in spring and autumn, and 26 in summer. One of the classes in each season is the totally dry day, on which precipitation is not observed at any station. This is treated separately in this study. The distribution of the days among the clusters is rather uneven, with rather low area-mean precipitation occurring most frequently. These 4 (seasons) $\times$ 2 (wet and dry days) classes represent more than half (59%) of all days of the year. On the other hand, even the smallest seasonal clusters have at least 5 to 9 members in the 24-year (1973-1996) period of classification. To account for spatial correlation, the cluster analysis is performed directly on the 5 to 8 major non-correlated coefficients of the diurnal precipitation patterns obtained by factor analysis. More specifically, hierarchical clustering based on Euclidean distance and Ward's method of agglomeration is applied. The relative variance explained by the clustering averages 63%, with better capability in spring (66%) and winter (69%) and lower than average in autumn (60%) and summer (59%). By applying weighted relative variances, i.e., dividing the squared deviations by the cluster averages, we obtain even better values, i.e., 78% on average, compared to the same index without clustering. This means that the highest variance remains in the clusters with more precipitation. Besides all statistics necessary for the validation of the final classification, 4 cluster centers are mapped for each season to illustrate the range of typical extremities, paired according to their area-mean precipitation or negative pattern correlation. Possible alternatives to the performed classification and the reasons for their rejection are also discussed, along with a wide spectrum of recommended applications.
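
The "relative variance explained by the clustering" reported in the abstract above can be illustrated with a small sketch. It assumes the usual definition, one minus the ratio of within-cluster to total sum of squares; the abstract does not give the authors' exact formula, and the function name and data below are hypothetical.

```python
def explained_variance(points, labels):
    """1 - (within-cluster sum of squares / total sum of squares)."""
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sum_of_squares(rows, center):
        return sum((r[d] - center[d]) ** 2 for r in rows for d in range(dim))

    total = sum_of_squares(points, mean(points))
    within = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        within += sum_of_squares(members, mean(members))
    return 1.0 - within / total

# Hypothetical one-dimensional factor scores for four days in two clusters.
scores = [(0.0,), (2.0,), (10.0,), (12.0,)]
assignment = [0, 0, 1, 1]
print(explained_variance(scores, assignment))
```

Ward's method is a natural partner for this index because each of its merges is chosen to minimize the increase in the within-cluster sum of squares.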