• Title/Summary/Keyword: Clustering test


Nonparametric analysis of income distributions among different regions based on energy distance with applications to China Health and Nutrition Survey data

  • Ma, Zhihua;Xue, Yishu;Hu, Guanyu
    • Communications for Statistical Applications and Methods / v.26 no.1 / pp.57-67 / 2019
  • Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions in different regions. Traditional methods often compare the income inequality of different regions by assuming parametric forms of the income distributions, or by using summary statistics such as the Gini coefficient. In this paper, we propose a nonparametric procedure to test for heterogeneity in income distributions among different regions, and a K-means procedure for clustering income distributions based on energy distance. Simulation studies show that the energy distance based method gives results competitive with other common methods in hypothesis testing, and that the energy distance based clustering method performs well in the clustering problem. The proposed approaches are applied to data from the China Health and Nutrition Survey 2011. The results indicate that there are significant differences among the income distributions of the 12 provinces in the dataset. Applying a 4-means clustering algorithm, we obtain a clustering of the income distributions of the 12 provinces.
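For readers unfamiliar with energy distance, the following is a minimal sketch of the standard sample (V-statistic) form, 2E|X-Y| - E|X-X'| - E|Y-Y'|, for two one-dimensional samples. The function names are hypothetical, and the exact test statistic and normalization used in the paper may differ.

```python
from itertools import product


def mean_abs_diff(a, b):
    """Average of |x - y| over all pairs (x, y) with x in a and y in b."""
    return sum(abs(x - y) for x, y in product(a, b)) / (len(a) * len(b))


def energy_distance(x, y):
    """Sample energy distance between two 1-D samples (illustrative helper,
    not the paper's code): 2*E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


# Hypothetical income samples from two regions (made-up numbers).
region_a = [0.0, 1.0]
region_b = [5.0, 6.0]
print(energy_distance(region_a, region_b))
```

The value is zero when the two samples coincide and grows as the empirical distributions move apart, which is what makes it usable as a dissimilarity between regional income distributions.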

An Improved K-means Document Clustering using Concept Vectors

  • Shin, Yang-Kyu
    • Journal of the Korean Data and Information Science Society / v.14 no.4 / pp.853-861 / 2003
  • An improved K-means document clustering method is presented, in which a concept vector is maintained for each cluster on the basis of the cosine similarity of text documents. The concept vectors are unit vectors, normalized to lie on the n-dimensional sphere. Because the standard K-means method is sensitive to the initial starting condition, our improvement focuses on choosing the starting condition by estimating the modes of a distribution. The improved K-means clustering algorithm was applied to a set of text documents, called Classic3, to test the efficiency and correctness of the clustering result, and showed a 7% improvement in its worst case.

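The concept-vector idea in the abstract above can be sketched as follows: a cluster's concept vector is the mean of its unit-length document vectors, rescaled to unit length, and a document is assigned to the concept vector with the highest cosine similarity. This is a minimal illustration with hypothetical function names; it does not reproduce the paper's initialization scheme.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def concept_vector(docs):
    """Unit-length mean of a cluster's document vectors (assumed already normalized)."""
    dim = len(docs[0])
    mean = [sum(d[i] for d in docs) / len(docs) for i in range(dim)]
    return normalize(mean)


def assign(doc, concepts):
    """Index of the concept vector most similar to doc.
    For unit vectors the dot product equals the cosine similarity."""
    sims = [sum(a * b for a, b in zip(doc, c)) for c in concepts]
    return sims.index(max(sims))


# Two toy clusters in a 2-term vocabulary (made-up data).
concepts = [
    concept_vector([normalize([1, 0]), normalize([4, 1])]),
    concept_vector([normalize([0, 1]), normalize([1, 4])]),
]
print(assign(normalize([3, 1]), concepts))
```

Alternating the assignment step and the concept-vector update until the assignments stop changing gives the usual K-means loop on the unit sphere.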

Development of a Clustering Model for Automatic Knowledge Classification (지식 분류의 자동화를 위한 클러스터링 모형 연구)

  • 정영미;이재윤
    • Journal of the Korean Society for Information Management / v.18 no.2 / pp.203-230 / 2001
  • The purpose of this study is to develop a document clustering model for automatic classification of knowledge. Two test collections, one of newspaper article texts and one of journal article abstracts, are built for the clustering experiment. Various feature reduction criteria as well as term weighting methods are applied to the term sets of the test collections, and cosine and Jaccard coefficients are used as similarity measures. The performances of the complete linkage and K-means clustering algorithms are compared using different feature selection methods and various term weights. It was found that complete linkage clustering outperforms the K-means algorithm, and that feature reduction up to almost 10% of the total feature sets does not lower the performance of document clustering to any significant extent.


Metastasis Related Gene Exploration Using TwoStep Clustering for Medulloblastoma Microarray Data

  • Ban, Sung-Su;Park, Hee-Chang
    • Korean Data and Information Science Society: Conference Proceedings / 2005.10a / pp.153-159 / 2005
  • Microarray gene expression technology has applications that could refine diagnosis and therapeutic monitoring as well as improve disease prevention through risk assessment and early detection. In particular, microarray expression data can provide important information about specific genes related to metastasis when analyzed appropriately. Various methods for clustering analysis of microarray data have been introduced so far. We used TwoStep clustering to ascertain metastasis-related genes through a t-test. Using t-tests between two groups on two publicly available medulloblastoma microarray data sets, we aimed to find genes significant for metastasis. The paper describes the process in detail, showing how clustering analysis and the t-test are applied to microarray datasets and how the metastasis-associated genes are explored.


A Study on Data Clustering Method Using Local Probability (국부 확률을 이용한 데이터 분류에 관한 연구)

  • Son, Chang-Ho;Choi, Won-Ho;Lee, Jae-Kook
    • Journal of Institute of Control, Robotics and Systems / v.13 no.1 / pp.46-51 / 2007
  • In this paper, we propose a new data clustering method using local probability and hypothesis theory. To cluster the test data set, we analyze its local area using the local probability distribution and decide the candidate class of the data using the mean, standard deviation, variance, etc. To decide the class of each test datum, statistical hypothesis theory is applied to the candidate class. For evaluation, the proposed classification method is compared with the conventional fuzzy c-means method, the k-means algorithm, and the discriminant analysis algorithm. The simulation results show higher accuracy than those of the fuzzy c-means method, the k-means algorithm, and the discriminant analysis algorithm.

Nonlinear structural finite element model updating with a focus on model uncertainty

  • Mehrdad, Ebrahimi;Reza Karami, Mohammadi;Elnaz, Nobahar;Ehsan Noroozinejad, Farsangi
    • Earthquakes and Structures / v.23 no.6 / pp.549-580 / 2022
  • This paper assesses the influence of modeling assumptions and uncertainties on the performance of the non-linear finite element (FE) model updating procedure and the model clustering method. The results of a shaking table test on a four-story steel moment-resisting frame are employed for both calibration and clustering of the FE models. In the first part, simple to detailed non-linear FE models of the test frame are calibrated to minimize the difference between various data features of the models and the structure. To investigate the effect of the chosen data feature, four features are considered: acceleration, displacement, hysteretic energy, and instantaneous features of the responses. In the last part of the work, a model-based clustering approach that groups models of the four-story frame with similar behavior is introduced to detect abnormal ones. The approach combines property derivation, outlier removal based on k-nearest neighbors, and K-means clustering using the specified data features. The clustering results showed correlations among similar models. Moreover, they helped to identify the best strategy for modeling different structural components.
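One common way to realize outlier removal based on k-nearest neighbors is to score each point by its mean distance to its k nearest neighbors and drop points whose score exceeds a threshold. The sketch below assumes that scoring rule and a user-chosen threshold; both are illustrative assumptions, as are the function names, and the paper's actual criterion is not given in the abstract.

```python
import math


def knn_scores(points, k):
    """Mean Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def remove_outliers(points, k, threshold):
    """Keep only points whose k-NN score is at most threshold (assumed rule)."""
    scores = knn_scores(points, k)
    return [p for p, s in zip(points, scores) if s <= threshold]


# Made-up feature vectors for four candidate models; the last one is far away.
features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(remove_outliers(features, k=2, threshold=2.0))
```

The remaining points would then be passed to K-means, so that a single aberrant model does not pull a cluster centroid away from the others.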

DNA Marker Mining of BMS1167 Microsatellite Locus in Hanwoo Chromosome 17

  • Lee, Jea-Young;Lee, Yong-Won;Kwon, Jae-Chul
    • Journal of the Korean Data and Information Science Society / v.17 no.2 / pp.325-333 / 2006
  • We describe tests for detecting and locating quantitative trait loci (QTL) for traits in Hanwoo. LOD scores and a permutation test are described. From the results of a permutation test to detect QTL, we select major DNA markers of the BMS1167 microsatellite locus in Hanwoo chromosome 17 for further analysis. K-means clustering analysis applied to four traits and eight DNA markers in BMS1167 resulted in three cluster groups. We conclude that the major DNA markers of the BMS1167 microsatellite locus in Hanwoo chromosome 17 are markers 100bp, 108bp and 110bp.


A Major DNA Marker Mining of BMS941 Microsatellite Locus in Hanwoo Chromosome 17

  • Lee, Jea-Young;Lee, Yong-Won
    • Journal of the Korean Data and Information Science Society / v.16 no.4 / pp.913-921 / 2005
  • We describe tests for detecting and locating quantitative trait loci (QTL) for traits in Hanwoo. LOD scores and a permutation test are described. From the results of a permutation test to detect QTL, we select major DNA markers of the BMS941 microsatellite locus in Hanwoo chromosome 17 for further analysis. K-means clustering analysis applied to four traits and eight DNA markers in BMS941 resulted in three cluster groups. We conclude that the major DNA markers of the BMS941 microsatellite locus in Hanwoo chromosome 17 are markers 80bp, 85bp, 90bp and 105bp.


Clustering and classification to characterize daily electricity demand (시간단위 전력사용량 시계열 패턴의 군집 및 분류분석)

  • Park, Dain;Yoon, Sanghoo
    • Journal of the Korean Data and Information Science Society / v.28 no.2 / pp.395-406 / 2017
  • The purpose of this study is to identify patterns of daily electricity demand through clustering and classification. The hourly data were collected by KPX (Korea Power Exchange) between 2008 and 2012. Because electricity demand data are time series data, the time trend was removed before extracting the daily demand patterns. We considered k-means clustering, Gaussian mixture model clustering, and functional clustering in order to find the optimal clustering method. Classification analysis was conducted to understand the relationship with external factors such as day of the week, holidays, and weather. The data were divided into training data and test data. The training data consisted of the external factors and cluster labels between 2008 and 2011. The test data were the daily external factors in 2012. Decision tree, random forest, support vector machine, and naive Bayes classifiers were used. As a result, Gaussian model-based clustering and random forest showed the best prediction performance when the number of clusters was 8.

OPTIMIZATION OF THE TEST INTERVALS OF A NUCLEAR SAFETY SYSTEM BY GENETIC ALGORITHMS, SOLUTION CLUSTERING AND FUZZY PREFERENCE ASSIGNMENT

  • Zio, E.;Bazzo, R.
    • Nuclear Engineering and Technology / v.42 no.4 / pp.414-425 / 2010
  • In this paper, a procedure is developed for identifying a number of representative solutions manageable for decision-making in a multiobjective optimization problem concerning the test intervals of the components of a safety system of a nuclear power plant. Pareto Front solutions are identified by a genetic algorithm and then clustered by subtractive clustering into "families". On the basis of the decision maker's preferences, each family is then synthetically represented by a "head of the family" solution. This is done by introducing a scoring system that ranks the solutions with respect to the different objectives; a fuzzy preference assignment is employed for this purpose. Level Diagrams are then used to represent, analyze and interpret the Pareto Fronts reduced to the head-of-the-family solutions.
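As a reminder of what a Pareto Front is, the sketch below filters a finite set of candidate solutions down to the non-dominated ones. It assumes every objective is to be minimized and uses hypothetical function names; it illustrates only the dominance relation, not the genetic algorithm, subtractive clustering, or fuzzy scoring described in the abstract.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives assumed to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(solutions):
    """Solutions that are not dominated by any other solution in the list."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]


# Made-up objective vectors, e.g. (unavailability, cost) for four test-interval plans.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
print(pareto_front(candidates))
```

Here (3, 3) is dropped because (2, 2) is better in both objectives, while the other three are mutually incomparable and form the front that a decision maker would then have to choose among.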