• Title/Summary/Keyword: Clustering sampling

K-means clustering using a center of gravity for grid-based sample (그리드 기반 표본의 무게중심을 이용한 케이-평균군집화)

  • Lee, Sun-Myung; Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society, v.21 no.1, pp.121-128, 2010
  • K-means clustering is an iterative algorithm in which items are moved among clusters until the desired partition is reached. It has been widely used in applications such as market research, pattern analysis and recognition, and image processing, and it can identify dense and sparse regions among data or object attributes. However, because the algorithm is exploratory and primitive, obtaining the desired k clusters can take a long time. In this paper we propose a new k-means clustering method that uses the center of gravity of grid-based samples. It is faster than traditional clustering methods while maintaining accuracy.
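The grid-based center-of-gravity idea can be illustrated with a short sketch (a simplified reading of the abstract, not the authors' implementation): the data are binned into a coarse grid, the center of gravity of each occupied cell serves as a representative sample, and ordinary k-means runs on those representatives weighted by cell size. The grid resolution, the toy data, and the helper name grid_centers_of_gravity are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def grid_centers_of_gravity(X, bins=10):
    """Bin points into a uniform grid and return the mean (center of
    gravity) of the points in each occupied cell, plus cell counts."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Map every point to an integer cell index per dimension.
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * bins).astype(int)
    cells = np.clip(cells, 0, bins - 1)
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    centers = np.vstack([X[inverse == k].mean(axis=0) for k in range(len(keys))])
    counts = np.bincount(inverse)
    return centers, counts

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(300, 2)) for loc in ((0, 0), (4, 4), (0, 4))])

centers, counts = grid_centers_of_gravity(X, bins=15)
# Run k-means on the much smaller set of cell centers, weighted by cell size.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(centers, sample_weight=counts)
labels = km.predict(X)  # assign the original points to the resulting clusters
```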

One-step spectral clustering of weighted variables on single-cell RNA-sequencing data (단세포 RNA 시퀀싱 데이터를 위한 가중변수 스펙트럼 군집화 기법)

  • Park, Min Young; Park, Seyoung
    • The Korean Journal of Applied Statistics, v.33 no.4, pp.511-526, 2020
  • Single-cell RNA-sequencing (scRNA-seq) data consist of the RNA expression of individual cells extracted from large populations of cells. One main purpose of using scRNA-seq data is to identify inter-cellular heterogeneity. However, scRNA-seq data pose statistical challenges for traditional clustering methods because they contain many missing values and a high level of noise due to technical and sampling issues. In this paper, motivated by the analysis of scRNA-seq data, we propose a novel spectral-based clustering method that imposes different weights on genes when computing the similarity between cells. Assigning weights to genes and clustering cells are performed simultaneously in the proposed framework. We solve the resulting non-convex optimization with an iterative algorithm. Both a real data application and a simulation study suggest that the proposed method identifies the underlying clusters better than existing clustering methods.
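As a rough illustration of weighting genes when computing cell-to-cell similarity (not the paper's one-step joint optimization), the following sketch fixes variance-based gene weights, builds a weighted RBF similarity between cells, and applies off-the-shelf spectral clustering; the toy expression matrix and the choice of weights are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
# Toy expression matrix: 120 cells x 200 genes, two cell types differing
# only in the first 20 genes (the rest are noise).
cells_a = rng.normal(0.0, 1.0, size=(60, 200))
cells_b = rng.normal(0.0, 1.0, size=(60, 200))
cells_b[:, :20] += 3.0
X = np.vstack([cells_a, cells_b])

# Fixed gene weights (here: variance across cells, normalized to sum to 1).
w = X.var(axis=0)
w = w / w.sum()

# Weighted squared Euclidean distance between cells, then an RBF similarity.
diff = X[:, None, :] - X[None, :, :]
d2 = (diff ** 2 * w).sum(axis=2)
sigma2 = np.median(d2)
S = np.exp(-d2 / (2 * sigma2))

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
```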

Impact of snowball sampling ratios on network characteristics estimation: A case study of Cyworld (스노우볼 샘플링 비율에 따른 네트워크의 특성 변화: 싸이월드의 사례 연구)

  • Kwak, Hae-Woon; Han, Seung-Yeop; Ahn, Yong-Yeol; Moon, Sue; Jeong, Ha-Woong
    • Proceedings of the Korean Information Science Society Conference, 2006.10d, pp.135-139, 2006
  • Today's social networking services have tens of millions of users and are growing fast. Their sheer size poses a significant challenge in capturing and analyzing their topological characteristics. Snowball sampling is a popular method to crawl and sample network topologies, but it requires a high sampling ratio for accurate estimation of certain metrics. In this work, we evaluate how close the topological characteristics of snowball-sampled networks are to those of the complete network. Instead of using a synthetically generated topology, we use the complete topology of the Cyworld ilchon network. The goal of this work is to determine sampling ratios for accurate estimation of key topological characteristics, such as the degree distribution, the degree correlation, the assortativity, and the clustering coefficient.
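A minimal sketch of wave-by-wave snowball sampling on a graph, not the authors' Cyworld crawler: starting from a seed node, neighbors are added breadth-first until a target fraction of nodes is reached, and a few topological metrics of the sample are compared against the full graph. The synthetic graph and the 20% target ratio are assumptions.

```python
import random
import networkx as nx

def snowball_sample(G, seed, target_ratio=0.2):
    """Breadth-first (wave-by-wave) snowball sample until the sample
    contains target_ratio of the nodes; returns the induced subgraph."""
    target = int(target_ratio * G.number_of_nodes())
    sampled = {seed}
    frontier = [seed]
    while frontier and len(sampled) < target:
        next_frontier = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in sampled:
                    sampled.add(v)
                    next_frontier.append(v)
                    if len(sampled) >= target:
                        break
        frontier = next_frontier
    return G.subgraph(sampled).copy()

random.seed(0)
G = nx.barabasi_albert_graph(5000, 3, seed=0)   # stand-in for a real social graph
sample = snowball_sample(G, seed=0, target_ratio=0.2)

# Compare a few topological characteristics of the sample vs. the complete graph.
for name, g in (("full", G), ("sample", sample)):
    print(name,
          "avg clustering:", round(nx.average_clustering(g), 4),
          "assortativity:", round(nx.degree_assortativity_coefficient(g), 4))
```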

Optimal SVM learning method based on adaptive sparse sampling and granularity shift factor

  • Wen, Hui; Jia, Dongshun; Liu, Zhiqiang; Xu, Hang; Hao, Guangtao
    • KSII Transactions on Internet and Information Systems (TIIS), v.16 no.4, pp.1110-1127, 2022
  • To improve the training efficiency and generalization performance of a support vector machine (SVM) on large-scale data sets, an optimal SVM learning method based on adaptive sparse sampling and a granularity shift factor is presented. The proposed method combines sampling optimization with learner optimization. First, an adaptive sparse sampling method based on potential-function density clustering is designed to adaptively obtain a sparse set of samples, which reduces the training set while effectively approximating the spatial structure of the original sample set. A granularity shift factor method is then constructed to optimize the SVM decision hyperplane, fully considering the neighborhood information of each granularity region in the sparse sampling set. Experiments on an artificial dataset and three benchmark datasets show that the proposed method achieves relatively high training efficiency while ensuring good generalization performance of the learner. Finally, the effectiveness of the proposed method is verified.
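The general idea of shrinking the training set with a cluster-based subsample before fitting the SVM can be sketched as follows; this is not the paper's potential-function density clustering or granularity shift factor, just a simplified stand-in that keeps the points nearest to per-class k-means centers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def cluster_subsample(X, y, n_per_class=300):
    """Pick, per class, the training points closest to k-means centers,
    so the reduced set roughly follows the class's spatial structure."""
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=n_per_class, n_init=4, random_state=0).fit(X[idx])
        for center in km.cluster_centers_:
            # nearest original point to each center
            keep.append(idx[np.argmin(((X[idx] - center) ** 2).sum(axis=1))])
    return np.unique(keep)

keep = cluster_subsample(X_tr, y_tr)
svm_small = SVC(kernel="rbf", gamma="scale").fit(X_tr[keep], y_tr[keep])
print("reduced set size:", keep.size,
      "test accuracy:", round(svm_small.score(X_te, y_te), 4))
```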

A new structural reliability analysis method based on PC-Kriging and adaptive sampling region

  • Yu, Zhenliang; Sun, Zhili; Guo, Fanyi; Cao, Runan; Wang, Jian
    • Structural Engineering and Mechanics, v.82 no.3, pp.271-282, 2022
  • Active-learning surrogate models based on adaptive sampling strategies are increasingly popular in reliability analysis. However, most existing sampling strategies use trial and error to determine the size of the Monte Carlo (MC) candidate sample pool that satisfies the required coefficient of variation of the failure probability, which reduces the computational efficiency of reliability analysis. To avoid this defect, a new method for determining the optimal size of the MC candidate sample pool is proposed, together with a new structural reliability analysis method combining the polynomial chaos-based Kriging model (PC-Kriging) with an adaptive sampling region (PCK-ASR). First, based on the lower limit of the confidence interval, a method for estimating the optimal size of the MC candidate sample pool is proposed. Second, based on the upper limit of the confidence interval, an adaptive sampling region strategy similar to the radial centralized sampling method is developed. Then, the k-means++ clustering technique and the learning function LIF are used to complete the adaptive design of experiments (DoE). Finally, the effectiveness and accuracy of the PCK-ASR method are verified with three numerical examples and one practical engineering example.
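One ingredient named in the abstract, using k-means++ to spread the initial design of experiments over the MC candidate pool, can be sketched in isolation; the limit-state function, pool size, and DoE size below are hypothetical, and no PC-Kriging surrogate or LIF learning function is built here.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

# Hypothetical 2-D performance function for illustration only (not from the paper).
def g(x):
    return 3.0 - (x[:, 0] ** 2 + x[:, 1]) / 2.0

rng = np.random.default_rng(0)
candidates = rng.standard_normal((100_000, 2))   # MC candidate sample pool

# k-means++ picks well-spread candidate points as the initial design of experiments.
doe, doe_idx = kmeans_plusplus(candidates, n_clusters=12, random_state=0)
doe_g = g(doe)                                   # evaluate the performance function at the DoE

# A crude MC failure-probability estimate on the full pool, for reference.
pf_mc = (g(candidates) < 0).mean()
print("initial DoE size:", len(doe), "crude MC Pf:", pf_mc)
```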

Prevalence Rates and Risk Factors of Metabolic Disorder in Urban Adults assessed in Home Visits (가정방문을 통한 일 광역시 성인의 대사증후군 유병률 및 위험요인 조사)

  • Kim, Jong-Im
    • Journal of Home Health Care Nursing, v.16 no.1, pp.12-21, 2009
  • Purpose: This survey-based study aimed to determine the distribution and clustering tendency of metabolic syndrome risk factors in urban residents, along with cluster odds ratios. Methods: Cluster sampling was used to recruit 827 urban participants, and the collected data were analyzed. Results: Regarding the prevalence of the risk factors used to diagnose metabolic syndrome, abdominal obesity was higher in women (69.5%) than in men (34.3%), high blood pressure was higher in men (57%) than in women (46.5%), and elevated blood sugar was higher in men (6.9%) than in women (5.7%). Clustering increased with increasing body mass index (BMI), weight-to-height ratio (W/Ht), and abdominal obesity. Risk factors were 1.7 times more likely in females than in males. Participants with a family history of metabolic syndrome displayed related risk factors 1.5 times more often than participants without a family history. Participants whose BMI classified them as obese were 9.5 times more likely to display metabolic syndrome risk factors than non-obese participants, and obese participants were 20 times more likely to display risk factors than non-obese participants. Conclusion: BMI, W/Ht, and abdominal obesity correlate with the clustering of metabolic syndrome risk factors, and the risk is increased by smoking and family history. Exercise, weight control, and non-smoking are recommended for comprehensive management of the clustering of metabolic syndrome risk factors.
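The kind of odds ratio reported here can be reproduced mechanically from a 2x2 table; a minimal sketch with made-up counts (not the study's data):

```python
import numpy as np

# Hypothetical 2x2 table: rows = obese / non-obese, columns = risk-factor
# clustering present / absent. Counts are illustrative, not from the study.
table = np.array([[120, 40],
                  [ 80, 250]])
a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)
log_or_se = np.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * log_or_se)
print("OR:", round(odds_ratio, 2), "95% CI:", ci.round(2))
```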

Construction and Application of Network Design System for Optimal Water Quality Monitoring in Reservoir (저수지 최적수질측정망 구축시스템 개발 및 적용)

  • Lee, Yo-Sang; Kwon, Se-Hyug; Lee, Sang-Uk; Ban, Yang-Jin
    • Journal of Korea Water Resources Association, v.44 no.4, pp.295-304, 2011
  • For effective water quality management, it is necessary to secure reliable water quality information. A comprehensive, practical monitoring network must account for many factors, such as representative sampling locations, suitable sampling frequencies, water quality variable selection, and budgetary and logistical constraints; among these, sampling location is considered the most important issue. Until now, monitoring networks for water quality management have been designed according to qualitative judgment, which raises problems of representativeness. In this paper, we propose a network design system for optimal water quality monitoring based on statistical techniques. The system is implemented in SAS version 9.2 and configured with a simple input system and user-friendly outputs for the convenience of users. It accepts Excel data for ease of use, with the data for each sampling location stored in a separate sheet. The system produces time plots, a dendrogram, and scatter plots: time plots of water quality variables help identify variables that significantly discriminate sampling locations; similarities between sampling locations are computed from Euclidean distances of principal component variables, coordinates are obtained by multidimensional scaling, and a dendrogram from clustering analysis helps users choose an appropriate number of clusters; scatter plots of principal component variables display the clustering of sampling locations and the representative location.
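The statistical workflow the system automates (principal component scores of sampling locations, Euclidean distances, hierarchical clustering with a dendrogram) can be sketched outside SAS; the data below are made up, and the choice of three components and four clusters is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical data: 12 sampling locations x 6 water quality variables
# (e.g., monthly means of BOD, COD, TN, TP, chlorophyll-a, SS).
X = rng.normal(size=(12, 6))

# Principal component scores of the sampling locations.
scores = PCA(n_components=3).fit_transform(X)

# Euclidean distances between locations in PC space, then hierarchical clustering.
D = pdist(scores, metric="euclidean")
Z = linkage(D, method="average")                 # dendrogram structure
groups = fcluster(Z, t=4, criterion="maxclust")  # e.g., choose 4 clusters
print("cluster of each sampling location:", groups)
```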

Inappropriate Survey Design Analysis of the Korean National Health and Nutrition Examination Survey May Produce Biased Results

  • Kim, Yangho; Park, Sunmin; Kim, Nam-Soo; Lee, Byung-Kook
    • Journal of Preventive Medicine and Public Health, v.46 no.2, pp.96-104, 2013
  • Objectives: The inherent nature of the Korean National Health and Nutrition Examination Survey (KNHANES) design requires special analysis by incorporating sample weights, stratification, and clustering not used in ordinary statistical procedures. Methods: This study investigated the proportion of research papers that have used an appropriate statistical methodology out of the research papers analyzing the KNHANES cited in the PubMed online system from 2007 to 2012. We also compared differences in mean and regression estimates between the ordinary statistical data analyses without sampling weight and design-based data analyses using the KNHANES 2008 to 2010. Results: Of the 247 research articles cited in PubMed, only 19.8% of all articles used survey design analysis, compared with 80.2% of articles that used ordinary statistical analysis, treating KNHANES data as if it were collected using a simple random sampling method. Means and standard errors differed between the ordinary statistical data analyses and design-based analyses, and the standard errors in the design-based analyses tended to be larger than those in the ordinary statistical data analyses. Conclusions: Ignoring complex survey design can result in biased estimates and overstated significance levels. Sample weights, stratification, and clustering of the design must be incorporated into analyses to ensure the development of appropriate estimates and standard errors of these estimates.
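The gap between an ordinary analysis and a design-based analysis can be illustrated with a Taylor-linearized standard error for a weighted mean under stratified, clustered sampling; the variable names, weights, and design below are hypothetical, not the KNHANES file layout.

```python
import numpy as np
import pandas as pd

def design_based_mean(df, y, weight, stratum, psu):
    """Weighted mean and Taylor-linearized SE for a stratified,
    clustered (with-replacement PSU) design."""
    w, yv = df[weight].to_numpy(), df[y].to_numpy()
    mean = np.sum(w * yv) / np.sum(w)
    # Linearized score for the ratio estimator of the mean.
    z = w * (yv - mean) / np.sum(w)
    d = df.assign(_z=z)
    var = 0.0
    for _, s in d.groupby(stratum):
        psu_tot = s.groupby(psu)["_z"].sum().to_numpy()
        n_h = len(psu_tot)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((psu_tot - psu_tot.mean()) ** 2)
    return mean, np.sqrt(var)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "stratum": np.repeat(np.arange(10), 60),
    "psu": np.repeat(np.arange(100), 6),   # 10 PSUs per stratum, 6 persons per PSU
    "weight": rng.uniform(0.5, 3.0, 600),
    "bmi": rng.normal(24, 3, 600),
})
m, se = design_based_mean(df, "bmi", "weight", "stratum", "psu")
print("unweighted mean:", round(df["bmi"].mean(), 3))
print("design-based mean:", round(m, 3), "SE:", round(se, 4))
```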

Bayesian analysis of finite mixture model with cluster-specific random effects (군집 특정 변량효과를 포함한 유한 혼합 모형의 베이지안 분석)

  • Lee, Hyejin; Kyung, Minjung
    • The Korean Journal of Applied Statistics, v.30 no.1, pp.57-68, 2017
  • Clustering algorithms attempt to find a partition of a finite set of objects into a potentially predetermined number of nonempty subsets. Gibbs sampling of a normal mixture of linear mixed regressions with a Dirichlet prior distribution calculates posterior probabilities when the number of clusters is known. Our approach provides simultaneous partitioning and parameter estimation together with the computation of classification probabilities. A Monte Carlo study of curve estimation shows that the model is useful for function estimation. Examples are given to show how these models perform on real data.
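A stripped-down Gibbs sampler for a finite normal mixture with a Dirichlet prior on the weights shows the alternating steps the abstract describes, sampling classification probabilities and component parameters; it omits the cluster-specific random effects and the linear mixed regression structure of the paper's model, and the priors and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data from two normal components.
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])
n, K = len(y), 2

# Priors: Dirichlet(alpha) on mixture weights, Normal(m0, s0^2) on means,
# fixed unit variance to keep the sketch short.
alpha, m0, s0 = np.ones(K), 0.0, 10.0
pi, mu = np.full(K, 1.0 / K), rng.normal(0, 1, K)

for it in range(2000):
    # 1) Sample cluster labels given parameters (classification probabilities).
    logp = np.log(pi) - 0.5 * (y[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p_i) for p_i in p])
    # 2) Sample mixture weights from the Dirichlet posterior.
    counts = np.bincount(z, minlength=K)
    pi = rng.dirichlet(alpha + counts)
    # 3) Sample component means from their normal posteriors.
    for k in range(K):
        var_k = 1.0 / (counts[k] + 1.0 / s0 ** 2)
        mean_k = var_k * (y[z == k].sum() + m0 / s0 ** 2)
        mu[k] = rng.normal(mean_k, np.sqrt(var_k))

print("last draw of weights and means:", pi.round(3), mu.round(3))
```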

External Noise Analysis Algorithm based on FCM Clustering for Nonlinear Maneuvering Target (FCM 클러스터링 기반 비선형 기동표적의 외란분석 알고리즘)

  • Son, Hyun-Seung; Park, Jin-Bae; Joo, Young-Hoon
    • The Transactions of The Korean Institute of Electrical Engineers, v.60 no.12, pp.2346-2351, 2011
  • This paper presents an intelligent external noise analysis method for nonlinear maneuvering targets. After recognizing the maneuvering pattern of the target with the proposed method, we track its state. The external noise can be divided into pure noise and acceleration using only the measurements; the noise component passes through the filtering step, while the acceleration is fed into the dynamic model to compensate the predicted states. The acceleration is the most decisive factor in maneuvering. By dividing, approximating, and compensating the acceleration, we can reduce the tracking error effectively. We use fuzzy c-means (FCM) clustering to divide the external noise. FCM can separate the acceleration from the noise without predefined criteria, deriving them from the measurement data at every sampling time, so the tracking result is adaptive. The proposed method performs target tracking simultaneously with the learning process, so it can be applied to online systems. The proposed method shows remarkable tracking results for both linear and nonlinear maneuvering. Finally, some examples are provided to show the feasibility of the proposed algorithm.
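A basic fuzzy c-means loop (not the paper's tracking filter) can be used to split one-dimensional residuals into a small-magnitude, noise-like group and a large-magnitude, acceleration-like group; the residual data and the two-cluster setting are illustrative assumptions.

```python
import numpy as np

def fcm(X, c=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Basic fuzzy c-means: returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = 1.0 / d ** (2 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(1)
# Hypothetical residuals at each sampling time: mostly small measurement noise,
# plus a burst of larger values playing the role of a maneuver acceleration.
residuals = np.concatenate([rng.normal(0, 0.2, 200), rng.normal(3.0, 0.4, 40)])[:, None]

centers, U = fcm(residuals, c=2)
labels = U.argmax(axis=1)
accel_cluster = np.argmax(np.abs(centers[:, 0]))  # cluster with the larger magnitude
print("cluster centers:", centers.ravel().round(3))
print("samples flagged as acceleration:", int((labels == accel_cluster).sum()))
```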