• Title/Summary/Keyword: statistical clustering method

Independent Feature Subspace Analysis for Gene Expression Data (유전자 발현 데이터의 독립 특징 부공간 해석)

  • Kim, Heijin;Park, Seungjin;Bang, Sung-Yang
    • Proceedings of the Korean Information Science Society Conference / 2002.10c / pp.739-742 / 2002
  • This paper presents a new statistical method, IFSAcycle, an unsupervised learning method for analyzing cell cycle-related gene expression data. IFSAcycle is based on independent feature subspace analysis (IFSA) [3], which generalizes independent component analysis (ICA). Experimental results show the usefulness of IFSA: (1) the ability to assign genes to multiple coexpression pattern groups; (2) the capability of clustering key genes that determine each critical point of the cell cycle.

Classification of Daily Precipitation Patterns in South Korea using Multivariate Statistical Methods

  • Mika, Janos;Kim, Baek-Jo;Park, Jong-Kil
    • Journal of Environmental Science International / v.15 no.12 / pp.1125-1139 / 2006
  • Cluster analysis of diurnal precipitation patterns is performed using daily precipitation at 59 stations in South Korea from 1973 to 1996, separately for the four seasons of each year. The seasons are shifted forward by 15 days relative to the conventional ones. The number of clusters is 15 in winter, 16 in spring and autumn, and 26 in summer. In each season, one class is the totally dry day, on which no precipitation is observed at any station; this class is treated separately in this study. The distribution of days among the clusters is rather uneven, with low area-mean precipitation occurring most frequently. These 4 (seasons)$\times$2 (wet and dry days) classes represent more than half (59%) of all days of the year. On the other hand, even the smallest seasonal clusters contain at least $5\sim9$ members in the 24-year (1973-1996) classification period. To account for spatial correlation, the cluster analysis is performed directly on the leading $5\sim8$ uncorrelated coefficients of the diurnal precipitation patterns obtained by factor analysis. More specifically, hierarchical clustering based on Euclidean distance and Ward's method of agglomeration is applied. The relative variance explained by the clustering averages 63%, with better performance in spring (66%) and winter (69%) and below-average performance in autumn (60%) and summer (59%). By applying weighted relative variances, i.e., dividing the squared deviations by the cluster averages, we obtain even better values, 78% on average, compared to the same index without clustering. This means that the highest variance remains in the clusters with more precipitation. In addition to all statistics necessary for validating the final classification, four cluster centers are mapped for each season to illustrate the range of typical extremes, paired according to their area-mean precipitation or negative pattern correlation. Possible alternatives to the classification and the reasons for their rejection are also discussed, along with a wide spectrum of recommended applications.
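
The "relative variance explained by the clustering" mentioned in this abstract can plausibly be read as one minus the ratio of the within-cluster to the total sum of squares. The Python sketch below assumes that definition, since the abstract does not state the exact formula. The helper name `explained_variance` and the sample values are invented for illustration and are not taken from the study.

```python
from statistics import fmean

def explained_variance(values, labels):
    """Return 1 - SS_within / SS_total for a one-dimensional clustering.

    Assumed definition; the abstract does not give the exact formula.
    """
    grand_mean = fmean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Group the values by their cluster label.
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)

    # Sum squared deviations of each value from its own cluster mean.
    ss_within = 0.0
    for members in clusters.values():
        centre = fmean(members)
        ss_within += sum((v - centre) ** 2 for v in members)

    return 1.0 - ss_within / ss_total

# Invented area-mean precipitation values (mm) with two cluster labels.
precip = [0.5, 1.0, 1.5, 10.0, 12.0, 14.0]
labels = [0, 0, 0, 1, 1, 1]
ratio = explained_variance(precip, labels)
```

A value near 1 means the clusters capture most of the variability, while placing every day in a single cluster gives 0.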

A Case Study on Risk Levels of Shoulder Postures Associated with Work-related Musculoskeletal Disorders at Automobile Manufacturing Industry (자동차 조립업종 작업의 근골격계질환관련 어깨 작업자세 위험도 결정을 위한 사례적 접근)

  • Park, Dong Hyun;Hur, Kuk Kang
    • Journal of the Korean Society of Safety / v.28 no.1 / pp.95-101 / 2013
  • This study sought to develop a basis for a quantitative index of working postures associated with WMSDs (work-related musculoskeletal disorders) that could overcome the practical restrictions of applying typical checklists for WMSDs evaluation. The baseline data were obtained from an automobile manufacturing company (a total of 603 jobs were observed). Specifically, data on shoulder postures were analyzed to obtain a better and more objective method in terms of job relevance than typical methods such as OWAS, RULA, and REBA. The major statistical tools included clustering and logistic regression. The main results can be summarized as follows: 1) The relationships between working postures and WMSDs symptoms at the shoulder were statistically significant based on the results of logistic regression. 2) Based on clustering analysis, three statistically significant levels of WMSDs risk at the shoulder were produced for both flexion and abduction. Specific results were as follows: Shoulder flexion: low risk (< $37.7^{\circ}$), medium risk ($37.7^{\circ}{\sim}70.0^{\circ}$), high risk (> $70.0^{\circ}$); shoulder abduction: low risk (< $26.5^{\circ}$), medium risk ($26.5^{\circ}{\sim}56.8^{\circ}$), high risk (> $56.8^{\circ}$). 3) The sensitivities of the risk levels for shoulder flexion and abduction were 64.0% and 20.6% respectively, while the specificities were 99.1% and 99.3% respectively. The results showed that the shoulder posture data in this study could provide a good basis for job evaluation of WMSDs at the shoulder. This evaluation methodology differs from the methods usually used in WMSDs studies in that it is based on direct job relevance in real working situations. Further evaluation of other body parts in addition to the shoulder would provide more stability and reliability in WMSDs evaluation studies.
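
For readers unfamiliar with the sensitivity and specificity figures quoted in this abstract, the short Python sketch below shows how the two quantities are computed from confusion-matrix counts. The function name and the counts are invented for illustration; they are not the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    # Sensitivity: share of symptomatic cases that were flagged as at risk.
    sensitivity = tp / (tp + fn)
    # Specificity: share of symptom-free cases that were flagged as not at risk.
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Invented counts for illustration only.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=45, fp=5)
```

A high specificity paired with a low sensitivity, as reported for shoulder abduction, means the risk levels rarely flag symptom-free workers but miss many symptomatic ones.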

Classification of Wind Sector in Pohang Region Using Similarity of Time-Series Wind Vectors (시계열 풍속벡터의 유사성을 이용한 포항지역 바람권역 분류)

  • Kim, Hyun-Goo;Kim, Jinsol;Kang, Yong-Heack;Park, Hyeong-Dong
    • Journal of the Korean Solar Energy Society / v.36 no.1 / pp.11-18 / 2016
  • The local wind systems in the Pohang region were categorized into wind sectors. Thorough knowledge of the region is required for wind resource assessment, wind environment analysis, and atmospheric environmental impact assessment, since the region has outstanding wind resources, lies on typhoon paths, and contains large-scale atmospheric pollution sources. To overcome the resolution limitations of meteorological datasets and the problems with the categorization criteria of preceding studies, the high-resolution wind resource map of the Korea Institute of Energy Research was used as time-series meteorological data, and a two-step method was adopted: the clustering coefficient was determined through hierarchical clustering analysis, and the wind sectors were then categorized through non-hierarchical K-means clustering analysis. The Euclidean distance was proposed as the similarity measure between normalized time-series wind vectors. The meteorological-statistical characteristics of the mean vector wind distribution and the meteorological variables of each wind sector were compared. The comparison confirmed significant differences among wind sectors in terrain elevation, mean wind speed, Weibull shape parameter, etc.

Finding Genes Discriminating Smokers from Non-smokers by Applying a Growing Self-organizing Clustering Method to Large Airway Epithelium Cell Microarray Data

  • Shahdoust, Maryam;Hajizadeh, Ebrahim;Mozdarani, Hossein;Chehrei, Ali
    • Asian Pacific Journal of Cancer Prevention / v.14 no.1 / pp.111-116 / 2013
  • Background: Cigarette smoking is the major risk factor for the development of lung cancer. Identifying the effects of tobacco on airway gene expression may provide insight into the causes. This research aimed to compare gene expression in large airway epithelium cells of normal smokers (n=13) and non-smokers (n=9) in order to find genes that discriminate between the two groups and to assess the effects of cigarette smoking on large airway epithelium cells. Materials and Methods: Genes discriminating smokers from non-smokers were identified by applying a neural network clustering method, growing self-organizing maps (GSOM), to microarray data according to class discrimination scores. An index was computed based on the difference between the mean gene expression in the two groups. This clustering approach made it possible to compare thousands of genes simultaneously. Results: The applied approach compared the means of 7,129 genes in smokers and non-smokers simultaneously and classified the genes of large airway epithelium cells that were differentially expressed in smokers compared with non-smokers. Seven genes were identified with the largest expression differences between smokers and non-smokers: NQO1, H19, ALDH3A1, AKR1C1, ABHD2, GPX2 and ADH7. Most (NQO1, ALDH3A1, AKR1C1, H19 and GPX2) are known to be clinically notable in lung cancer studies. Furthermore, statistical discriminant analysis showed that these genes could classify samples as smokers or non-smokers with 100% accuracy. In the resulting GSOM map, other nodes with high average discrimination scores included genes with alterations strongly related to lung cancer, such as AKR1C3, CYP1B1, UCHL1 and AKR1B10. Conclusions: This clustering approach, which compares the expression of thousands of genes at the same time, revealed alterations in normal smokers. Most of the identified genes were strongly relevant to lung cancer in the existing literature. The genes may be used to identify smokers at increased risk of lung cancer. A large-sample study is now recommended to determine the relations between the genes ABHD2 and ADH7 and smoking.

Classification of Forest Cover Types in the Baekdudaegan, South Korea

  • Chung, Sang Hoon;Lee, Sang Tae
    • Journal of Forest and Environmental Science / v.37 no.4 / pp.269-279 / 2021
  • This study was carried out to introduce the forest cover types of the Baekdudaegan, which is inhabited by numerous native tree species. In order to understand the vegetation distribution characteristics of the Baekdudaegan, a vegetation survey was conducted on its 20 major mountains. The vegetation data were collected from 3,959 sample points by the point-centered quarter method. Each mountain was classified into 4-7 forests using various multivariate statistical methods such as cluster analysis, indicator species analysis, multiple discriminant analysis, and species composition analysis. The forests were classified mainly according to the relative abundance of Quercus mongolica. A total of 111 forests were classified, and these were integrated into the following nine forest cover types using the percentage similarity index and by clustering according to vegetation type: 1) Mongolian oak, 2) Mongolian oak and other deciduous, 3) oaks (mixed Quercus spp.), 4) Korean red pine, 5) Korean red pine and oaks, 6) ash, 7) mixed mesophytic, 8) subalpine zone coniferous, and 9) miscellaneous forest. Forests in the subalpine zone coniferous type were characterized by similar environmental conditions, while the miscellaneous type comprised forests that did not fit into any other category.

Word Image Decomposition from Image Regions in Document Images using Statistical Analyses (문서 영상의 그림 영역에서 통계적 분석을 이용한 단어 영상 추출)

  • Jeong, Chang-Bu;Kim, Soo-Hyung
    • The KIPS Transactions:PartB / v.13B no.6 s.109 / pp.591-600 / 2006
  • This paper describes the development and implementation of an algorithm that uses statistical analyses to decompose word images from image regions containing mixed text and graphics in document images. To decompose word images from image regions, the character components need to be separated from the graphic components. For this process, we propose a method that separates them through box-plot analysis of the statistics of structural components. The accuracy of this method is not sensitive to changes in the images because the separation criterion is defined by the statistics of the components. The character regions are then determined by analyzing the local crowdedness of the separated character components. Finally, we divide the character regions into text lines and word images using projection profile analysis, gap clustering, special symbol detection, etc. The proposed system can reduce the influence of changes in the images because it uses a criterion based on the statistics of the image regions. We also tested the proposed method in a document image processing system for keyword spotting and showed the need for further study of the proposed method.

Noisy Band Removal Using Band Correlation in Hyperspectral Images

  • Huan, Nguyen Van;Kim, Hak-Il
    • Korean Journal of Remote Sensing / v.25 no.3 / pp.263-270 / 2009
  • Noisy band removal is a crucial step before spectral matching, since noisy bands can distort the typical shape of spectral reflectance and degrade the matching results. This paper proposes a statistical noisy band removal method for hyperspectral data using the correlation coefficient between two bands. The correlation coefficient measures the strength and direction of a linear relationship between two random variables. If each band of the hyperspectral data is considered a random variable, the correlation between two signal bands is high, whereas a noisy band produces a low correlation due to ill-correlativeness and undirectedness. The unsupervised k-nearest neighbor clustering method is implemented with three well-accepted spectral matching measures, namely ED, SAM and SID, in order to validate the proposed method. This paper also proposes a hierarchical scheme for combining those measures. Finally, a separability assessment based on the between-class and within-class scatter matrices is carried out to evaluate the applicability of the proposed noisy band removal method. The paper also compares the spectral matching measures. Experimental results on a 228-band hyperspectral data set show that while the SAM measure is rather resistant to noise, the performance of the SID measure is more sensitive to it.
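
The idea that a noisy band correlates poorly with signal bands can be sketched in a few lines of Python. This is only an illustration under stated assumptions: comparing each band against all other bands, taking the best correlation, and using a threshold of 0.5 are choices made here, not details from the paper. The names `pearson` and `flag_noisy_bands` and the tiny data set are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def flag_noisy_bands(bands, threshold=0.5):
    """Return indices of bands whose best correlation with any other band is below threshold.

    bands: list of equal-length lists, one list of pixel values per spectral band.
    The comparison rule and the threshold are assumptions for illustration.
    """
    noisy = []
    for i, band in enumerate(bands):
        best = max(pearson(band, other) for j, other in enumerate(bands) if j != i)
        if best < threshold:
            noisy.append(i)
    return noisy

# Invented data: 4 bands x 6 pixels; band 2 is unrelated to the others.
bands = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],
    [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    [0.9, 2.2, 3.1, 3.9, 5.0, 6.1],
]
noisy = flag_noisy_bands(bands)
```

Taking the best correlation rather than only adjacent bands keeps a clean band from being flagged merely because its neighbour is noisy.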

A Study on the Triphone Replacement in a Speech Recognition System with DMS Phoneme Models

  • Lee, Gang-Seong
    • The Journal of the Acoustical Society of Korea / v.18 no.3E / pp.21-25 / 1999
  • This paper proposes methods that replace a missing triphone with a new one selected or created from existing triphones, and compares the results. The recognition system uses the DMS (Dynamic Multisection) model for acoustic modeling. DMS is a statistical recognition technique suited to small- or mid-size vocabulary systems, while HMM (Hidden Markov Model) is a probabilistic technique suitable for mid-size or large systems. Accordingly, it is reasonable to use an effective algorithm suited to DMS rather than a complicated method such as the polyphone clustering technique employed in HMM-based systems. In this paper, four methods of filling in missing triphones are presented. The results show that a proposed replacement algorithm works almost as well as if all the necessary triphones existed. The experiments are performed on a 500+ word DMS speech recognizer.

Segmentation of Movie Consumption : An Application of Latent Class Analysis to Korean Film Industry (잠재계층분석기법(Latent Class Analysis)을 활용한 영화 소비자 세분화에 관한 연구)

  • Koo, Kay-Ryung;Lee, Jang-Hyuk
    • Journal of the Korean Operations Research and Management Science Society / v.36 no.4 / pp.161-184 / 2011
  • As movie demand becomes more and more diversified, movie-related firms need to segment the whole heterogeneous market into a number of small homogeneous markets in order to identify the specific needs of consumer groups. Relevant market segmentation helps them develop valuable offers for target segments through effective marketing planning. In this article, we introduce various segmentation methods and compare their advantages and disadvantages. In particular, we analyze the "2009~2010 consumer survey data of Korean Film Industry" using Latent Class Analysis (LCA), a statistical segmentation method that identifies an exclusive set of latent classes based on consumers' responses to observed categorical and numerical variables. We apply PROC LCA, a new SAS procedure for conducting LCA, and obtain 11 distinct clusters showing unique characteristics in their buying behaviors.