• Title/Summary/Keyword: High Dimensionality Data


Comprehensive review on Clustering Techniques and its application on High Dimensional Data

  • Alam, Afroj;Muqeem, Mohd;Ahmad, Sultan
    • International Journal of Computer Science & Network Security
    • /
    • v.21 no.6
    • /
    • pp.237-244
    • /
    • 2021
  • Clustering is one of the most powerful unsupervised machine learning techniques; it divides instances into homogeneous groups called clusters. Clustering is mainly used to generate good-quality clusters through which hidden patterns and knowledge can be discovered in large datasets. It has a huge range of applications in fields such as medicine, healthcare, gene expression analysis, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering, to understand their problems, and to survey approaches for solving them. Among the various clustering algorithms, K-means is often preferred due to its linear time complexity. The paper also focuses on data mining for high-dimensional datasets, the problems such data pose, and the existing approaches for handling them.
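The K-means algorithm the review singles out for its linear per-iteration complexity can be sketched in a few lines. Below is a minimal NumPy version of Lloyd's algorithm; the farthest-point initialisation is an assumption of this sketch, not something the review prescribes:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm; each iteration costs O(n * k * d)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation (a choice of this sketch).
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # Update step: each center moves to the mean of its points.
        new_centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The linear cost per iteration (in the number of instances) is what makes K-means attractive for the large datasets the review discusses, although the number of iterations to convergence is not bounded in general.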

Life Satisfaction Scale for Elderly: Revisited (구조적 차원성 탐색을 통한 '노인 생활 만족도 척도'의 재발견: 최성재의 '노인 생활 만족도 척도'를 중심으로)

  • Choi, Hye-Ji;Lee, Young-Boon
    • Korean Journal of Social Welfare
    • /
    • v.58 no.3
    • /
    • pp.27-49
    • /
    • 2006
  • The purpose of the present study was to investigate the dimensionality and psychometric properties of the theoretical constructs identified in the 'Life Satisfaction Scale for Elderly (LSSE)', developed by Choi, Sung-Jae in 1986. Data were obtained from 'The survey of health and welfare status of the elderly aged 65 or older in Chung-Choo city'. The subjects were 275 elderly persons. Results showed that the LSSE had a multi-dimensional structure with three theoretical constructs, named 'positive affect and subjective satisfaction', 'negative self-image and affect', and 'self-value'. The three constructs showed high levels of reliability and validity based on internal structure. 'Positive affect and subjective satisfaction' and 'negative self-image and affect' showed high levels of convergent and discriminant validity; 'self-value' had a high level of convergent validity but only an acceptable level of discriminant validity. The results revealed a difference in the theoretical dimensionality of the LSSE between this study and Choi's study, which described the LSSE as a single dimension. However, the finding of multi-dimensionality supports existing studies which hold that life satisfaction has a multi-dimensional structure.


Speaker Identification Using GMM Based on Local Fuzzy PCA (국부 퍼지 클러스터링 PCA를 갖는 GMM을 이용한 화자 식별)

  • Lee, Ki-Yong
    • Speech Sciences
    • /
    • v.10 no.4
    • /
    • pp.159-166
    • /
    • 2003
  • To reduce the high dimensionality required for training feature vectors in speaker identification, we propose an efficient GMM based on local PCA with fuzzy clustering. The proposed method first partitions the data space into several disjoint clusters by fuzzy clustering and then performs PCA using the fuzzy covariance matrix of each cluster. Finally, the GMM for each speaker is obtained from the dimension-reduced feature vectors of each cluster. Compared to the conventional GMM with a diagonal covariance matrix, the proposed method requires less storage and runs faster while delivering the same performance.
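The two-stage pipeline the abstract describes — fuzzy clustering, then PCA on a fuzzy covariance matrix within each cluster — can be illustrated roughly as follows. This is a minimal NumPy sketch of standard fuzzy c-means followed by per-cluster fuzzy-weighted PCA; the paper's exact membership exponent, stopping criteria, and the final GMM training step are not reproduced here:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=50, seed=0):
    """Standard fuzzy c-means: soft memberships U (n x c) and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))          # classic FCM membership update
        U = inv / inv.sum(1, keepdims=True)
    return U, centers

def local_fuzzy_pca(X, U, centers, n_comp):
    """Per-cluster PCA on a fuzzy (membership-weighted) covariance matrix;
    each point is projected in the cluster where its membership is largest."""
    hard = U.argmax(1)
    Z = np.empty((len(X), n_comp))
    for j in range(U.shape[1]):
        w = U[:, j] ** 2                       # fuzzy weights for cluster j
        diff = X - centers[j]
        cov = (diff * w[:, None]).T @ diff / w.sum()
        _, vecs = np.linalg.eigh(cov)
        basis = vecs[:, ::-1][:, :n_comp]      # leading eigenvectors
        mask = hard == j
        Z[mask] = (X[mask] - centers[j]) @ basis
    return Z, hard
```

A GMM would then be fitted on the reduced vectors `Z` cluster by cluster, which is where the storage saving over a full-dimensional model comes from.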


Study on Failure Classification of Missile Seekers Using Inspection Data from Production and Manufacturing Phases (생산 및 제조 단계의 검사 데이터를 이용한 유도탄 탐색기의 고장 분류 연구)

  • Ye-Eun Jeong;Kihyun Kim;Seong-Mok Kim;Youn-Ho Lee;Ji-Won Kim;Hwa-Young Yong;Jae-Woo Jung;Jung-Won Park;Yong Soo Kim
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.47 no.2
    • /
    • pp.30-39
    • /
    • 2024
  • This study introduces a novel approach for identifying potential failure risks in missile manufacturing by leveraging Quality Inspection Management (QIM) data to address the challenges presented by a dataset comprising 666 variables and data imbalances. The use of SMOTE for data augmentation and Lasso regression for dimensionality reduction, followed by the application of a Random Forest model, results in a 99.40% accuracy rate in classifying missiles with a high likelihood of failure. Such measures enable the preemptive identification of missiles at heightened risk of failure, thereby mitigating the risk of field failures and extending missile service life. The combination of Lasso regression and Random Forest is used to pinpoint the critical variables and test items that most strongly influence failure, with particular emphasis on variables related to performance and connection resistance. Moreover, the research highlights the potential for broadening the scope of data-driven decision-making within quality control systems, including the refinement of maintenance strategies and the adjustment of control limits for essential test items.
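Of the three stages in this pipeline (SMOTE augmentation, Lasso-based variable screening, Random Forest classification), the SMOTE step is the one whose mechanics are least obvious from the name. A minimal sketch of the core SMOTE idea — synthesising new minority samples by interpolating between a minority sample and one of its nearest minority neighbours — might look like this (parameter names are illustrative, not taken from the paper):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE: create n_new synthetic minority samples by linear
    interpolation between each chosen sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-distances
    nbrs = np.argsort(d, axis=1)[:, :k]         # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)       # pick a minority sample
    pick = nbrs[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation coefficient in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])
```

Because every synthetic point lies on a segment between two real minority samples, the augmented set stays inside the minority class's convex hull, which is what lets the subsequent classifier see a balanced problem without wholly fabricated regions of feature space.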

A Density Peak Clustering Algorithm Based on Information Bottleneck

  • Yongli Liu;Congcong Zhao;Hao Chao
    • Journal of Information Processing Systems
    • /
    • v.19 no.6
    • /
    • pp.778-790
    • /
    • 2023
  • Although density peak clustering can often easily yield excellent results, there is still room for improvement when dealing with complex, high-dimensional datasets. One of the main limitations of the algorithm is its reliance on geometric distance as the sole similarity measure. To address this limitation, we draw inspiration from information bottleneck theory and propose a novel density peak clustering algorithm that incorporates the theory as a similarity measure. Specifically, our algorithm utilizes the joint probability distribution between data objects and feature information and employs the loss of mutual information as the measurement standard. This approach not only eliminates the potential for subjective error in selecting a similarity measure, but also enhances performance on datasets with multiple centers and high dimensionality. To evaluate the effectiveness of our algorithm, we conducted experiments using ten carefully selected datasets and compared the results with three other algorithms. The experimental results demonstrate that our information bottleneck-based density peak clustering (IBDPC) algorithm consistently achieves high levels of accuracy, highlighting its potential as a valuable tool for data clustering tasks.
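For context, the classic density peak procedure that this paper modifies computes, for every point, a local density and the distance to the nearest denser point, then selects points scoring high on both as cluster centers. Below is a rough NumPy sketch of that baseline, with plain geometric distance — i.e. exactly the similarity the paper replaces with an information-bottleneck loss; the cutoff `dc` and the center count are supplied by hand here:

```python
import numpy as np

def density_peaks(X, dc, n_clusters):
    """Classic density-peak clustering (geometric-distance version)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    rho = (d < dc).sum(1) - 1                  # local density (cutoff kernel)
    delta = np.empty(len(X))                   # distance to nearest denser point
    nn_higher = np.empty(len(X), dtype=int)
    order = np.argsort(-rho)
    delta[order[0]] = d[order[0]].max()
    nn_higher[order[0]] = order[0]
    for pos, i in enumerate(order[1:], 1):
        higher = order[:pos]
        j = higher[np.argmin(d[i, higher])]
        delta[i] = d[i, j]
        nn_higher[i] = j
    # Centers: largest rho * delta products.
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                            # assign in decreasing density
        if labels[i] < 0:
            labels[i] = labels[nn_higher[i]]
    return labels, centers
```

The paper's contribution, as the abstract describes it, is to replace the matrix `d` above with a mutual-information loss derived from the joint distribution of objects and features, leaving the peak-selection logic intact.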

An Efficient Content-Based High-Dimensional Index Structure for Image Data

  • Lee, Jang-Sun;Yoo, Jae-Soo;Lee, Seok-Hee;Kim, Myung-Joon
    • ETRI Journal
    • /
    • v.22 no.2
    • /
    • pp.32-42
    • /
    • 2000
  • The existing multi-dimensional index structures are not adequate for indexing high-dimensional data sets. Although they can conceptually be extended to higher dimensionalities, they usually require time and space that grow exponentially with the dimensionality. In this paper, we analyze the existing index structures and derive requirements for an index structure for content-based image retrieval. We also propose a new structure for indexing large amounts of point data in a high-dimensional space that satisfies these requirements. To justify the performance of the proposed structure, we compare it with the existing index structures in various environments. We show, through experiments, that the proposed structure outperforms the existing ones in terms of retrieval time and storage overhead.


Impact of Instance Selection on kNN-Based Text Categorization

  • Barigou, Fatiha
    • Journal of Information Processing Systems
    • /
    • v.14 no.2
    • /
    • pp.418-434
    • /
    • 2018
  • With the increasing use of the Internet and electronic documents, automatic text categorization becomes imperative. Several machine learning algorithms have been proposed for text categorization. The k-nearest neighbor algorithm (kNN) is known to be one of the best state-of-the-art classifiers for text categorization. However, kNN suffers from limitations such as high computational cost when classifying new instances. Instance selection techniques have emerged as highly competitive methods for improving kNN through data reduction. However, previous works have evaluated these approaches only on structured datasets, and their performance has not been examined in the text categorization domain, where the dimensionality and size of the dataset are very high. Motivated by these observations, this paper investigates and analyzes the impact of instance selection on kNN-based text categorization in terms of classification accuracy, classification efficiency, and data reduction.
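As an illustration of what "instance selection" means here, one of the oldest such techniques is Hart's Condensed Nearest Neighbour rule, which retains only the instances a 1-NN classifier needs in order to reproduce all the training labels. The paper evaluates several selection methods; this particular one is sketched purely for brevity:

```python
import numpy as np

def cnn_select(X, y, seed=0):
    """Hart's Condensed Nearest Neighbour: grow a retained set by adding
    every instance the current 1-NN classifier misclassifies, until no
    training instance is misclassified."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    store = [idx[0]]
    changed = True
    while changed:
        changed = False
        for i in idx:
            if i in store:
                continue
            S = X[store]
            j = store[np.argmin(((S - X[i]) ** 2).sum(1))]
            if y[j] != y[i]:                   # misclassified -> retain it
                store.append(i)
                changed = True
    return np.array(store)
```

On easy, well-separated data the retained set is a tiny fraction of the original, which is precisely the classification-time saving the abstract refers to; the open question the paper studies is how much accuracy such reduction costs on very high-dimensional text data.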

Efficient estimation and variable selection for partially linear single-index-coefficient regression models

  • Kim, Young-Ju
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.1
    • /
    • pp.69-78
    • /
    • 2019
  • A structured model with both single-index and varying coefficients is a powerful tool in modeling high dimensional data. It has been widely used because the single-index can overcome the curse of dimensionality and varying coefficients can allow nonlinear interaction effects in the model. For high dimensional index vectors, variable selection becomes an important question in the model building process. In this paper, we propose an efficient estimation and a variable selection method based on a smoothing spline approach in a partially linear single-index-coefficient regression model. We also propose an efficient algorithm for simultaneously estimating the coefficient functions in a data-adaptive lower-dimensional approximation space and selecting significant variables in the index with the adaptive LASSO penalty. The empirical performance of the proposed method is illustrated with simulated and real data examples.
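The adaptive LASSO penalty mentioned in this abstract weights each coefficient's penalty by the inverse of an initial estimate, so genuinely large coefficients are shrunk less and small ones are driven exactly to zero. A minimal coordinate-descent sketch for the plain linear case follows; the paper applies the penalty inside a smoothing-spline single-index model, which is considerably more involved, and the ridge-based initial estimate here is an assumption of this sketch:

```python
import numpy as np

def adaptive_lasso(X, y, lam, n_iter=200):
    """Adaptive LASSO via coordinate descent: per-coefficient penalty
    weights are the inverse of an initial (ridge) estimate."""
    n, p = X.shape
    beta_init = np.linalg.solve(X.T @ X + 1e-3 * np.eye(p), X.T @ y)
    w = 1.0 / (np.abs(beta_init) + 1e-8)       # adaptive weights
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            t = lam * n * w[j]                     # weighted soft-threshold
            beta[j] = np.sign(z) * max(abs(z) - t, 0.0) / col_ss[j]
    return beta
```

Because irrelevant variables receive large weights, they are thresholded to exactly zero, which is the oracle-like selection property that motivates using the adaptive (rather than plain) LASSO on the index vector.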

Multifactor Dimensionality Reduction (MDR) Analysis to Detect Single Nucleotide Polymorphisms Associated with a Carcass Trait in a Hanwoo Population

  • Lee, Jea-Young;Kwon, Jae-Chul;Kim, Jong-Joo
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.21 no.6
    • /
    • pp.784-788
    • /
    • 2008
  • Studies to detect genes responsible for economic traits in farm animals have been performed using parametric linear models. A non-parametric, model-free approach, the 'expanded multifactor-dimensionality reduction (MDR) method', which accounts for the high dimensionality of interaction effects between multiple single nucleotide polymorphisms (SNPs), was applied to identify interaction effects of SNPs responsible for carcass traits in a Hanwoo beef cattle population. Data were obtained from the Hanwoo Improvement Center, National Agricultural Cooperative Federation, Korea, and comprised 299 steers from 16 paternal half-sib proven sires that were delivered at the Namwon or Daegwanryong livestock testing stations between spring 2002 and fall 2003. For each steer, at approximately 722 days of age, the Longissimus dorsi muscle area (LMA) was measured after slaughter. Three functional SNPs (19_1, 18_4, 28_2) near the microsatellite marker ILSTS035 on BTA6, around which QTL for meat quality were previously detected, were assessed. Application of the expanded MDR method revealed a best model with an interaction effect between SNPs 19_1 and 28_2, while only one main effect, that of SNP 19_1, was statistically significant for LMA (p<0.01) under a general linear mixed model. Our results suggest that the expanded MDR method better identifies interaction effects between multiple genes related to polygenic traits, and that the method is an alternative to current model choices for finding associations of multiple functional SNPs and/or their interaction effects with economic traits in livestock populations.
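The core two-locus MDR step is easy to state: pool the nine two-SNP genotype cells into "high-risk" and "low-risk" groups by comparing each cell's case/control ratio against the overall ratio, then score the resulting two-class rule. The sketch below handles a binary phenotype only; the expanded MDR used in the paper handles a quantitative trait (LMA), which this sketch does not attempt:

```python
import numpy as np

def mdr_risk_table(snp_a, snp_b, case):
    """Classic two-locus MDR step for a binary phenotype. Genotypes are
    coded 0/1/2; `case` is a 0/1 vector. Returns the 3x3 high-risk mask
    and the classification accuracy of the pooled high/low-risk rule."""
    overall = case.mean() / max(1.0 - case.mean(), 1e-12)
    high = np.zeros((3, 3), dtype=bool)
    for a in range(3):
        for b in range(3):
            m = (snp_a == a) & (snp_b == b)
            if m.any():
                cases = case[m].sum()
                ctrls = m.sum() - cases
                # Cell is high-risk if its case/control ratio beats the overall ratio.
                high[a, b] = cases / max(ctrls, 1e-12) > overall
    pred = high[snp_a, snp_b]
    acc = (pred == case.astype(bool)).mean()
    return high, acc
```

In full MDR this scoring is repeated over all SNP pairs (and higher-order combinations) with cross-validation, and the combination with the best balanced accuracy — here the 19_1 x 28_2 pair — is reported as the best model.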

An Experimental Study on Smoothness Regularized LDA in Hyperspectral Data Classification (하이퍼스펙트럴 데이터 분류에서의 평탄도 LDA 규칙화 기법의 실험적 분석)

  • Park, Lae-Jeong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.20 no.4
    • /
    • pp.534-540
    • /
    • 2010
  • High dimensionality and highly correlated features are the major characteristics of hyperspectral data. Linear projections such as LDA and its variants have been used to extract low-dimensional features from high-dimensional spectral data. Regularization of LDA has been introduced to alleviate the overfitting that often occurs with small training sets and leads to poor generalization performance. Among such methods, smoothness regularized LDA appears effective for feature extraction from hyperspectral data because of its ability to exploit the high correlation between adjacent bands. This paper experimentally studies the performance of the regularized LDA in hyperspectral data classification under varying training data conditions. In addition, a new dual smoothness regularized LDA that makes use of both the spectral-domain and spatial-domain correlations between neighboring pixels is proposed and evaluated.
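One common way to realise a smoothness regularizer for LDA is to add a scaled roughness penalty, built from a second-difference operator over the spectral bands, to the within-class scatter matrix before solving the LDA eigenproblem, so that projection vectors varying wildly between adjacent bands are penalised. The NumPy sketch below follows that interpretation; the paper's exact regulariser and its dual spatial-domain extension may differ in detail:

```python
import numpy as np

def smooth_lda(X, y, alpha):
    """Two-class-or-more LDA with a spectral smoothness regulariser:
    within-class scatter Sw is augmented by alpha * D^T D, where D is the
    second-difference operator across the feature (band) axis."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        dev = Xc - Xc.mean(0)
        Sw += dev.T @ dev
        m = (Xc.mean(0) - mu)[:, None]
        Sb += len(Xc) * (m @ m.T)
    D = np.diff(np.eye(d), n=2, axis=0)        # (d-2) x d second differences
    Sw_reg = Sw + alpha * D.T @ D + 1e-8 * np.eye(d)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / np.linalg.norm(w)
```

As `alpha` grows, the discriminant vector is forced to vary smoothly across neighbouring bands, which is how the method exploits the band-to-band correlation the abstract emphasises while also stabilising the small-sample estimate of Sw.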