• Title/Summary/Keyword: High Dimensionality Data


Cancer-Subtype Classification Based on Gene Expression Data (유전자 발현 데이터를 이용한 암의 유형 분류 기법)

  • Cho Ji-Hoon;Lee Dongkwon;Lee Min-Young;Lee In-Beum
    • Journal of Institute of Control, Robotics and Systems / v.10 no.12 / pp.1172-1180 / 2004
  • Gene expression data, a product of high-throughput technology, have recently come into wide use, and the studies built around them (so-called bioinformatics) now occupy an important position in biological and medical research. The microarray is a revolutionary technology that enables us to monitor several thousand genes simultaneously and thus to gain insight into phenomena in the human body (e.g., the mechanism of cancer progression) at the molecular level. To obtain useful information from such gene expression measurements, it is essential to analyze the data with appropriate techniques. However, the high dimensionality of the data can bring about problems such as the curse of dimensionality and singularity in matrix computation, and hence makes it difficult to apply conventional data analysis methods. Therefore, developing methods that can treat such data effectively has become a challenging issue in computational biology. This research focuses on gene selection and classification for cancer-subtype discrimination based on gene expression (microarray) data.
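
The abstract does not detail the paper's specific gene-selection or classification algorithms, so the following is only a minimal sketch of a pipeline of that general shape: univariate gene selection followed by a linear classifier, run on synthetic stand-in data rather than real microarray measurements.

```python
# Hypothetical gene-selection + subtype-classification pipeline (not the
# paper's method): an ANOVA F-test picks discriminative "genes", then a
# linear SVM classifies subtypes. Data are synthetic: 100 samples x 5,000
# features, mimicking the p >> n shape of microarray studies.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=50, n_classes=3, random_state=0)

# Selecting a small gene subset first sidesteps the curse-of-dimensionality
# and matrix-singularity issues the abstract mentions.
clf = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC())
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```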

Detection of Differentially Expressed Genes by Clustering Genes Using Class-Wise Averaged Data in Microarray Data

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods / v.14 no.3 / pp.687-698 / 2007
  • A normal mixture model that incorporates dependence between classes is proposed to detect differentially expressed genes. Gene clustering approaches suffer from the high-dimensional columns of the microarray expression data matrix, which lead to over-fitting, and various methods have been proposed to address this. In this paper, simple averaging of the data within each class is proposed to overcome the problems caused by high dimensionality when fitting the normal mixture model. Experiments on simulated and real data sets demonstrate its practical usefulness.
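
As a rough illustration of the class-wise averaging idea, under assumed dimensions (2,000 genes, two classes of 10 arrays each, neither taken from the paper), the sketch below collapses replicate arrays to per-class means and fits a two-component normal mixture over genes:

```python
# Sketch of class-wise averaging before fitting a normal mixture over genes.
# All sizes and the two-component choice are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_genes = 2000
control = rng.normal(0.0, 1.0, size=(n_genes, 10))  # 10 control arrays
treated = rng.normal(0.0, 1.0, size=(n_genes, 10))  # 10 treated arrays
treated[:200] += 2.0                                 # 200 DE genes

# Class-wise averaging: 20 columns collapse to 2, easing high dimensionality.
X = np.column_stack([control.mean(axis=1), treated.mean(axis=1)])

# Two-component mixture: non-DE genes vs. differentially expressed genes.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(np.bincount(gm.predict(X)))
```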

Effective Dimensionality Reduction of Payload-Based Anomaly Detection in TMAD Model for HTTP Payload

  • Kakavand, Mohsen;Mustapha, Norwati;Mustapha, Aida;Abdullah, Mohd Taufik
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.8 / pp.3884-3910 / 2016
  • Intrusion Detection Systems (IDSs) generally process large amounts of data that are highly redundant and irrelevant. This trait causes slow training and assessment procedures, high resource consumption, and poor detection rates. Due to their expensive computational requirements during both training and detection, IDSs are mostly ineffective for real-time anomaly detection. This paper proposes a dimensionality reduction technique that is able to enhance the performance of IDSs up to constant time O(1), based on Principal Component Analysis (PCA). Furthermore, the present study offers a feature selection approach for identifying major components in real time. The PCA algorithm transforms high-dimensional feature vectors into a low-dimensional feature space, which is used to determine the optimal number of components. The proposed approach was assessed using the HTTP packet payloads of the ISCX 2012 IDS and DARPA 1999 datasets. The experimental outcome demonstrated that our proposed anomaly detection achieved promising results, with a 97% detection rate at a 1.2% false positive rate for the ISCX 2012 dataset and a 100% detection rate at a 0.06% false positive rate for the DARPA 1999 dataset. Our proposed anomaly detection also achieved comparable performance in terms of computational complexity when compared to three state-of-the-art anomaly detection systems.
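
The TMAD model's concrete payload features and decision rule are not given in this abstract; the sketch below illustrates only the generic PCA step it describes, projecting payload feature vectors onto a low-dimensional subspace learned from normal traffic and flagging anomalies by reconstruction error (an assumed scoring rule, not necessarily the paper's).

```python
# PCA dimensionality reduction for payload feature vectors, with anomalies
# scored by reconstruction error. Features and thresholds are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 256))         # e.g. byte-frequency features
attack = rng.normal(loc=3.0, size=(20, 256))  # shifted "anomalous" payloads

pca = PCA(n_components=10).fit(normal)        # subspace of normal traffic

def recon_error(X):
    # Distance between a payload vector and its low-dimensional reconstruction.
    return np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)

threshold = np.percentile(recon_error(normal), 99)
print((recon_error(attack) > threshold).mean())  # fraction of attacks flagged
```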

A Study on the Application of Natural Language Processing in Health Care Big Data: Focusing on Word Embedding Methods (보건의료 빅데이터에서의 자연어처리기법 적용방안 연구: 단어임베딩 방법을 중심으로)

  • Kim, Hansang;Chung, Yeojin
    • Health Policy and Management / v.30 no.1 / pp.15-25 / 2020
  • While healthcare data sets include extensive information about patients, many researchers face limitations in analyzing them due to intrinsic characteristics such as heterogeneity, longitudinal irregularity, and noise. In particular, since the majority of medical history information is recorded as text codes, the use of such information has been limited by the high dimensionality of the explanatory variables. To address this problem, recent studies have applied word embedding techniques, originally developed for natural language processing, and obtained positive results in terms of dimensionality reduction and prediction accuracy. This paper reviews deep learning-based natural language processing techniques (word embedding) and summarizes research cases that have used those techniques in the health care field. We then propose a research framework for applying deep learning-based natural language processing to the analysis of domestic health insurance data.
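
A minimal sketch of the word-embedding idea for medical codes, assuming gensim's Word2Vec and made-up placeholder diagnosis codes (not the paper's data): each patient's visit history is treated as a "sentence" of codes, so sparse high-dimensional one-hot code indicators become short dense vectors.

```python
# Embedding medical history codes with word2vec; the ICD-style codes below
# are hypothetical placeholders for illustration only.
from gensim.models import Word2Vec

patient_histories = [
    ["E11", "I10", "E78"],           # e.g. diabetes, hypertension, dyslipidemia
    ["I10", "E78", "I25"],
    ["J45", "J30", "J45"],
    ["E11", "E78", "I10", "I25"],
]

# 16-dimensional dense vectors replace high-dimensional one-hot indicators.
model = Word2Vec(sentences=patient_histories, vector_size=16,
                 window=5, min_count=1, sg=1, seed=0)
print(model.wv.most_similar("I10", topn=2))
```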

The Impact of the PCA Dimensionality Reduction for CNN based Hyperspectral Image Classification (CNN 기반 초분광 영상 분류를 위한 PCA 차원축소의 영향 분석)

  • Kwak, Taehong;Song, Ahram;Kim, Yongil
    • Korean Journal of Remote Sensing / v.35 no.6_1 / pp.959-971 / 2019
  • The CNN (Convolutional Neural Network) is a representative deep learning algorithm that can extract high-level spatial and spectral features, and it has been applied to hyperspectral image classification. However, one significant drawback of applying CNNs to hyperspectral images is the high dimensionality of the data, which increases training time and processing complexity. To address this problem, several CNN-based hyperspectral image classification studies have exploited PCA (Principal Component Analysis) for dimensionality reduction. One limitation is that the spectral information of the original image can be lost through PCA. Although it is clear that the use of PCA affects accuracy and CNN training time, its impact on CNN-based hyperspectral image classification has been understudied. The purpose of this study is to analyze the quantitative effect of PCA in CNNs for hyperspectral image classification. The hyperspectral images were first transformed through PCA and fed into the CNN model while varying the size of the reduced dimensionality. In addition, 2D-CNN and 3D-CNN frameworks were applied to analyze the sensitivity of the PCA with respect to the convolution kernel in the model. Experimental results were evaluated based on classification accuracy, learning time, variance ratio, and training process. The reduced dimensionality was most efficient when the explained variance ratio was 99.7%-99.8%. Since, relative to the 2D kernel, the 3D kernel achieved higher classification accuracy with the original CNN than with the PCA-CNN, the results revealed that dimensionality reduction was relatively less effective for the 3D kernel.
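
The PCA step described here is easy to sketch. The code below (synthetic cube, illustrative sizes) keeps the smallest number of principal components whose cumulative explained variance reaches 99.7%, the lower end of the range the study found most efficient, and reshapes the result as CNN input channels:

```python
# PCA over the spectral axis of a hyperspectral cube, with the component
# count chosen by an explained-variance target. Cube dimensions are made up.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
H, W, B = 64, 64, 200                     # height, width, spectral bands
cube = rng.normal(size=(H, W, B))

pca = PCA().fit(cube.reshape(-1, B))      # fit on per-pixel spectra
cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.997)) + 1  # components reaching 99.7%

reduced = PCA(n_components=k).fit_transform(cube.reshape(-1, B))
cnn_input = reduced.reshape(H, W, k)      # k channels feed the 2D/3D CNN
print(k, cnn_input.shape)
```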

A Cell-based Clustering Method for Large High-dimensional Data in Data Mining (데이타마이닝에서 고차원 대용량 데이타를 위한 셀-기반 클러스터 링 방법)

  • Jin, Du-Seok;Chang, Jae-Woo
    • Journal of KIISE:Databases / v.28 no.4 / pp.558-567 / 2001
  • Recently, data mining applications have come to require large amounts of high-dimensional data. Most algorithms for data mining applications, however, do not work efficiently on large high-dimensional data because of the so-called curse of dimensionality [1] and the limitation of available memory. To overcome these problems, this paper proposes a new cell-based clustering method that is more efficient than existing algorithms for large high-dimensional data. Our clustering method provides a cell construction algorithm for dealing with large high-dimensional data and an index structure based on filtering. We compare the performance of our cell-based clustering method with the CLIQUE method in terms of clustering time, precision, and retrieval time. Our experimental results show that the cell-based clustering method outperforms the CLIQUE method.
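
The paper's cell construction algorithm and filtering-based index are not specified in this abstract; the sketch below shows only the basic cell-based idea under assumed parameters: hash points into fixed-width grid cells and keep cells whose point count passes a density threshold.

```python
# Grid/cell-based clustering sketch: dense cells approximate clusters,
# sparse cells are filtered as noise. Cell width and threshold are arbitrary.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(500, 4)),
               rng.normal(3, 0.3, size=(500, 4)),
               rng.uniform(-2, 5, size=(100, 4))])  # two clusters + noise

cell_width = 1.0
cells = Counter(map(tuple, np.floor(X / cell_width).astype(int)))

dense = {c for c, n in cells.items() if n >= 50}    # density threshold
print(len(cells), "cells,", len(dense), "dense")
```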

Comparative Analysis of Dimensionality Reduction Techniques for Advanced Ransomware Detection with Machine Learning (기계학습 기반 랜섬웨어 공격 탐지를 위한 효과적인 특성 추출기법 비교분석)

  • Kim Han Seok;Lee Soo Jin
    • Convergence Security Journal / v.23 no.1 / pp.117-123 / 2023
  • To detect advanced ransomware attacks with machine learning-based models, the classification model must be trained on data with a high-dimensional feature space, in which case the 'curse of dimensionality' phenomenon is likely to occur. Therefore, dimensionality reduction of features must precede training in order to increase the accuracy of the learning model and improve execution speed while avoiding the 'curse of dimensionality'. In this paper, we classified ransomware by applying three machine learning models and two feature extraction techniques to two datasets with extremely different feature-space dimensions. The experiments showed that the feature dimensionality reduction techniques did not significantly improve performance in binary classification, and the same held in multi-class classification when the feature-space dimension was small. However, when the dataset had a high-dimensional feature space, LDA (Linear Discriminant Analysis) showed quite excellent performance.
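
A toy-scale sketch of the paper's comparison setup, with illustrative models and dimensions rather than the study's ransomware datasets: the same classifier is trained behind PCA and LDA feature extractors on high-dimensional multi-class data.

```python
# Comparing PCA and LDA as feature extractors before a classifier.
# Dataset, model, and dimensions are stand-ins, not the paper's.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=1000,
                           n_informative=40, n_classes=5, random_state=0)

# LDA yields at most n_classes - 1 components, so both use 4 here.
for name, reducer in [("PCA", PCA(n_components=4)),
                      ("LDA", LinearDiscriminantAnalysis(n_components=4))]:
    clf = make_pipeline(reducer, RandomForestClassifier(random_state=0))
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```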

Using noise filtering and sufficient dimension reduction method on unstructured economic data (노이즈 필터링과 충분차원축소를 이용한 비정형 경제 데이터 활용에 대한 연구)

  • Jae Keun Yoo;Yujin Park;Beomseok Seo
    • The Korean Journal of Applied Statistics / v.37 no.2 / pp.119-138 / 2024
  • Text indicators are increasingly valuable in economic forecasting but are often hindered by noise and high dimensionality. This study explores post-processing techniques, specifically noise filtering and dimensionality reduction, to normalize text indicators and enhance their utility through empirical analysis. Predictive target variables for the empirical analysis include monthly leading index cyclical variations, BSI (business survey index) all-industry sales performance, and BSI all-industry sales outlook, as well as the quarterly real GDP SA (seasonally adjusted) growth rate and real GDP YoY (year-on-year) growth rate. This study applies the Hodrick-Prescott filter, which is widely used in econometrics for noise filtering, and employs sufficient dimension reduction, a nonparametric dimensionality reduction methodology, in conjunction with the unstructured text data. The analysis reveals that noise filtering of text indicators significantly improves predictive accuracy for both monthly and quarterly variables, particularly when the dataset is large. Moreover, applying dimensionality reduction further enhances predictive performance. These findings imply that post-processing techniques such as noise filtering and dimensionality reduction are crucial for enhancing the utility of text indicators and can contribute to improving the accuracy of economic forecasts.
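
The noise-filtering step is straightforward to illustrate with statsmodels' Hodrick-Prescott filter. In the sketch below, the monthly smoothing parameter lamb=129600 is the conventional choice, assumed here rather than taken from the paper, and the text indicator is simulated.

```python
# Hodrick-Prescott filtering of a simulated monthly text indicator:
# the trend component is the denoised series used for forecasting.
import numpy as np
from statsmodels.tsa.filters.hp_filter import hpfilter

rng = np.random.default_rng(0)
t = np.arange(120)                               # 10 years of monthly values
indicator = np.sin(t / 12) + rng.normal(0, 0.3, size=t.size)

cycle, trend = hpfilter(indicator, lamb=129600)  # monthly smoothing parameter
print(trend[:3])                                 # smoothed indicator values
```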

On Combining Genetic Algorithm (GA) and Wavelet for High Dimensional Data Reduction

  • Liu, Zhengjun;Wang, Changyao;Zhang, Jixian;Yan, Qin
    • Proceedings of the KSRS Conference / 2003.11a / pp.1272-1274 / 2003
  • In this paper, we present a new algorithm for high-dimensional data reduction based on wavelet decomposition and a Genetic Algorithm (GA). Comparative results show the superiority of our algorithm for dimensionality reduction and accuracy improvement.
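
A sketch of the wavelet half of the combination, using PyWavelets on simulated spectra; the GA search over coefficient subsets is omitted for brevity. A single-level discrete wavelet transform roughly halves each sample's dimension while preserving its coarse shape.

```python
# Single-level wavelet decomposition as a dimensionality reducer; a GA
# would then select among the resulting coefficients (not shown here).
import numpy as np
import pywt

rng = np.random.default_rng(0)
spectra = rng.normal(size=(10, 256))  # 10 samples, 256 spectral bands

# Approximation coefficients form a compressed feature set (~half the bands).
approx, detail = pywt.dwt(spectra, "db2", axis=1)
print(spectra.shape, "->", approx.shape)
```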

Multifactor-Dimensionality Reduction in the Presence of Missing Observations

  • Chung, Yu-Jin;Lee, Seung-Yeoun;Park, Tae-Sung
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.31-36 / 2005
  • Identifying and characterizing susceptibility genes for common complex multifactorial diseases is a challenging task, in which the effect of a single genetic variation is likely to depend on other genetic variations (gene-gene interaction) and environmental factors (gene-environment interaction). To address this issue, multifactor dimensionality reduction (MDR) has been proposed and implemented by Ritchie et al. (2001), Moore et al. (2002), Hahn et al. (2003), and Ritchie et al. (2003). With MDR, multilocus genotypes effectively reduce the dimension of genotype predictors from n to one, which improves the identification of polymorphism combinations associated with disease risk. However, MDR cannot handle missing observations appropriately; a missing observation is treated as an additional genotype category. This approach may suffer from sparseness since, when high-order interactions are considered, the additional missing category makes the contingency table cells even sparser. We propose a new MDR approach that minimizes the loss of sample size by considering missing data over all possible multifactor classes. We evaluate the proposed MDR using prediction errors and cross-validation consistency.
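
A minimal sketch of the core MDR step on simulated data (the proposed missing-data handling is not reproduced here): the nine genotype cells of a SNP pair are collapsed into one binary high-/low-risk predictor by comparing each cell's case fraction with the overall prevalence.

```python
# Core MDR reduction for one SNP pair: 9 genotype cells -> 1 binary
# risk factor. Genotypes (0/1/2) and case status are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 400
snp1, snp2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
case = (rng.random(n) < 0.3 + 0.3 * ((snp1 == 2) & (snp2 == 2))).astype(int)

ratio = {}
for g in {(a, b) for a, b in zip(snp1, snp2)}:
    mask = (snp1 == g[0]) & (snp2 == g[1])
    ratio[g] = case[mask].mean()              # case fraction in this cell

# Cells above the overall prevalence are labeled high risk.
overall = case.mean()
high_risk = np.array([ratio[(a, b)] > overall for a, b in zip(snp1, snp2)])
print(f"high-risk accuracy: {(high_risk == case.astype(bool)).mean():.2f}")
```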
