• Title/Abstract/Keywords: data reduction

6,219 search results (processing time 0.042 s)

Performance evaluation of principal component analysis for clustering problems

  • Kim, Jae-Hwan;Yang, Tae-Min;Kim, Jung-Tae
    • Journal of Advanced Marine Engineering and Technology
    • /
    • Vol. 40, No. 8
    • /
    • pp.726-732
    • /
    • 2016
  • Clustering analysis is widely used in data mining to classify data into categories on the basis of their similarity. Over the decades, many clustering techniques have been developed, including hierarchical and non-hierarchical algorithms. In gene profiling problems, because of the large number of genes and the complexity of biological networks, dimensionality reduction techniques are critical exploratory tools for clustering analysis of gene expression data. Recently, clustering analysis that applies dimensionality reduction techniques has also been proposed. PCA (principal component analysis) is a popular dimensionality reduction method for clustering problems. However, previous studies analyzed the performance of PCA only on full data sets. In this paper, to specifically and robustly evaluate the performance of PCA for clustering analysis, we use an improved FCBF (fast correlation-based filter), a feature selection method, on supervised clustering data sets, and employ two well-known clustering algorithms: k-means and k-medoids. Computational results from supervised data sets show that the performance of PCA is very poor for large-scale features.
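
As a rough illustration of the PCA step this abstract evaluates, the sketch below finds the first principal component by power iteration and projects the data onto it. It is not the authors' procedure: the function names, the fixed start vector, and the restriction to a single component are assumptions made for brevity, and the FCBF, k-means, and k-medoids stages are not shown.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of largest variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector; assumed not orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in this direction
            break
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Reduce each row to one score along the direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    means, direction = first_principal_component(data)
    print(direction, project(data, means, direction))
```

The one-dimensional scores returned by `project` are what a clustering algorithm would then receive in place of the original features.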

Development of Real-Time Data Reduction Pipeline for KMTNet

  • 김동진;이충욱;김승리;박병곤
    • 천문학논총
    • /
    • Vol. 28, No. 1
    • /
    • pp.1-6
    • /
    • 2013
  • A real-time data reduction pipeline for the Korea Microlensing Telescope Network (KMTNet) was developed by the Korea Astronomy and Space Science Institute (KASI). The main goal of the pipeline is to find variable objects and to record their light variation from the large volume of observation data, about 200 GB per night per site. To achieve this goal we adopt three strategies: precision pointing of the telescope using cross-correlation correction for target fields, real-time data transfer using kernel-level file handling and a high-speed network, and a segmented data processing architecture using the Sun-Grid engine. We tested the performance of the pipeline using simulated data that represent conditions similar to those at CTIO (Cerro Tololo Inter-American Observatory), and found that processing one night of data takes about eight hours. We therefore conclude that the pipeline works in real time without problems if the network speed is high enough, e.g., as high as at CTIO.

The Improvement in Signal Integrity of FT-ICR MS

  • 김승용;김석윤;김현식
    • 전기학회논문지
    • /
    • Vol. 60, No. 1
    • /
    • pp.201-204
    • /
    • 2011
  • A new algorithm is proposed for efficient noise reduction in a Fourier transform ion cyclotron resonance (FT-ICR) mass spectrum. The algorithm reduces white and electrical noise and improves the signal-to-noise ratio (S/N). It has been optimized to reduce noise more efficiently using the traces of the signal level, and it has been combined with a derivative window to improve the resolution as well as S/N. Time-domain data were corrected for DC voltage interference. A $t^n$ window was applied to the time-domain data to improve the resolution. Although the $t^n$ window can improve the signal resolution, it also increases the noise level in the frequency domain. Therefore, the newly developed noise reduction algorithm is applied to balance resolving power and S/N for magnitude mode. The trace algorithm determines the current data point from several data points (mean, past data, calculated past data). In the current calculations, data points with an S/N greater than 3 were treated as signal data points. After windowing and noise reduction, both resolution and S/N were improved. The algorithm is particularly effective for frequency-dependent noise and large data sets.
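
The two concrete operations named in this abstract, the $t^n$ window and the S/N > 3 cutoff, can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the noise level is taken as a given number rather than estimated, and the paper's trace-based noise reduction itself is not reproduced.

```python
def apply_tn_window(transient, dt, n):
    """Multiply sample k of a time-domain transient by (k * dt) ** n."""
    return [s * (k * dt) ** n for k, s in enumerate(transient)]

def signal_points(magnitudes, noise_level, min_snr=3.0):
    """Flag spectrum points whose S/N is strictly greater than min_snr.

    noise_level is assumed to be known; how to estimate it is not
    specified in the abstract.
    """
    return [m / noise_level > min_snr for m in magnitudes]

if __name__ == "__main__":
    print(apply_tn_window([1.0, 1.0, 1.0], dt=0.5, n=2))
    print(signal_points([1.0, 3.0, 9.0], noise_level=1.0))
```

Because the window weights late samples more heavily, it also amplifies noise in the later part of the transient, which is why the abstract pairs it with a noise reduction step.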

Incremental Linear Discriminant Analysis for Streaming Data Using the Minimum Squared Error Solution

  • 이경훈;박정희
    • 정보과학회 논문지
    • /
    • Vol. 45, No. 1
    • /
    • pp.69-75
    • /
    • 2018
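  • For streaming data that arrive sequentially over time, it is difficult to apply dimension reduction techniques based on batch learning, which use the entire data set at once. Incremental dimension reduction methods have therefore been studied for streaming data. This paper proposes an incremental linear discriminant analysis method based on the minimum squared error solution. The proposed method does not compute the scatter matrices directly; instead, it incrementally updates the projection directions for dimension reduction using information from newly arriving samples. Experimental results show that the proposed method is more effective than previously proposed incremental dimension reduction algorithms.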
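
The abstract does not give the update equations, so the following is not the authors' algorithm. It is a sketch of recursive least squares (RLS), a standard way to update a regularized minimum squared error solution one sample at a time without rebuilding any matrix from the full data set. The class name, the `reg` parameter, and the two-class ±1 target encoding are assumptions for illustration.

```python
class RecursiveLeastSquares:
    """Per-sample update of w minimizing sum (y - w.x)^2 + reg * |w|^2."""

    def __init__(self, dim, reg=1e-3):
        self.dim = dim
        self.w = [0.0] * dim
        # P tracks the inverse of (X^T X + reg * I); it starts at I / reg.
        self.P = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                  for i in range(dim)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        d = self.dim
        Px = [sum(self.P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)  # error under the weights before this sample
        self.w = [self.w[i] + gain[i] * err for i in range(d)]
        # Rank-one downdate of P (P is symmetric, so x^T P equals Px).
        for i in range(d):
            for j in range(d):
                self.P[i][j] -= gain[i] * Px[j]

if __name__ == "__main__":
    model = RecursiveLeastSquares(dim=2)
    # Each sample is [feature, 1.0]; the constant 1.0 acts as a bias term.
    model.update([-1.0, 1.0], -1.0)  # class A encoded as -1
    model.update([1.0, 1.0], 1.0)    # class B encoded as +1
    print(model.w)
```

For a two-class problem, the sign of `predict(x)` gives the class and `w` plays the role of a projection direction; each update costs O(d²) regardless of how many samples have already been seen.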

A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

  • Aydadenta, Husna;Adiwijaya, Adiwijaya
    • Journal of Information Processing Systems
    • /
    • Vol. 14, No. 5
    • /
    • pp.1167-1175
    • /
    • 2018
  • Microarray data play an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of gene expression levels in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray data have very few samples and high dimensionality. Therefore, to classify microarray data, a dimensionality reduction process is required. Dimensionality reduction can eliminate redundancy in the data, so that the features used in classification are only those that have a high correlation with their class. There are two types of dimensionality reduction, namely feature selection and feature extraction. In this paper, we use the k-means algorithm as the clustering approach for feature selection. The proposed approach groups features that have the same characteristics into one cluster, so that redundancy in microarray data is removed. The features in each cluster are ranked using the Relief algorithm, so that the best-scoring element of each cluster is obtained. The best elements of all clusters are selected and used as features in the classification process, which is then performed with the Random Forest algorithm. Based on the simulation, the proposed approach achieved accuracies of 85.87%, 98.9%, and 89% on the Colon, Lung Cancer, and Prostate Tumor datasets, respectively. The accuracy of the proposed approach is therefore higher than that of Random Forest without clustering.

A Semi-supervised Dimension Reduction Method Using Ensemble Approach

  • 박정희
    • 정보처리학회논문지D
    • /
    • Vol. 19D, No. 2
    • /
    • pp.147-150
    • /
    • 2012
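  • Linear discriminant analysis (LDA), a supervised dimension reduction method that finds projection directions maximizing the distance between classes, tends to degrade sharply in performance when very few data with class information are available. In such cases, semi-supervised dimension reduction methods can be used, which exploit data without class labels that can be obtained at relatively low cost. However, the matrix operations commonly used in statistical dimension reduction methods run into memory and processing-time limits when large amounts of data are used, and using far too many unlabeled data relative to a small number of labeled data can actually reduce performance despite the increase in processing time. To overcome these problems, we propose a semi-supervised dimension reduction method using an ensemble approach. Experimental results on document classification problems demonstrate the performance of the proposed method.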

DR-LSTM: Dimension reduction based deep learning approach to predict stock price

  • Ah-ram Lee;Jae Youn Ahn;Ji Eun Choi;Kyongwon Kim
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 31, No. 2
    • /
    • pp.213-234
    • /
    • 2024
  • In recent decades, increasing research attention has been directed toward predicting stock prices in financial markets using deep learning methods. For instance, the recurrent neural network (RNN) is known to be competitive for time-series data. Long short-term memory (LSTM) further improves on the RNN by providing an alternative approach to the gradient loss problem. LSTM has its own advantage in predictive accuracy because it retains memory for a longer time. In this paper, we combine both supervised and unsupervised dimension reduction methods with LSTM to enhance forecasting performance, and refer to this as a dimension reduction based LSTM (DR-LSTM) approach. As supervised dimension reduction methods, we use sliced inverse regression (SIR), sparse SIR, and kernel SIR. As unsupervised dimension reduction methods, we use principal component analysis (PCA), sparse PCA, and kernel PCA. Using real stock market index datasets (S&P 500, STOXX Europe 600, and KOSPI), we present a comparative study of predictive accuracy between the six DR-LSTM methods and time series modeling.

Development of Acoustic Emission Monitoring System for Fault Detection of Thermal Reduction Reactor

  • Pakk, Gee-Young;Yoon, Ji-Sup;Park, Byung-Suk;Hong, Dong-Hee;Kim, Young-Hwan
    • Nuclear Engineering and Technology
    • /
    • Vol. 35, No. 1
    • /
    • pp.25-34
    • /
    • 2003
  • Preliminary research on the development of a fault monitoring system for the thermal reduction reactor has been performed in order to support successful operation of the reactor. The final task in developing the fault monitoring system is to assure the integrity of the thermal reduction reactor by the acoustic emission (AE) method. The objectives of this paper are to identify and characterize the fault-induced signals in order to discriminate among the various AE signals acquired during reactor operation. An AE data acquisition and analysis system was constructed and applied to fault monitoring of the small-scale reduction reactor. Through a series of experiments, various signals such as background noise, operating signals, and fault-induced signals were measured and their characteristics were identified; these will be used in signal discrimination for further application to the full-scale thermal reduction reactor.

Development of Noise Prediction Program in Construction Sites

  • 김하근;주시웅
    • 한국소음진동공학회:학술대회논문집
    • /
    • 한국소음진동공학회 2007 Spring Conference Proceedings
    • /
    • pp.1157-1161
    • /
    • 2007
  • Among the types of pollution, construction noise is the main cause of public complaints. The purpose of this study is to develop a noise prediction program that estimates the noise level at a construction site more accurately. For this purpose, a database of the sound power levels of various equipment was built. Noise reduction by distance and noise reduction by barrier diffraction were the main effects considered and calculated. From the construction planning stage, this simple noise prediction program provides information about the proper height and length of a portable barrier that satisfies the noise criteria of the construction site. To investigate the reliability of the program, the predicted data were compared with measured data. The average difference between measured and predicted data is 1.3 dB(A), and the correlation coefficient is about 0.95.
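
The abstract does not state the program's formulas. As a sketch of the "noise reduction by distance" part only, the code below uses the textbook relation for a point source on a reflecting ground plane (hemispherical spreading), L_p = L_w - 20·log10(r) - 8 dB, and the energetic sum for combining several sources. The function names and the point-source assumption are illustrative; barrier diffraction is not modeled here.

```python
import math

def level_at_distance(power_level_db, distance_m):
    """Sound pressure level at distance_m from a point source.

    Assumes hemispherical spreading over a reflecting ground plane.
    """
    return power_level_db - 20.0 * math.log10(distance_m) - 8.0

def combine_levels(levels_db):
    """Energetic (incoherent) sum of several sound levels in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))

if __name__ == "__main__":
    # Hypothetical machine with a 100 dB sound power level.
    print(level_at_distance(100.0, 10.0))
    print(level_at_distance(100.0, 20.0))
    print(combine_levels([70.0, 70.0]))
```

Under this model each doubling of distance lowers the level by about 6 dB, and two equal sources together are about 3 dB louder than one.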

Low Cost SOC (System-On-a-Chip) Testing Method for Reduction of Test Data and Power Dissipation

  • 허용민;인치호
    • 대한전자공학회논문지SD
    • /
    • Vol. 41, No. 12
    • /
    • pp.83-90
    • /
    • 2004
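  • This paper proposes an efficient scan test method for compressing test data and reducing power dissipation in SOCs. The proposed method analyzes deterministic test data and their output responses to determine whether part of an output response can be reused as the test data to be applied next. Experimental results show a high degree of similarity between uncompressed deterministic input test data and their responses. On the ISCAS'89 benchmark circuits, the proposed method achieves, on average, a 29.4% reduction in power dissipation and 69.7% test data compression, measured in terms of the required clock time.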