• Title/Summary/Keyword: High-Dimensional Data


A Study on Selecting Principle Component Variables Using Adaptive Correlation (적응적 상관도를 이용한 주성분 변수 선정에 관한 연구)

  • Ko, Myung-Sook
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.3
    • /
    • pp.79-84
    • /
    • 2021
  • A feature extraction method that reflects features well while maintaining the properties of the data is required in order to process high-dimensional data. Principal component analysis (PCA), which converts high-dimensional data into low-dimensional data and expresses it with fewer variables than the original, is a representative method for feature extraction. In this study, we propose a PCA method that selects principal component variables based on adaptive correlation for feature extraction from high-dimensional data. The proposed method analyzes the principal components of the data by adaptively reflecting the correlation among the input variables, and excludes highly correlated, duplicated variables from the candidate list. It analyzes the principal component hierarchy by eigenvector coefficient values, prevents the selection of principal components with a low hierarchy, and minimizes the data duplication that induces data bias through correlation analysis. In this way, we propose a method of selecting principal component variables that represent the characteristics of the actual data well by reducing the influence of data bias.
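The combination described in this abstract, screening out highly correlated variables and then applying PCA, can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm; the `0.95` threshold and the greedy screening order are assumptions.

```python
import numpy as np

def select_low_correlation_vars(X, threshold=0.95):
    """Greedily drop one variable of each highly correlated pair.
    Returns the indices of the variables kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        # keep j only if it is not strongly correlated with an already-kept variable
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

def pca_components(X, n_components):
    """Plain PCA via eigen-decomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # largest variance first
    return eigvecs[:, order[:n_components]]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 4))
# append a near-duplicate of column 0 to create redundancy
X = np.hstack([base, base[:, :1] + 0.01 * rng.normal(size=(200, 1))])
kept = select_low_correlation_vars(X)           # the duplicate column is dropped
W = pca_components(X[:, kept], n_components=2)
Z = X[:, kept] @ W                              # reduced representation
```

Removing the near-duplicate column before the eigen-decomposition is what keeps the duplicated variance from biasing the leading components.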

An Efficient Processing of Continuous Range Queries on High-Dimensional Spatial Data (고차원 공간 데이터를 위한 연속 범위 질의의 효율적인 처리)

  • Jang, Su-Min;Yoo, Jae-Soo
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.13 no.6
    • /
    • pp.397-401
    • /
    • 2007
  • Applications involving continuous queries on moving objects are rapidly expanding into various areas. These applications require not only 2-dimensional but also high-dimensional spatial data. If previous indexes are used for overlapping continuous range queries on high-dimensional spatial data, performance degrades significantly as the number of continuous range queries over a large number of moving objects grows. We focus on stationary queries, non-exponential growth of storage cost, and efficient processing time for large data sets. In this paper, to solve these problems, we present a novel query indexing method, the PAB (Projected Attribute Bit)-based query index. We transform the information of a high-dimensional continuous range query into one-dimensional bit lists by projecting it onto each axis. The proposed query index also supports incremental updates for efficient query processing. Through various experiments, we show that our method outperforms the CES (containment-encoded squares)-based indexing method, one of the most recent approaches.
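The per-axis projection idea can be sketched in a few lines: each range query becomes one bit list per axis over a fixed grid, and a point matches a query only if its cell bit is set on every axis. This is a toy reconstruction from the abstract, not the paper's data structure; the grid resolution and the `[0, 1)` domain are assumptions.

```python
import numpy as np

class PABIndex:
    """Toy projected-attribute-bit query index: each continuous range query
    is projected onto every axis as a bit list over a fixed grid."""

    def __init__(self, n_dims, n_cells):
        self.n_dims = n_dims
        self.n_cells = n_cells
        self.bits = {}  # query_id -> (n_dims, n_cells) boolean array

    def _cell(self, x):
        # map a coordinate in [0, 1) to a grid cell index
        return min(int(x * self.n_cells), self.n_cells - 1)

    def insert(self, query_id, lows, highs):
        b = np.zeros((self.n_dims, self.n_cells), dtype=bool)
        for d in range(self.n_dims):
            b[d, self._cell(lows[d]):self._cell(highs[d]) + 1] = True
        self.bits[query_id] = b

    def matching_queries(self, point):
        """Queries whose range contains the point on every axis."""
        cells = [self._cell(x) for x in point]
        return [qid for qid, b in self.bits.items()
                if all(b[d, cells[d]] for d in range(self.n_dims))]

idx = PABIndex(n_dims=3, n_cells=10)
idx.insert("q1", lows=[0.1, 0.1, 0.1], highs=[0.5, 0.5, 0.5])
idx.insert("q2", lows=[0.6, 0.6, 0.6], highs=[0.9, 0.9, 0.9])
hits = idx.matching_queries([0.3, 0.3, 0.3])   # only q1 covers this point
```

Storage grows linearly with the number of axes rather than exponentially with the dimensionality, which is the property the abstract emphasizes; updating one query touches only its own bit lists, so incremental updates are cheap.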

Feature Extraction on High Dimensional Data Using Incremental PCA (점진적인 주성분분석기법을 이용한 고차원 자료의 특징 추출)

  • Kim Byung-Joo
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.8 no.7
    • /
    • pp.1475-1479
    • /
    • 2004
  • High-dimensional data requires efficient feature extraction techniques. Although PCA (Principal Component Analysis) is a well-known feature extraction method, it requires a huge amount of memory and its computational cost is high. In this paper we use incremental PCA for feature extraction on high-dimensional data. Through experiments we show that the proposed method is superior to the APEX model.
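The memory argument for incremental PCA can be made concrete with a streaming sketch: update a running mean and covariance one batch at a time, then eigen-decompose on demand. This is a generic incremental-covariance formulation, not the specific algorithm evaluated in the paper.

```python
import numpy as np

class StreamingPCA:
    """Incremental PCA sketch: update mean and covariance batch by batch,
    then eigen-decompose the running covariance on demand. Memory is
    O(d^2) for d features, independent of the number of samples seen."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.M2 = np.zeros((n_features, n_features))  # sum of deviation outer products

    def partial_fit(self, X):
        for x in X:                      # Welford-style one-pass update
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.M2 += np.outer(delta, x - self.mean)

    def components(self, k):
        cov = self.M2 / max(self.n - 1, 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]
        return eigvecs[:, order[:k]]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X[:, 0] *= 10                            # dominant-variance direction along axis 0
spca = StreamingPCA(n_features=5)
for batch in np.array_split(X, 10):      # data arrives in 10 chunks
    spca.partial_fit(batch)
W = spca.components(k=2)                 # leading component aligns with axis 0
```

Only the d-by-d accumulator is kept in memory, so the full n-by-d data matrix never needs to be materialized, which is the advantage over batch PCA that the abstract claims.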

Bayesian baseline-category logit random effects models for longitudinal nominal data

  • Kim, Jiyeong;Lee, Keunbaik
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.2
    • /
    • pp.201-210
    • /
    • 2020
  • Baseline-category logit random effects models have been used to analyze longitudinal nominal data. The models account for subject-specific variations using random effects. However, the random effects covariance matrix in these models needs to capture subject-specific variations as well as serial correlations of the nominal outcomes. To satisfy both, the covariance matrix must be heterogeneous and high-dimensional, yet it is difficult to estimate because of its high dimensionality and the positive-definiteness constraint. In this paper, we exploit the modified Cholesky decomposition to estimate the high-dimensional heterogeneous random effects covariance matrix, and propose a Bayesian methodology to estimate the parameters of interest. The proposed methods are illustrated with real data from the McKinney Homeless Research Project.
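The reason the modified Cholesky decomposition sidesteps the positive-definiteness constraint can be shown directly: a covariance built from a unit lower-triangular matrix of generalized autoregressive parameters and log innovation variances is positive definite for any unconstrained real inputs. A minimal sketch of that construction, with the parameterization details assumed:

```python
import numpy as np

def covariance_from_mcd(phi, log_d):
    """Build a covariance matrix from modified Cholesky parameters:
    phi fills the strict lower triangle of a unit lower-triangular T
    (generalized autoregressive parameters) and log_d holds log
    innovation variances. The result Sigma = T^{-1} D T^{-T} is
    positive definite for ANY real-valued inputs, so a sampler can
    work with unconstrained parameters."""
    q = len(log_d)
    T = np.eye(q)
    T[np.tril_indices(q, k=-1)] = -phi
    D = np.diag(np.exp(log_d))              # innovation variances, always > 0
    Tinv = np.linalg.inv(T)
    return Tinv @ D @ Tinv.T

rng = np.random.default_rng(4)
q = 4
Sigma = covariance_from_mcd(rng.normal(size=q * (q - 1) // 2),
                            rng.normal(size=q))
eigvals = np.linalg.eigvalsh(Sigma)         # all positive by construction
```

Because the map from (phi, log_d) to Sigma is unconstrained, Bayesian estimation can place ordinary priors on these parameters instead of sampling on the cone of positive-definite matrices.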

Effect of outliers on the variable selection by the regularized regression

  • Jeong, Junho;Kim, Choongrak
    • Communications for Statistical Applications and Methods
    • /
    • v.25 no.2
    • /
    • pp.235-243
    • /
    • 2018
  • Many studies exist on the influence of one or a few observations on estimators in a variety of statistical models under the "large n, small p" setup; however, diagnostic issues in regression models have rarely been studied in a high-dimensional setup. In high-dimensional data, the influence of observations is more serious because the sample size n is significantly less than the number of variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 58, 267-288, 1996), and on the variables selected by the LASSO in the high-dimensional setup. We also derive an analytic expression for the influence of the k-th observation on LASSO estimates in simple linear regression. Numerical studies based on artificial and real data are given for illustration. The numerical results show that the influence of observations on the LASSO estimates and on the selected variables is more severe in the high-dimensional setup than in the usual "large n, small p" setup.
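The case-deletion diagnostic the abstract studies can be sketched numerically: fit the LASSO on the full data, refit with one observation removed, and compare the two active sets. The coordinate-descent solver and the penalty value below are illustrative assumptions, not the paper's analytic expression.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent:
    minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # partial residual
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 50, 100                             # high-dimensional: p > n
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=n)

beta_full = lasso_cd(X, y, lam=0.2)
selected_full = set(np.flatnonzero(beta_full))

# case-deletion diagnostic: drop observation k, refit, compare active sets
k = 0
beta_drop = lasso_cd(np.delete(X, k, axis=0), np.delete(y, k), lam=0.2)
selected_drop = set(np.flatnonzero(beta_drop))
changed = selected_full ^ selected_drop     # variables whose selection flipped
```

The size of `changed` across all k is one simple way to see, as the paper argues, that single observations can flip the selected variable set far more readily when p exceeds n.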

Three-Dimensional Borehole Radar Modeling (3차원 시추공 레이다 모델링)

  • 예병주
    • Economic and Environmental Geology
    • /
    • v.33 no.1
    • /
    • pp.41-50
    • /
    • 2000
  • Geo-radar surveying, which has the advantages of high resolution and relatively fast acquisition, has been widely used for engineering and environmental problems. Three-dimensional effects have to be considered when interpreting geo-radar data at high resolution, but analyzing these effects is difficult, so an efficient three-dimensional numerical modeling algorithm is needed. Numerical radar modeling in three dimensions requires large memory and long computation times. In this paper, a finite-difference time-domain solution to Maxwell's equations for simulating electromagnetic wave propagation in three-dimensional media was developed as an economical algorithm requiring less memory and shorter computation time, with Liao's absorbing boundary condition applied at the boundaries. The numerical result of a cross-hole radar survey over a tunnel is compared with real data, and the two results match well. To demonstrate its applicability to three-dimensional analysis, results were examined for varying incidence angles of the tunnel relative to the survey cross-section, and for the case where the tunnel is parallel to the cross-section. This algorithm is useful in various geo-radar surveys and can provide basic data for developing data-processing and inversion programs.
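The finite-difference time-domain scheme at the core of such modeling can be illustrated in one dimension, where the leapfrog field updates fit in a few lines; the grid size, normalized units, Courant number of 0.5, and Gaussian source are all assumptions for illustration, and the paper's actual 3-D scheme with Liao's absorbing boundary is far more involved.

```python
import numpy as np

# Minimal 1-D FDTD sketch (Yee leapfrog scheme) for Maxwell's equations in
# normalized units. E and H live on staggered grids and are updated in
# alternation; the 3-D analogue stores six field arrays per cell, which is
# why memory becomes the bottleneck the abstract describes.
nz, nt = 200, 250
Ex = np.zeros(nz)
Hy = np.zeros(nz)
for t in range(nt):
    Hy[:-1] += 0.5 * (Ex[1:] - Ex[:-1])         # Courant number 0.5 folded in
    Ex[1:-1] += 0.5 * (Hy[1:-1] - Hy[:-2])
    Ex[100] += np.exp(-((t - 30) ** 2) / 100)   # soft Gaussian source mid-grid
```

With a Courant number below 1 the scheme is stable, so the injected pulse propagates outward without the field amplitudes blowing up.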


A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

  • Aydadenta, Husna;Adiwijaya, Adiwijaya
    • Journal of Information Processing Systems
    • /
    • v.14 no.5
    • /
    • pp.1167-1175
    • /
    • 2018
  • Microarray data plays an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of gene expression levels in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray datasets have very few samples and high dimensionality. Therefore, to classify microarray data, a dimensionality reduction process is required. Dimensionality reduction can eliminate data redundancy, so that the features used in classification are only those with a high correlation with their class. There are two types of dimensionality reduction, namely feature selection and feature extraction. In this paper, we use the k-means algorithm as the clustering approach for feature selection. The proposed approach categorizes features that have the same characteristics into one cluster, so that redundancy in the microarray data is removed. The clustering result is ranked using the Relief algorithm so that the best-scoring element of each cluster is obtained. The best elements of all clusters are selected and used as features in the classification process, for which the Random Forest algorithm is then used. Based on the simulations, the accuracy of the proposed approach on the Colon, Lung Cancer, and Prostate Tumor datasets was 85.87%, 98.9%, and 89%, respectively, higher than the approach using Random Forest without clustering.
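The cluster-then-pick-one pipeline can be sketched as follows. This is a simplified reconstruction: a squared-correlation score stands in for the paper's Relief ranking, the final Random Forest step is omitted, and the cluster count is an arbitrary choice.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means (Lloyd's algorithm); returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def select_features_by_cluster(X, y, k):
    """Cluster the FEATURES (rows of X.T), then keep the single feature per
    cluster that scores highest against the class labels, removing the
    redundancy of features that behave alike."""
    feat = X.T
    # normalize each feature so clustering groups features with similar shape
    feat = (feat - feat.mean(1, keepdims=True)) / (feat.std(1, keepdims=True) + 1e-12)
    labels = kmeans(feat, k)
    # squared correlation with the class as a stand-in relevance score
    score = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])
    return [int(np.flatnonzero(labels == c)[np.argmax(score[labels == c])])
            for c in range(k) if np.any(labels == c)]

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=120)
informative = y[:, None] + 0.3 * rng.normal(size=(120, 3))  # 3 class-linked features
noise = rng.normal(size=(120, 20))
X = np.hstack([informative, noise])
picked = select_features_by_cluster(X, y, k=5)   # at most one feature per cluster
```

The selected columns of `X` would then be fed to a classifier such as Random Forest, as in the paper's pipeline.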

Progression-Preserving Dimension Reduction for High-Dimensional Sensor Data Visualization

  • Yoon, Hyunjin;Shahabi, Cyrus;Winstein, Carolee J.;Jang, Jong-Hyun
    • ETRI Journal
    • /
    • v.35 no.5
    • /
    • pp.911-914
    • /
    • 2013
  • This letter presents Progression-Preserving Projection, a dimension reduction technique that finds a linear projection that maps a high-dimensional sensor dataset into a two- or three-dimensional subspace with a particularly useful property for visual exploration. As a demonstration of its effectiveness as a visual exploration and diagnostic means, we empirically evaluate the proposed technique over a dataset acquired from our own virtual-reality-enhanced ball-intercepting training system designed to promote the upper extremity movement skills of individuals recovering from stroke-related hemiparesis.

An Efficient Content-Based High-Dimensional Index Structure for Image Data

  • Lee, Jang-Sun;Yoo, Jae-Soo;Lee, Seok-Hee;Kim, Myung-Joon
    • ETRI Journal
    • /
    • v.22 no.2
    • /
    • pp.32-42
    • /
    • 2000
  • The existing multi-dimensional index structures are not adequate for indexing high-dimensional data sets. Although they can conceptually be extended to higher dimensionalities, they usually require time and space that grow exponentially with the dimensionality. In this paper, we analyze the existing index structures and derive requirements for an index structure for content-based image retrieval. We also propose a new structure, satisfying these requirements, for indexing large amounts of point data in a high-dimensional space. To justify the performance of the proposed structure, we compare it with the existing index structures in various environments. We show through experiments that our proposed structure outperforms the existing structures in terms of retrieval time and storage overhead.


PdR-Tree : An Efficient Indexing Technique for the improvement of search performance in High-Dimensional Data (PdR-트리 : 고차원 데이터의 검색 성능 향상을 위한 효율적인 인덱스 기법)

  • Joh, Beom-Seok;Park, Young-Bae
    • The KIPS Transactions:PartD
    • /
    • v.8D no.2
    • /
    • pp.145-153
    • /
    • 2001
  • The Pyramid-Technique is based on mapping n-dimensional space data into one-dimensional values indexed by a B+-tree; by solving the problem of search-time complexity, it also avoids the "curse of dimensionality" caused by processing hypercube range queries in an n-dimensional data space. The Spherical Pyramid-Technique applies the Pyramid method's space-division strategy with spherical range queries, improving search performance to make it suitable for similarity search. However, depending on the data size and dimensionality, both techniques show significantly degraded search performance for data sets larger than one million items and dimensions greater than sixteen. In this paper, we propose a new index structure, the PdR-Tree, to improve search performance for high-dimensional data such as multimedia data. Test results using both simulated and real data demonstrate that the PdR-Tree surpasses both the Pyramid-Technique and the Spherical Pyramid-Technique in terms of search performance.
