• Title/Summary/Keyword: 고차원자료 (high-dimensional data)

Multiple testing and its applications in high-dimension (고차원자료에서의 다중검정의 활용)

  • Jang, Woncheol
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.1063-1076
    • /
    • 2013
  • The power of modern technology is opening a new era of big data. The size of the datasets affords us the opportunity to answer many open scientific questions but also presents some interesting challenges. High-dimensional data such as microarray data are common in big data. In this paper, we give an overview of recent developments in multiple testing, including global and simultaneous testing, and its applications to high-dimensional data.
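
As an illustration of simultaneous testing across many hypotheses, the following minimal sketch implements the Benjamini-Hochberg step-up procedure, one widely used way to control the false discovery rate. The abstract does not say which procedures the paper covers, so this choice, the function name `benjamini_hochberg`, and the example p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject the hypotheses with the cutoff_rank smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per gene in a microarray study.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p_values, alpha=0.05)
print(decisions)
```

With these eight p-values only the two smallest fall below their rank-dependent thresholds, so only the first two hypotheses are rejected.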

A study on high dimensional large-scale data visualization (고차원 대용량 자료의 시각화에 대한 고찰)

  • Lee, Eun-Kyung;Hwang, Nayoung;Lee, Yoondong
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.6
    • /
    • pp.1061-1075
    • /
    • 2016
  • In this paper, we discuss various methods to visualize high dimensional large-scale data and review some issues associated with visualizing this type of data. High-dimensional data can be presented in a 2-dimensional space with a few selected important variables. We can visualize more variables with various aesthetic attributes in graphics or use the projection pursuit method to find an interesting low-dimensional view. For large-scale data, we discuss jittering and alpha blending methods that address the problem of overlapping points. We also review the R packages tabplot and scagnostics, as well as other R packages for interactive web applications with visualization.

Network-based regularization for analysis of high-dimensional genomic data with group structure (그룹 구조를 갖는 고차원 유전체 자료 분석을 위한 네트워크 기반의 규제화 방법)

  • Kim, Kipoong;Choi, Jiyun;Sun, Hokeun
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.6
    • /
    • pp.1117-1128
    • /
    • 2016
  • In genetic association studies with high-dimensional genomic data, regularization procedures based on penalized likelihood are often applied to identify genes or genetic regions associated with diseases or traits. A network-based regularization procedure can utilize biological network information (such as genetic pathways and signaling pathways in genetic association studies) and achieves outstanding selection performance compared with other regularization procedures such as lasso and elastic-net. However, network-based regularization has a limitation: it cannot be applied to high-dimensional genomic data with a group structure. In this article, we propose combining dimension reduction techniques such as principal component analysis and partial least squares with network-based regularization for the analysis of high-dimensional genomic data with a group structure. The selection performance of the proposed method was evaluated by extensive simulation studies. The proposed method was also applied to real DNA methylation data generated from the Illumina Infinium HumanMethylation27K BeadChip, where methylation beta values of around 20,000 CpG sites over 12,770 genes were compared between 123 ovarian cancer patients and 152 healthy controls. This analysis also identified a few cancer-related genes.

Feature Extraction on High Dimensional Data Using Incremental PCA (점진적인 주성분분석기법을 이용한 고차원 자료의 특징 추출)

  • Kim Byung-Joo
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.8 no.7
    • /
    • pp.1475-1479
    • /
    • 2004
  • High dimensional data require efficient feature extraction techniques. Although PCA (Principal Component Analysis) is a well-known feature extraction method, it requires a large amount of memory and has a high computational cost. In this paper, we use incremental PCA for feature extraction on high dimensional data. Through experiments, we show that the proposed method is superior to the APEX model.

Error reduction by adding artificial data in SOM (인공데이터첨가를 통한 SOM의 quantization error 감소)

  • Kim, Seung-Taek;Jo, Seong-Jun
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 2005.05a
    • /
    • pp.260-267
    • /
    • 2005
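  • The self-organizing map (SOM) is an unsupervised neural network that can project a high-dimensional input space onto a lower-dimensional space while preserving topological relationships. SOMs are useful in many areas, including pattern recognition and data compression/reconstruction, and have attracted particular interest as a method for visualizing high-dimensional data. In this study, we propose a method that generates artificial data and uses them in training in order to reduce the quantization error of the SOM. This approach is expected to be especially useful when a SOM must be trained with insufficient data.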
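
For readers unfamiliar with the term, the quantization error of a SOM is commonly defined as the mean distance between each input vector and its best-matching unit (the closest codebook vector). The sketch below computes it under that common definition. The paper's exact definition and its artificial-data generation method are not given in the abstract and are not reproduced here. The function name and the toy data are hypothetical.

```python
import math


def quantization_error(inputs, codebook):
    """Mean Euclidean distance from each input to its best-matching unit."""
    total = 0.0
    for x in inputs:
        # The best-matching unit is the codebook vector closest to x.
        total += min(math.dist(x, w) for w in codebook)
    return total / len(inputs)


# Toy example: three 2-D inputs and a SOM with two codebook vectors.
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
codebook = [(0.0, 0.0), (0.0, 4.0)]
qe = quantization_error(inputs, codebook)
print(qe)
```

Here the three inputs lie at distances 0, 1, and 1 from their nearest codebook vectors, so the quantization error is 2/3.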

Graphical method for evaluating the impact of influential observations in high-dimensional data (고차원 자료에서 영향점의 영향을 평가하기 위한 그래픽 방법)

  • Ahn, Sojin;Lee, Jae Eun;Jang, Dae-Heung
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.6
    • /
    • pp.1291-1300
    • /
    • 2017
  • In high-dimensional data, the number of variables is much larger than the number of observations. In this case, the impact of influential observations on regression coefficient estimates can be very large. Jang and Anderson-Cook (2017) suggested the LASSO influence plot. In this paper, we propose the LASSO influence plot, the LASSO variable selection ranking plot, and the three-dimensional LASSO influence plot as graphical methods for evaluating the impact of influential observations in high-dimensional data. Using two real high-dimensional data examples, we apply these graphical methods as regression diagnostic tools for finding influential observations. We find that influential observations can be identified with these graphical methods.

Current trends in high dimensional massive data analysis (고차원 대용량 자료분석의 현재 동향)

  • Jang, Woncheol;Kim, Gwangsu;Kim, Joungyoun
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.6
    • /
    • pp.999-1005
    • /
    • 2016
  • The advent of big data brings the opportunity to answer many open scientific questions but also presents some interesting challenges. The main features of contemporary datasets are high dimensionality and massive sample size. In this paper, we give an overview of major challenges caused by these two features: (i) noise accumulation and spurious correlations in high dimensional data; (ii) computational scalability for massive data. We also describe applications of big data in various fields, including disaster forecasting, digital humanities, and sabermetrics.

Introduction to variational Bayes for high-dimensional linear and logistic regression models (고차원 선형 및 로지스틱 회귀모형에 대한 변분 베이즈 방법 소개)

  • Jang, Insong;Lee, Kyoungjae
    • The Korean Journal of Applied Statistics
    • /
    • v.35 no.3
    • /
    • pp.445-455
    • /
    • 2022
  • In this paper, we introduce existing Bayesian methods for high-dimensional sparse regression models and compare their performance in various simulation scenarios. In particular, we focus on the variational Bayes approach proposed by Ray and Szabó (2021), which enables scalable and accurate Bayesian inference. Based on simulated data sets from sparse high-dimensional linear regression models, we compare the variational Bayes approach with other Bayesian and frequentist methods. To check the practical performance of variational Bayes in logistic regression models, a real data analysis is conducted using a leukemia data set.

A study on bias effect of LASSO regression for model selection criteria (모형 선택 기준들에 대한 LASSO 회귀 모형 편의의 영향 연구)

  • Yu, Donghyeon
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.4
    • /
    • pp.643-656
    • /
    • 2016
  • High dimensional data, in which the number of variables is greater than the number of samples, are frequently encountered in various fields. It is usually necessary to select variables to estimate regression coefficients and avoid overfitting in high dimensional data. A penalized regression model performs variable selection and coefficient estimation simultaneously, which makes it frequently used for high dimensional data. However, the penalized regression model also requires selecting the optimal model by choosing a tuning parameter based on a model selection criterion. This study deals with the bias effect of LASSO regression on model selection criteria. We numerically describe the effect of the bias on the model selection criteria and apply the proposed correction to the identification of biomarkers for lung cancer based on gene expression data.
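
The source of the LASSO bias can be seen in the simplest setting. Assuming an orthonormal design and the objective (1/2)||y - Xb||^2 + lam * ||b||_1, each LASSO coefficient is the soft-thresholded least-squares estimate, so every nonzero coefficient is shrunk toward zero by exactly lam. The sketch below illustrates only this textbook fact. It does not implement the correction proposed in the paper, and the function name and numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """LASSO solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares estimates and tuning parameter.
ols_estimates = [3.0, -1.5, 0.4]
lam = 0.5
lasso_estimates = [soft_threshold(z, lam) for z in ols_estimates]
print(lasso_estimates)
```

The two large coefficients are each pulled toward zero by lam, and the small one is set exactly to zero. This shrinkage is the bias that can distort criteria computed from the fitted model.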

A survey on unsupervised subspace outlier detection methods for high dimensional data (고차원 자료의 비지도 부분공간 이상치 탐지기법에 대한 요약 연구)

  • Ahn, Jaehyeong;Kwon, Sunghoon
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.3
    • /
    • pp.507-521
    • /
    • 2021
  • Detecting outliers in high-dimensional data involves the challenging problem of screening variables, since relevant information is often contained in only a few of the variables. When many irrelevant variables are included in the data, the distances between all observations tend to become similar, which makes the degree of outlierness of all observations alike. The subspace outlier detection method overcomes this problem by measuring the degree of outlierness of each observation based on relevant subsets of the variables. In this paper, we survey recent subspace outlier detection techniques, classifying them into three major types according to the subspace selection method. We also summarize the techniques of each type in terms of how they select the relevant subspaces and how they measure the degree of outlierness. In addition, we introduce some computing tools for implementing the subspace outlier detection techniques and present results from a simulation study and real data analysis.
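
The claim that distances become similar when many irrelevant variables are present can be checked with a small simulation. The sketch below draws independent uniform points and compares the relative contrast, (max - min) / min, of the distances from one point to all the others in 2 and in 1000 dimensions. The function name, sample size, and dimensions are arbitrary choices for illustration and are not taken from the paper.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of distances from the first point to the others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(200, 2)      # few variables
high = relative_contrast(200, 1000)  # many uninformative variables
print(low, high)
```

The contrast in 1000 dimensions is far smaller than in 2 dimensions: the nearest and farthest points are almost equally far away, which is why distance-based outlier scores computed on all variables lose their discriminating power.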