• Title/Abstract/Keywords: multivariate data analysis

Search results: 1,402 items (processing time: 0.027 s)

Simple Compromise Strategies in Multivariate Stratification

  • Park, Inho
    • Communications for Statistical Applications and Methods / Vol. 20, No. 2 / pp.97-105 / 2013
  • Stratification is, among its other uses, a popular technique in survey practice for improving the accuracy of estimators. Its full potential benefit can be gained by effectively using auxiliary variables related to the survey variables. This paper focuses on the problem of stratum formation when multiple stratification variables are available. We first review a variance reduction strategy for univariate stratification. We then discuss how to use it in multivariate situations in convenient and efficient ways using three methods: compromised measures of size, principal components analysis, and a K-means clustering algorithm. We also consider three types of compromising factors applied to the data when using these three methods. Finally, we compare their efficiency using data from the MU281 Swedish municipality population.

Racial and Social Economic Factors Impact on the Cause Specific Survival of Pancreatic Cancer: A SEER Survey

  • Cheung, Rex
    • Asian Pacific Journal of Cancer Prevention / Vol. 14, No. 1 / pp.159-163 / 2013
  • Background: This study used Surveillance, Epidemiology and End Results (SEER) pancreatic cancer data to identify predictive models and potential socio-economic disparities in pancreatic cancer outcome. Materials and Methods: For risk modeling, the Kaplan-Meier method was used for cause-specific survival analysis. The Kolmogorov-Smirnov test was used to compare survival curves. The Cox proportional hazards method was applied for multivariate analysis. The area under the ROC curve was computed for predictors of absolute risk of death, optimized to improve efficiency. Results: This study included 58,747 patients. The mean follow-up time (S.D.) was 7.6 (10.6) months. SEER stage and grade were strongly predictive in univariate analysis. Sex, race, and three socio-economic factors (county-level family income, rural-urban residence status, and county-level education attainment) were independent multivariate predictors. Racial and socio-economic factors were associated with about a 2% difference in absolute cause-specific survival. Conclusions: This study found significant effects of socio-economic factors on pancreatic cancer outcome. These data may generate hypotheses for trials to eliminate these outcome disparities.
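
The Kaplan-Meier method named in this abstract can be illustrated with a minimal sketch. It is not the author's code: the function name `kaplan_meier` and the follow-up times are hypothetical, and for cause-specific survival it assumes that deaths from other causes are coded as censored (event = 0).

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times  : follow-up time for each subject
    events : 1 if the event of interest was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # each subject leaves the risk set at its own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]     # remove both events and censored subjects
    return curve

# Hypothetical follow-up times in months; 0 marks a censored subject.
curve = kaplan_meier([2, 3, 3, 5, 8, 8, 12], [1, 1, 0, 1, 1, 0, 0])
for t, s in curve:
    print(f"t={t:>2}  S(t)={s:.3f}")
```

Censored subjects never lower the curve directly; they only shrink the risk set used at later event times.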

A Case Study on the Compatibility Analysis of Measurement Systems in Automobile Body Assembly

  • Lee, Myung-Duk;Lim, Ik-Sung;Sung, Chun-Ja
    • International Journal of Reliability and Applications / Vol. 9, No. 1 / pp.7-15 / 2008
  • Dimensional measurement equipment, such as the Coordinate Measurement Machine (CMM), Optical Coordinate Measurement Machine (OCMM), and Checking Fixture (CF), takes multiple dimensional measurements for each part in the automobile industry. Measurements are also recorded under different measurement systems to see whether the responses differ significantly across these systems. Each measurement system (CMM, OCMM, and CF) is considered a different treatment. This set-up provides massive amounts of process data that are multivariate in nature. Multivariate statistical analysis is therefore required to analyze data that are dependent on each other. This research provides a step-by-step methodology for evaluating the compatibility of measurement systems and a systematic analysis of the compatibility among the different measurement systems, followed by a number of case studies for each methodology provided.

Fused inverse regression with multi-dimensional responses

  • Cho, Youyoung;Han, Hyoseon;Yoo, Jae Keun
    • Communications for Statistical Applications and Methods / Vol. 28, No. 3 / pp.267-279 / 2021
  • Regression with multi-dimensional responses is quite common nowadays in the so-called big data era. In such regression, to relieve the curse of dimensionality due to the high dimension of the responses, dimension reduction of the predictors is essential in the analysis. Sufficient dimension reduction provides effective tools for the reduction, but there are few sufficient dimension reduction methodologies for multivariate regression. To fill this gap, we propose two new fused slice-based inverse regression methods. The proposed approaches are robust to the numbers of clusters or slices and improve the estimation results over existing methods by fusing many kernel matrices. Numerical studies are presented and compared with existing methods. Real data analysis confirms the practical usefulness of the proposed methods.

A Study on Construct of Value-Added Productivity Structure Model using Multivariate Statistical Method

  • 이영찬;조성훈;김태성
    • Journal of Korean Society of Industrial and Systems Engineering / Vol. 19, No. 38 / pp.117-129 / 1996
  • This study uses statistical analysis to examine how three factors (indices of capital, labor, and distribution) actually affect value-added productivity. To this end, we selected 12 value-added indices from the 'Annual Report of Korean Companies' published by Korea Investors Service, Inc., covering a total of 85 companies in the chemicals and chemical products sector. Using these data, multivariate statistical analyses such as principal component analysis, factor analysis, and covariance structure analysis are applied to model the effect of the three factors (labor productivity, capital productivity, and the index of distribution) on value-added productivity.

Improving data reliability on oligonucleotide microarray

  • Yoon, Yeo-In;Lee, Young-Hak;Park, Jin-Hyun
    • Proceedings of the Korean Society for Bioinformatics Conference / Korean Society for Bioinformatics and Systems Biology, 2004: The 3rd Annual Conference for The Korean Society for Bioinformatics Association of Asian Societies for Bioinformatics 2004 Symposium / pp.107-116 / 2004
  • The advent of microarray technologies gives an opportunity to monitor the expression of tens of thousands of genes simultaneously. Such microarray data can be degraded by experimental errors and image artifacts, which generate non-negligible outliers estimated at 15% of typical microarray data. Thus, it is important to detect and correct these faulty probes prior to high-level data analysis such as classification or clustering. In this paper, we propose a systematic procedure for the detection of faulty probes and their proper correction in GeneChip arrays based on multivariate statistical approaches. Principal component analysis (PCA), one of the most widely used multivariate statistical approaches, is applied to construct a statistical correlation model with 20 pairs of probes for each gene. The faulty probes are identified by inspecting the squared prediction error (SPE) of each probe from the PCA model. The outlying probes are then reconstructed by an iterative optimization approach that minimizes the SPE. We used the public data from the gene chip project on human fibroblast cells. In the application study, the proposed approach showed good performance for probe correction without removing faulty probes, which may be desirable from the viewpoint of making maximum use of the information in the data.
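
The SPE idea in this abstract can be sketched as follows. This is an illustration only, not the authors' procedure: it keeps a single principal component (found by power iteration), scores rows of a small hypothetical matrix rather than GeneChip probe pairs, and omits the iterative reconstruction step. The names `center`, `first_component`, and `spe` are assumptions made for this example.

```python
import math

def center(rows):
    """Subtract the column means from every row."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def first_component(X, iters=200):
    """Leading principal direction of centered data X via power iteration."""
    p = len(X[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(x[j] * v[j] for j in range(p)) for x in X]           # X v
        w = [sum(s * x[j] for s, x in zip(scores, X)) for j in range(p)]   # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def spe(X, v):
    """Squared prediction error of each row after projecting onto v."""
    out = []
    for x in X:
        t = sum(a * b for a, b in zip(x, v))          # score on the component
        out.append(sum((a - t * b) ** 2 for a, b in zip(x, v)))
    return out

# Hypothetical data: ten rows that follow one common pattern, plus one row
# whose third value breaks the pattern.
rows = [[float(k)] * 3 for k in range(10)] + [[5.0, 5.0, 8.0]]
X = center(rows)
scores = spe(X, first_component(X))
suspect = max(range(len(scores)), key=scores.__getitem__)
print("row with the largest SPE:", suspect)
```

A row that the low-dimensional model cannot reproduce leaves a large residual, which is what makes the SPE usable as a fault indicator.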

Discrimination of cultivation ages and cultivars of ginseng leaves using Fourier transform infrared spectroscopy combined with multivariate analysis

  • Kwon, Yong-Kook;Ahn, Myung Suk;Park, Jong Suk;Liu, Jang Ryol;In, Dong Su;Min, Byung Whan;Kim, Suk Weon
    • Journal of Ginseng Research / Vol. 38, No. 1 / pp.52-58 / 2014
  • To determine whether Fourier transform (FT)-IR spectral analysis combined with multivariate analysis of whole-cell extracts from ginseng leaves can be applied as a high-throughput discrimination system for cultivation ages and cultivars, a total of 480 leaf samples belonging to 12 categories corresponding to four different cultivars (Yunpung, Kumpung, Chunpung, and an open-pollinated variety) and three different cultivation ages (1 yr, 2 yr, and 3 yr) were subjected to FT-IR. The spectral data were analyzed by principal component analysis and partial least squares-discriminant analysis. A dendrogram based on hierarchical clustering analysis of the FT-IR spectral data on ginseng leaves showed that leaf samples were initially segregated into three groups in a cultivation age-dependent manner. Then, within the same cultivation age group, leaf samples were clustered into four subgroups in a cultivar-dependent manner. The overall prediction accuracy for discrimination of cultivars and cultivation ages was 94.8% in a cross-validation test. These results clearly show that FT-IR spectra of ginseng leaves combined with multivariate analysis can be applied as an alternative tool for discriminating ginseng cultivars and cultivation ages. Therefore, we suggest that this result could be used as a rapid and reliable F1 hybrid seed-screening tool for accelerating the conventional breeding of ginseng.

Evaluation of Water Quality in the Keum River using Statistics Analysis

  • 김종구
    • Journal of the Korean Environmental Sciences Society / Vol. 11, No. 12 / pp.1281-1289 / 2002
  • This study was conducted to evaluate water quality in the Keum River using multivariate analysis. The analysis used data surveyed by the Ministry of Environment from January 1994 to December 2001. Thirteen water quality parameters were determined for each sample. The results are summarized as follows. Water quality in the Keum River could be explained up to 71.39% by four factors: loading of organic matter and nutrients by the tributaries (32.88%), seasonal variation (16.09%), loading of pathogenic bacteria by domestic sewage from Gapcheon (13.39%), and internal metabolism in the lake-like estuary (9.03%). In terms of the spatial variation of factor scores, the stations were classified into four groups according to the characteristics of each factor: stations 1 and 2 were influenced by the Daechung dam, station 3 was affected by domestic sewage from Gapcheon, stations 10-12 were affected by the estuary dyke, and the remaining stations formed the fourth group. Cluster analysis by station likewise yielded four groups with different water quality characteristics. Monthly cluster analysis yielded three groups according to seasonal characteristics, and yearly cluster analysis also yielded three groups. To manage the water quality of the Keum River, it is necessary to control the pollutant loading from domestic sewage in Daejeon city that flows in through the Gapcheon.

Application of Multivariate Statistical Analysis Technique in Landfill Investigation

  • 권병두;김차섭
    • Journal of the Korean Earth Science Society / Vol. 18, No. 6 / pp.515-521 / 1997
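  • To investigate the characteristics of the waste in the Nanjido landfill, we analyzed gravity data, magnetic data, LandSat TM thermal infrared band data, and settlement data measured at the landfill surface using a multivariate statistical analysis technique. Because these data sets provide information about different depths, the measured total magnetic field data and gravity data were converted into reduced-to-the-pole magnetic anomalies and a three-dimensional density distribution of the landfill, respectively, and this study used the information about the upper layer of the landfill. The statistical analysis was carried out at the settlement measurement points, where the reduced-to-the-pole magnetic anomaly, the waste density, and the LandSat TM thermal infrared band values were obtained by interpolation. The multivariate technique used for the analysis was cluster analysis, which groups objects by the geometric distance between them; the data were standardized beforehand to remove the effect of differing measurement units on the distance calculation. Clustering was performed with a hierarchical clustering method. According to the dendrogram, the optimal number of clusters classified on the basis of physical properties was six. The results suggest that applying multivariate statistical analysis to integrated geophysical data makes it possible to characterize complex waste landfills.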
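
The standardize-then-cluster workflow described in this abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name a linkage rule, so single linkage is assumed here, and the two-column sample values (a density-like column and a magnetic-anomaly-like column) are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so that measurement units do not dominate distances."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(r, mu, sd)] for r in rows]

def agglomerate(points, k):
    """Merge the two closest clusters (single linkage, assumed) until k remain."""
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: gap(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical measurement points: column 0 varies by tenths, column 1 by hundreds.
raw = [[1.2, 100], [1.3, 120], [1.8, 110], [1.9, 130], [1.25, 900], [1.35, 920]]
groups = agglomerate(standardize(raw), 3)
print(groups)
```

On the raw values the second column dominates every distance because of its scale, which is the effect the standardization step is meant to remove.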

Anomaly Detection In Real Power Plant Vibration Data by MSCRED Base Model Improved By Subset Sampling Validation

  • 홍수웅;권장우
    • Journal of Convergence for Information Technology / Vol. 12, No. 1 / pp.31-38 / 2022
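  • This paper presents the application, in a real industrial setting, of MSCRED (Multi-Scale Convolutional Recurrent Encoder-Decoder), an expert-independent model for multivariate time series analysis based on unsupervised neural network learning, together with Subset Sampling Validation, a training-data sampling technique for overcoming a limitation of the auto-encoder-based MSCRED model, namely that its training data must not be contaminated. Using labeled vibration data from power plant equipment, we evaluate the effectiveness of MSCRED and Subset Sampling Validation by comparing the anomaly scores in two cases: 1) the model is trained on data that reproduce a situation in which abnormal data are mixed into the training set, and 2) in the same situation as 1), the abnormal data are removed from the training set with Subset Sampling Validation. In doing so, this paper presents an expert-independent anomaly detection framework that is robust to erroneous data, offering a concise and accurate solution for a variety of multivariate time series applications.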