• Title/Summary/Keyword: multivariate data analysis

Simple Compromise Strategies in Multivariate Stratification

  • Park, Inho
    • Communications for Statistical Applications and Methods / v.20 no.2 / pp.97-105 / 2013
  • Stratification is a popular technique used in survey practice to improve the accuracy of estimators, among other applications. Its full potential benefit can be gained by effectively using auxiliary variables that are related to the survey variables. This paper focuses on the problem of stratum formation when multiple stratification variables are available. We first review a variance reduction strategy in the case of univariate stratification. We then discuss its use for multivariate situations in convenient and efficient ways using three methods: compromised measures of size, principal components analysis and a K-means clustering algorithm. We also consider three types of compromising factors applied to the data when using these three methods. Finally, we compare their efficiency using data from the MU281 Swedish municipality population.

Racial and Social Economic Factors Impact on the Cause Specific Survival of Pancreatic Cancer: A SEER Survey

  • Cheung, Rex
    • Asian Pacific Journal of Cancer Prevention / v.14 no.1 / pp.159-163 / 2013
  • Background: This study used Surveillance, Epidemiology and End Results (SEER) pancreatic cancer data to identify predictive models and potential socio-economic disparities in pancreatic cancer outcome. Materials and Methods: For risk modeling, the Kaplan-Meier method was used for cause-specific survival analysis. The Kolmogorov-Smirnov test was used to compare survival curves. The Cox proportional hazards method was applied for multivariate analysis. The area under the ROC curve was computed for predictors of absolute risk of death, optimized to improve efficiency. Results: This study included 58,747 patients. The mean follow-up time (S.D.) was 7.6 (10.6) months. SEER stage and grade were strongly predictive univariates. Sex, race, and three socio-economic factors (county-level family income, rural-urban residence status, and county-level education attainment) were independent multivariate predictors. Racial and socio-economic factors were associated with about a 2% difference in absolute cause-specific survival. Conclusions: This study found significant effects of socio-economic factors on pancreatic cancer outcome. These data may generate hypotheses for trials to eliminate these outcome disparities.
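
The Kaplan-Meier step named in the abstract above can be illustrated with a minimal sketch. It uses the standard product-limit formula, S(t) = product over event times of (1 - d_i / n_i), where d_i is the number of deaths at time t_i and n_i the number still at risk. The function name `kaplan_meier` and the toy data are assumptions made for illustration; they are not taken from the study or from SEER.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time for each subject
    events : 1 if the death of interest was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit update: multiply by the fraction surviving t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical toy data (months of follow-up), not from the SEER study.
toy_times = [2, 3, 3, 5, 8, 8, 12]
toy_events = [1, 1, 0, 1, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Censored subjects contribute to the risk set until their follow-up ends but never trigger a drop in the curve, which is why the last subject (censored at 12 months) adds no point.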

A Case Study on the Compatibility Analysis of Measurement Systems in Automobile Body Assembly

  • Lee, Myung-Duk;Lim, Ik-Sung;Sung, Chun-Ja
    • International Journal of Reliability and Applications / v.9 no.1 / pp.7-15 / 2008
  • Dimensional measurement equipment, such as the Coordinate Measurement Machine (CMM), Optical Coordinate Measurement Machine (OCMM), and Checking Fixture (CF), takes multiple dimensional measurements for each part in the automobile industry. Measurements are also recorded under different measurement systems to see whether the responses differ significantly across these systems. Each measurement system (CMM, OCMM, and CF) is treated as a different treatment. This set-up provides massive amounts of process data that are multivariate in nature. Therefore, multivariate statistical analysis is required to analyze data that are dependent on each other. This research provides a step-by-step methodology for evaluating the compatibility of measurement systems and a systematic analysis of compatibility among the different measurement systems, followed by a number of case studies for each methodology provided.

Fused inverse regression with multi-dimensional responses

  • Cho, Youyoung;Han, Hyoseon;Yoo, Jae Keun
    • Communications for Statistical Applications and Methods / v.28 no.3 / pp.267-279 / 2021
  • Regression with multi-dimensional responses is quite common in the so-called big data era. In such regression, dimension reduction of the predictors is essential to relieve the curse of dimensionality caused by the high-dimensional responses. Sufficient dimension reduction provides effective tools for the reduction, but there are few sufficient dimension reduction methodologies for multivariate regression. To fill this gap, we propose two fused slice-based inverse regression methods. The proposed approaches are robust to the numbers of clusters or slices and improve the estimation results over existing methods by fusing many kernel matrices. Numerical studies are presented and compared with existing methods. Real data analysis confirms the practical usefulness of the proposed methods.

A Study on Construct of Value-Added Productivity Structure Model using Multivariate Statistical Method (다변량통계기법을 이용한 부가가치생산성 구조모델의 구상에 관한 연구)

  • 이영찬;조성훈;김태성
    • Journal of Korean Society of Industrial and Systems Engineering / v.19 no.38 / pp.117-129 / 1996
  • This study uses statistical analysis to examine how three factors, namely the indices of capital, labor and distribution, affect value-added productivity. For this, we selected 12 value-added indices from the 'Annual Report of Korean Companies' published by Korea Investors Service, Inc., covering a total of 85 companies in the chemicals and chemical products sector. Using these data, multivariate statistical analyses such as principal component analysis, factor analysis and covariance structure analysis are applied to model the effect of the three factors (labor productivity, capital productivity and the index of distribution) on value-added productivity.

Improving data reliability on oligonucleotide microarray

  • Yoon, Yeo-In;Lee, Young-Hak;Park, Jin-Hyun
    • Proceedings of the Korean Society for Bioinformatics Conference / 2004.11a / pp.107-116 / 2004
  • The advent of microarray technologies makes it possible to monitor the expression of tens of thousands of genes simultaneously. Such microarray data can be degraded by experimental errors and image artifacts, which generate non-negligible outliers estimated at 15% of typical microarray data. Thus, it is important to detect and correct these faulty probes prior to high-level data analysis such as classification or clustering. In this paper, we propose a systematic procedure, based on multivariate statistical approaches, for detecting faulty probes in Genechip arrays and correcting them properly. Principal component analysis (PCA), one of the most widely used multivariate statistical approaches, is applied to construct a statistical correlation model with 20 pairs of probes for each gene. The faulty probes are identified by inspecting the squared prediction error (SPE) of each probe from the PCA model. The outlying probes are then reconstructed by an iterative optimization approach that minimizes the SPE. We used public data from the gene chip project on human fibroblast cells. In the application study, the proposed approach showed good performance for probe correction without removing faulty probes, which may be desirable from the viewpoint of making maximum use of the data.
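
The SPE-based detection idea in the abstract above can be sketched as follows: fit a PCA model, project each observation onto the retained components, and treat the squared residual as the squared prediction error. This is a minimal illustration under simplifying assumptions, not the authors' procedure: it keeps a single principal component, finds it by power iteration instead of a library routine, assumes the starting vector is not orthogonal to that component, and omits the iterative correction step. All function names and the toy data are hypothetical.

```python
import math


def column_means(rows):
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def center(row, means):
    return [x - m for x, m in zip(row, means)]


def first_component(centered, iters=100):
    """Leading principal direction via power iteration on X^T X."""
    p = len(centered[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(p)]
        norm = math.sqrt(sum(wj * wj for wj in w))
        v = [wj / norm for wj in w]
    return v


def spe(centered_row, v):
    """Squared prediction error: squared residual after projecting onto v."""
    score = sum(x * vj for x, vj in zip(centered_row, v))
    return sum((x - score * vj) ** 2 for x, vj in zip(centered_row, v))


# Hypothetical toy data: clean observations lie exactly on one direction.
clean = [[t, 2.0 * t, -t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
means = column_means(clean)
component = first_component([center(r, means) for r in clean])

clean_spe = [spe(center(r, means), component) for r in clean]
suspect_spe = spe(center([1.0, 2.0, 3.0], means), component)
```

On this toy data the clean rows have an SPE of essentially zero, while the suspect row `[1, 2, 3]` breaks the correlation pattern and gets an SPE of 120/9. In practice the cut-off separating the two would have to be chosen from the data.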

Discrimination of cultivation ages and cultivars of ginseng leaves using Fourier transform infrared spectroscopy combined with multivariate analysis

  • Kwon, Yong-Kook;Ahn, Myung Suk;Park, Jong Suk;Liu, Jang Ryol;In, Dong Su;Min, Byung Whan;Kim, Suk Weon
    • Journal of Ginseng Research / v.38 no.1 / pp.52-58 / 2014
  • To determine whether Fourier transform (FT)-IR spectral analysis combined with multivariate analysis of whole-cell extracts from ginseng leaves can be applied as a high-throughput discrimination system of cultivation ages and cultivars, a total of 480 leaf samples belonging to 12 categories corresponding to four different cultivars (Yunpung, Kumpung, Chunpung, and an open-pollinated variety) and three different cultivation ages (1 yr, 2 yr, and 3 yr) were subjected to FT-IR. The spectral data were analyzed by principal component analysis and partial least squares-discriminant analysis. A dendrogram based on hierarchical clustering analysis of the FT-IR spectral data on ginseng leaves showed that leaf samples were initially segregated into three groups in a cultivation age-dependent manner. Then, within the same cultivation age group, leaf samples were clustered into four subgroups in a cultivar-dependent manner. The overall prediction accuracy for discrimination of cultivars and cultivation ages was 94.8% in a cross-validation test. These results clearly show that FT-IR spectra of ginseng leaves combined with multivariate analysis can be applied as an alternative tool for discriminating ginseng cultivars and cultivation ages. Therefore, we suggest that this result could be used as a rapid and reliable F1 hybrid seed-screening tool for accelerating the conventional breeding of ginseng.

Evaluation of Water Quality in the Keum River using Statistics Analysis (통계분석 기법을 이용한 錦江水系의 水質評價)

  • Kim, Jong-Gu
    • Journal of Environmental Science International / v.11 no.12 / pp.1281-1289 / 2002
  • This study was conducted to evaluate water quality in the Keum River using multivariate analysis. The analysis used data surveyed by the Ministry of Environment from January 1994 to December 2001. Thirteen water quality parameters were determined for each sample. The results are summarized as follows. Water quality in the Keum River could be explained up to 71.39% by four factors: loading of organic matter and nutrients by the tributaries (32.88%), seasonal variation (16.09%), loading of pathogenic bacteria by domestic sewage from Gapcheon (13.39%) and internal metabolism in the lake-like estuary (9.03%). For the spatial variation of factor scores, four groups were classified according to the characteristics of each factor: stations 1 and 2 were influenced by the Daechung dam, station 3 was affected by domestic sewage from Gapcheon, stations 10 to 12 were affected by the estuary dyke, and the remaining stations formed a separate group. Cluster analysis by station also produced four groups with different water quality characteristics. In the monthly cluster analysis, three groups were classified according to seasonal characteristics. In the yearly cluster analysis, three groups were also classified. For water quality management of the Keum River, it is necessary to control the pollutant loading from domestic sewage in Daejeon city that flows in through Gapcheon.

Application of Multivariate Statistical Analysis Technique in Landfill Investigation (매립물 특성 조사를 위한 다변량 통계분석 기법의 응용)

  • Kwon, Byung-Doo;Kim, Cha-Soup
    • Journal of the Korean Earth Science Society / v.18 no.6 / pp.515-521 / 1997
  • To investigate the nature of the waste materials in the Nanjido Landfill, we conducted a multivariate statistical analysis of a geophysical data set comprising magnetic, gravity, Landsat TM thermal band and surface depression measurement data. Because these data sets respond differently to depth, we transformed the observed total field magnetic data and gravity data to residual reduced-to-pole (RTP) magnetic anomalies and three-dimensional density anomalies, respectively, and used only the information about the upper shallow part of the landfill in the subsequent process. For the statistical analysis at the points of depression measurement, the magnetic, density and Landsat data values at these points were determined by interpolation. Since the multivariate statistical analysis technique uses a clustering algorithm to classify the data set, and we measured the dissimilarity between objects by Euclidean distance, standardization was applied prior to distance calculation to eliminate any scaling effects due to the different measurement units of each data set. The hierarchical grouping technique was used to construct the dendrogram. The optimum number of statistical groups (clusters), which are classified on the basis of geophysical and geotechnical characteristics, appeared to be six on the resulting dendrogram. The result of this study suggests that the dimension and nature of multicomponent waste landfills can be identified by applying the multivariate statistical analysis technique to integrated geophysical data sets.
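
The standardization step described above, applied before computing Euclidean distances, can be sketched as follows. This is a minimal illustration, assuming z-scores with the sample standard deviation and no constant columns; the function names and the toy values are hypothetical and are not data from the Nanjido survey.

```python
import math


def standardize(rows):
    """Convert each column to z-scores (mean 0, sample std 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    # Assumes no column is constant, so every std is non-zero.
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical measurements at three points: [magnetic, density],
# recorded in very different units.
raw = [[100.0, 0.1], [200.0, 0.3], [300.0, 0.2]]
# The same points with the first column expressed in a 1000x smaller unit.
rescaled = [[x * 1000.0, y] for x, y in raw]

z_raw = standardize(raw)
z_rescaled = standardize(rescaled)

raw_distance = euclidean(raw[0], raw[1])     # dominated by column 0
z_distance = euclidean(z_raw[0], z_raw[1])   # both columns contribute
```

Without standardization the distance between the first two points is driven almost entirely by the first column. After standardization the result no longer depends on the unit chosen for either column, so a clustering algorithm that uses these distances treats all data sets on an equal footing.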

Anomaly Detection In Real Power Plant Vibration Data by MSCRED Base Model Improved By Subset Sampling Validation (Subset 샘플링 검증 기법을 활용한 MSCRED 모델 기반 발전소 진동 데이터의 이상 진단)

  • Hong, Su-Woong;Kwon, Jang-Woo
    • Journal of Convergence for Information Technology / v.12 no.1 / pp.31-38 / 2022
  • This paper applies MSCRED (Multi-Scale Convolutional Recurrent Encoder-Decoder), an expert-independent, unsupervised neural network model for multivariate time series data analysis. Because MSCRED is based on an auto-encoder model, it has the limitation that the training data must not be contaminated; to overcome this, the paper uses a training data sampling technique called Subset Sampling Validation. Using labeled vibration data from power plant equipment, the classification performance of MSCRED is evaluated with the Anomaly Score in the following cases: 1) when abnormal data are mixed with the training data, and 2) when the abnormal data are removed from the training data of case 1. Through this, the paper presents an expert-independent anomaly diagnosis framework that is robust to erroneous data, and offers a concise and accurate solution for various fields that use multivariate time series data.