Title/Summary/Keyword: Multivariate statistical models

Issues Related to the Use of Time Series in Model Building and Analysis: Review Article

  • Wei, William W.S.
    • Communications for Statistical Applications and Methods
    • /
    • v.22 no.3
    • /
    • pp.209-222
    • /
    • 2015
  • Time series are used in many studies for model building and analysis. We must be very careful to understand the kind of time series data used in the analysis. In this review article, we will begin with some issues related to the use of aggregate and systematic sampling time series. Since several time series are often used in a study of the relationship of variables, we will also consider vector time series modeling and analysis. Although the basic procedures of model building between univariate time series and vector time series are the same, there are some important phenomena which are unique to vector time series. Therefore, we will also discuss some issues related to vector time series models. Understanding these issues is important when we use time series data in modeling and analysis, regardless of whether it is a univariate or multivariate time series.
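
As a rough, self-contained illustration of two of the ideas touched on above, the sketch below (simulated data with hypothetical VAR(1) coefficients; assumes numpy, pandas, and statsmodels are installed) contrasts temporal aggregation with systematic sampling of a bivariate series and then fits a vector autoregression with the lag order chosen by AIC:

```python
# Illustrative sketch only: simulated bivariate series, not data from the paper.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)

# Simulate a simple bivariate VAR(1) process (hypothetical coefficients).
T, block = 400, 3
A = np.array([[0.5, 0.1], [0.2, 0.4]])
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A @ y[t - 1] + rng.normal(scale=0.5, size=2)
df = pd.DataFrame(y, columns=["y1", "y2"])

# Temporal aggregation (for flow variables): sums over non-overlapping blocks.
aggregated = df.groupby(np.arange(T) // block).sum()

# Systematic sampling (for stock variables): keep every third observation.
sampled = df.iloc[::block]

# Fit a VAR to the original series, choosing the lag order by AIC.
res = VAR(df).fit(maxlags=8, ic="aic")
print("selected lag order:", res.k_ar)
print(res.params.round(3))
```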

Multiple Testing in Genomic Sequences Using Hamming Distance

  • Kang, Moonsu
    • Communications for Statistical Applications and Methods
    • /
    • v.19 no.6
    • /
    • pp.899-904
    • /
    • 2012
  • High-dimensional categorical data models with small sample sizes have not been used extensively in genomic sequences that involve count (or discrete) or purely qualitative responses. A basic task is to identify differentially expressed genes (or positions) among a number of genes. This requires an appropriate test statistic and a corresponding multiple testing procedure, since a multivariate analysis of variance is not feasible. The family-wise error rate (FWER) is not appropriate for testing thousands of genes simultaneously in a multiple testing procedure. The false discovery rate (FDR) is better suited than the FWER to such multiple testing problems. An application to data from the 2002-2003 SARS epidemic shows that the proposed test statistic, based on a pseudo-marginal approach with Hamming distance, performs well in combination with a conventional FDR procedure.
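
The following sketch is not the paper's pseudo-marginal statistic; it is a hedged illustration of the general workflow on toy aligned sequences: a position-wise Hamming mismatch statistic with permutation p-values, followed by Benjamini-Hochberg FDR control via statsmodels:

```python
# Toy example: simulated aligned sequences, simple mismatch statistic.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Rows are samples, columns are genomic positions.
n1, n2, L = 8, 8, 50
group1 = rng.choice(list("ACGT"), size=(n1, L))
group2 = rng.choice(list("ACGT"), size=(n2, L))
group2[:, :5] = "A"  # make the first five positions differ systematically

def between_group_mismatch(g1, g2):
    """Per-position mean Hamming mismatch rate between the two groups."""
    return (g1[:, None, :] != g2[None, :, :]).mean(axis=(0, 1))

observed = between_group_mismatch(group1, group2)

# Permutation null: shuffle group labels and recompute the statistic.
pooled = np.vstack([group1, group2])
B = 500
null = np.empty((B, L))
for b in range(B):
    perm = rng.permutation(len(pooled))
    null[b] = between_group_mismatch(pooled[perm[:n1]], pooled[perm[n1:]])

pvals = ((null >= observed).sum(axis=0) + 1) / (B + 1)

# Benjamini-Hochberg procedure controlling the FDR at 5%.
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("positions flagged:", np.flatnonzero(reject))
```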

Bayesian Multiple Change-Point Estimation and Segmentation

  • Kim, Jaehee;Cheon, Sooyoung
    • Communications for Statistical Applications and Methods
    • /
    • v.20 no.6
    • /
    • pp.439-454
    • /
    • 2013
  • This study presents a Bayesian multiple change-point detection approach to segment and classify the observations that no longer come from an initial population after a certain time. Inferences are based on the multiple change-points in a sequence of random variables where the probability distribution changes. Bayesian multiple change-point estimation classifies each observation into a segment. We use a truncated Poisson distribution for the number of change-points and conjugate priors for the exponential-family distributions. The Bayesian method can lead to the unsupervised classification of discrete and continuous variables and multivariate vectors based on latent class models; therefore, the solution for the change-points corresponds to stochastic partitions of the observed data. We demonstrate segmentation with real data.
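
As a much-simplified illustration (a single change-point with a Gaussian likelihood and a conjugate Normal prior on each segment mean, rather than the authors' truncated-Poisson multiple-change-point formulation), the sketch below computes the posterior over the change-point location on simulated data:

```python
# Simplified single-change-point sketch, not the paper's full method.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Toy data: the mean shifts from 0 to 2 at position 60.
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])
n = len(x)
sigma2, mu0, tau2 = 1.0, 0.0, 10.0   # known noise variance, prior N(mu0, tau2)

def log_marginal(seg):
    """Log marginal likelihood of a segment after integrating out its mean."""
    m = len(seg)
    cov = sigma2 * np.eye(m) + tau2 * np.ones((m, m))
    return multivariate_normal(mean=np.full(m, mu0), cov=cov).logpdf(seg)

# Uniform prior over the change-point location tau (segment boundary).
taus = np.arange(5, n - 5)
logpost = np.array([log_marginal(x[:t]) + log_marginal(x[t:]) for t in taus])
post = np.exp(logpost - logpost.max())
post /= post.sum()

print("posterior mode of the change-point:", taus[np.argmax(post)])
```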

Statistical analysis of metagenomics data

  • Calle, M. Luz
    • Genomics & Informatics
    • /
    • v.17 no.1
    • /
    • pp.6.1-6.9
    • /
    • 2019
  • Understanding the role of the microbiome in human health and how it can be modulated is becoming increasingly relevant for preventive medicine and for the medical management of chronic diseases. The development of high-throughput sequencing technologies has boosted microbiome research by enabling the study of microbial genomes and allowing a more precise quantification of microbiome abundances and function. Microbiome data analysis is challenging because it involves high-dimensional, structured, multivariate, sparse data and because of its compositional nature. In this review, we outline some of the procedures that are most commonly used for microbiome analysis and that are implemented in R packages. We place particular emphasis on the compositional structure of microbiome data. We describe the principles of compositional data analysis and distinguish between standard methods and those that fit into compositional data analysis.
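
A minimal sketch of the compositional viewpoint, assuming only numpy: a toy taxa-count table is closed to proportions (with a naive pseudocount for zeros) and mapped through the centered log-ratio (CLR) transform, one of the standard tools of compositional data analysis:

```python
# Toy counts, naive zero handling; illustrative only.
import numpy as np

counts = np.array([
    [120,  30,  0,  50],
    [ 80,  10,  5, 105],
    [200,  60, 15,  25],
], dtype=float)                      # rows: samples, columns: taxa

pseudo = counts + 0.5                # simple pseudocount replacement of zeros
props = pseudo / pseudo.sum(axis=1, keepdims=True)   # closure to proportions

# CLR: log of each part divided by the geometric mean of its composition.
clr = np.log(props) - np.log(props).mean(axis=1, keepdims=True)
print(clr.round(3))
```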

Estimating the Survival of Patients With Lung Cancer: What Is the Best Statistical Model?

  • Abedi, Siavosh;Janbabaei, Ghasem;Afshari, Mahdi;Moosazadeh, Mahmood;Alashti, Masoumeh Rashidi;Hedayatizadeh-Omran, Akbar;Alizadeh-Navaei, Reza;Abedini, Ehsan
    • Journal of Preventive Medicine and Public Health
    • /
    • v.52 no.2
    • /
    • pp.140-144
    • /
    • 2019
  • Objectives: Investigating the survival of patients with cancer is vitally necessary for controlling the disease and for assessing treatment methods. This study aimed to compare various statistical models of survival and to determine the survival rate and its related factors among patients suffering from lung cancer. Methods: In this retrospective cohort, the cumulative survival rate, median survival time, and factors associated with the survival of lung cancer patients were estimated using Cox, Weibull, exponential, and Gompertz regression models. Kaplan-Meier tables and the log-rank test were also used to analyze the survival of patients in different subgroups. Results: Of 102 patients with lung cancer, 74.5% were male. During the follow-up period, 80.4% died. The incidence rate of death among patients was estimated as 3.9 (95% confidence interval [CI], 3.1 to 4.8) per 100 person-months. The 5-year survival rate for all patients, males, females, patients with non-small cell lung carcinoma (NSCLC), and patients with small cell lung carcinoma (SCLC) was 17%, 13%, 29%, 21%, and 0%, respectively. The median survival time for all patients, males, females, those with NSCLC, and those with SCLC was 12.7 months, 12.0 months, 16.0 months, 16.0 months, and 6.0 months, respectively. Multivariate analyses indicated that the hazard ratios (95% CIs) for male sex, age, and SCLC were 0.56 (0.33 to 0.93), 1.03 (1.01 to 1.05), and 2.91 (1.71 to 4.95), respectively. Conclusions: Our results showed that the exponential model was the most precise. This model identified age, sex, and type of cancer as factors that predicted survival in patients with lung cancer.
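
The sketch below is not the study's analysis; it illustrates, on simulated data and assuming the lifelines package, how the same ingredients (Kaplan-Meier estimation, a log-rank test, and Cox and Weibull regression models with sex, age, and cancer type as covariates) are typically assembled in code:

```python
# Simulated cohort, illustrative only; not the study's dataset or results.
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter, WeibullAFTFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(3)
n = 102
df = pd.DataFrame({
    "male": rng.integers(0, 2, n),
    "age": rng.normal(65, 10, n).round(),
    "sclc": rng.integers(0, 2, n),          # 1 = small cell carcinoma
})
# Simulated survival times (months) and censoring indicator.
rate = 0.03 * np.exp(0.4 * df["sclc"] + 0.01 * (df["age"] - 65))
df["time"] = rng.exponential(1 / rate)
df["died"] = (rng.uniform(size=n) < 0.8).astype(int)

# Kaplan-Meier estimate and a log-rank test by cancer type.
km = KaplanMeierFitter().fit(df["time"], df["died"])
lr = logrank_test(df.loc[df.sclc == 1, "time"], df.loc[df.sclc == 0, "time"],
                  df.loc[df.sclc == 1, "died"], df.loc[df.sclc == 0, "died"])
print("median survival:", km.median_survival_time_, "log-rank p:", lr.p_value)

# Semi-parametric (Cox) and parametric (Weibull AFT) regression models.
cox = CoxPHFitter().fit(df, duration_col="time", event_col="died")
aft = WeibullAFTFitter().fit(df, duration_col="time", event_col="died")
print(cox.hazard_ratios_)
print(aft.params_.head())
```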

Assessments for MGARCH Models Using Back-Testing: Case Study (사후검증(Back-testing)을 통한 다변량-GARCH 모형의 평가: 사례분석)

  • Hwang, S.Y.;Choi, M.S.;Do, J.D.
    • The Korean Journal of Applied Statistics
    • /
    • v.22 no.2
    • /
    • pp.261-270
    • /
    • 2009
  • The current financial crisis, triggered by the shaky U.S. banking system, adds emphasis to the importance of volatility in controlling and understanding financial time series data. The ARCH and GARCH models have been useful in analyzing the volatilities of economic time series. In particular, multivariate GARCH (MGARCH, for short) provides both volatilities and conditional correlations between several time series, and these are in turn applied to computations of hedge ratios and value at risk (VaR). In this short article, we try to assess various MGARCH models with respect to their back-testing performance in a VaR study. To this end, 14 Korean stock prices are analyzed, and it is found that MGARCH outperforms the rolling-window approach, and that BEKK and CCC are relatively conservative in back-testing performance.
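
As a hedged, simplified sketch of the CCC idea (not the paper's BEKK/CCC comparison or its data), the code below fits univariate GARCH(1,1) models with the arch package, forms a constant correlation matrix from the standardized residuals, and counts in-sample 95% VaR violations for an equally weighted portfolio:

```python
# Simplified CCC-style sketch on simulated returns; illustrative only.
import numpy as np
import pandas as pd
from arch import arch_model
from scipy.stats import norm

rng = np.random.default_rng(4)
T, k = 1000, 3
returns = pd.DataFrame(rng.standard_t(df=6, size=(T, k)),
                       columns=[f"asset{i}" for i in range(k)])

vols, std_resid = {}, {}
for col in returns:
    res = arch_model(returns[col], vol="GARCH", p=1, q=1).fit(disp="off")
    vols[col] = res.conditional_volatility
    std_resid[col] = returns[col] / res.conditional_volatility

D = pd.DataFrame(vols)                      # conditional standard deviations
R = pd.DataFrame(std_resid).corr().values   # constant correlation matrix (CCC)

w = np.full(k, 1 / k)                       # equally weighted portfolio
port_var = np.einsum("ti,ij,tj->t", D.values * w, R, D.values * w)
var95 = -norm.ppf(0.05) * np.sqrt(port_var) # one-day 95% VaR

port_ret = returns.values @ w
violations = (port_ret < -var95).mean()     # crude in-sample back-test
print(f"observed violation rate: {violations:.3f} (nominal 0.05)")
```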

Modern vistas of process control

  • Georgakis, Christos
    • Institute of Control, Robotics and Systems: Conference Proceedings
    • /
    • 1996.10a
    • /
    • pp.18-18
    • /
    • 1996
  • This paper reviews some of the most prominent and promising areas of chemical process control in relation to both batch and continuous processes. These areas include the modeling, optimization, control, and monitoring of chemical processes and entire plants. Most of these areas explicitly utilize a model of the process. For this purpose, the types of models used are examined in some detail. These types of models are categorized into knowledge-driven and data-driven classes. In the areas of modeling and optimization, attention is paid to batch reactors using the Tendency Modeling approach. These Tendency models consist of data- and knowledge-driven components and are often called Gray or Hybrid models. In the case of continuous processes, emphasis is placed on the closed-loop identification of state space models and their use in Model Predictive Control of nonlinear processes, such as the Fluidized Catalytic Cracking process. The effective monitoring of multivariate processes is examined through statistical charts obtained using Principal Component Analysis (PCA). Static and dynamic charts account for the cross- and auto-correlation of the substantial number of variables measured on-line. Centralized and decentralized charts also aim at isolating the source of process disturbances so that they can be eliminated. Even though significant progress has been made during the last decade, the challenges for the next ten years are substantial. Present progress is strongly influenced by the economic benefits industry is deriving from the use of these advanced techniques. Future progress will be further catalyzed by the harmonious collaboration of university and industrial researchers.
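
A minimal sketch of the PCA-based monitoring charts mentioned above, assuming scikit-learn and simulated "normal operating condition" data: Hotelling's T^2 is computed on the retained components and the squared prediction error (SPE/Q) on the residual subspace, with rough empirical control limits:

```python
# Illustrative PCA monitoring sketch on simulated process data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Training data: 10 correlated variables driven by 3 latent factors.
n, p = 500, 10
W = rng.normal(size=(3, p))
X_train = rng.normal(size=(n, 3)) @ W + 0.3 * rng.normal(size=(n, p))

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

def t2_and_spe(X):
    """Hotelling's T^2 on retained components and SPE/Q on the residuals."""
    Z = scaler.transform(X)
    scores = pca.transform(Z)
    t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
    spe = np.sum((Z - pca.inverse_transform(scores)) ** 2, axis=1)
    return t2, spe

# Rough empirical 99% control limits from the training data.
t2_train, spe_train = t2_and_spe(X_train)
t2_lim, spe_lim = np.quantile(t2_train, 0.99), np.quantile(spe_train, 0.99)

# Monitor new data containing a simulated sensor fault on variable 0.
X_new = rng.normal(size=(50, 3)) @ W + 0.3 * rng.normal(size=(50, p))
X_new[25:, 0] += 4.0
t2_new, spe_new = t2_and_spe(X_new)
print("T2 alarms at samples:", np.flatnonzero(t2_new > t2_lim))
print("SPE alarms at samples:", np.flatnonzero(spe_new > spe_lim))
```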

Hybrid Learning Architectures for Advanced Data Mining: An Application to Binary Classification for Fraud Management (개선된 데이터마이닝을 위한 혼합 학습구조의 제시)

  • Kim, Steven H.;Shin, Sung-Woo
    • Journal of Information Technology Application
    • /
    • v.1
    • /
    • pp.173-211
    • /
    • 1999
  • The task of classification permeates all walks of life, from business and economics to science and public policy. In this context, nonlinear techniques from artificial intelligence have often proven to be more effective than the methods of classical statistics. The objective of knowledge discovery and data mining is to support decision making through the effective use of information. The automated approach to knowledge discovery is especially useful when dealing with large data sets or complex relationships. For many applications, automated software may find subtle patterns which escape the notice of manual analysis, or whose complexity exceeds the cognitive capabilities of humans. This paper explores the utility of a collaborative learning approach involving integrated models in the preprocessing and postprocessing stages. For instance, a genetic algorithm effects feature-weight optimization in a preprocessing module. Moreover, inductive tree, artificial neural network (ANN), and k-nearest neighbor (kNN) techniques serve as postprocessing modules. More specifically, the postprocessors act as second-order classifiers which determine the best first-order classifier on a case-by-case basis. In addition to the second-order models, a voting scheme is investigated as a simple, but efficient, postprocessing model. The first-order models consist of statistical and machine learning models such as logistic regression (logit), multivariate discriminant analysis (MDA), ANN, and kNN. The genetic algorithm, inductive decision tree, and voting scheme act as kernel modules for collaborative learning. These ideas are explored against the background of a practical application relating to financial fraud management which exemplifies a binary classification problem.
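
The sketch below illustrates only the voting-scheme postprocessing idea with scikit-learn stand-ins for the first-order classifiers (logistic regression, linear discriminant analysis, a small neural network, and kNN); the genetic-algorithm feature-weighting preprocessor and the case-by-case second-order selection are omitted, and the data are simulated:

```python
# Voting over first-order classifiers on an imbalanced, fraud-like toy dataset.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

first_order = [
    ("logit", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("lda", LinearDiscriminantAnalysis()),
    ("ann", make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                        random_state=0))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))),
]

# Soft voting combines the predicted probabilities of the first-order models.
voter = VotingClassifier(estimators=first_order, voting="soft")
for name, model in first_order + [("voting", voter)]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:8s} mean ROC AUC: {score:.3f}")
```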

Multivariate quantile regression tree (다변량 분위수 회귀나무 모형에 대한 연구)

  • Kim, Jaeoh;Cho, HyungJun;Bang, Sungwan
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.3
    • /
    • pp.533-545
    • /
    • 2017
  • Quantile regression models provide a variety of useful statistical information by estimating the conditional quantile function of the response variable. However, the traditional linear quantile regression model can lead to distorted and incorrect results when analysing real data with a nonlinear relationship between the explanatory variables and the response variables. Furthermore, as the complexity of the data increases, multiple response variables must be analysed simultaneously, with more sophisticated interpretations. For such reasons, we propose a multivariate quantile regression tree model. In this paper, a new split variable selection algorithm is suggested for a multivariate regression tree model. This algorithm can select the split variable more accurately than the previous method, without significant selection bias. We investigate the performance of our proposed method with both simulation and real data studies.
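
As a toy illustration of the split criterion underlying a quantile regression tree (not the authors' split variable selection algorithm), the sketch below scores candidate thresholds on one explanatory variable by the reduction in check (pinball) loss, summed over the components of a bivariate response:

```python
# Toy split-criterion sketch for a quantile regression tree; illustrative only.
import numpy as np

def check_loss(y, q, tau):
    """Pinball loss of a node whose prediction is its tau-quantile q."""
    u = y - q
    return np.sum(np.maximum(tau * u, (tau - 1) * u))

def node_loss(Y, tau):
    """Sum of component-wise pinball losses at the node quantiles."""
    return sum(check_loss(Y[:, j], np.quantile(Y[:, j], tau), tau)
               for j in range(Y.shape[1]))

def best_split(x, Y, tau=0.5):
    """Best threshold on one explanatory variable for a multivariate response."""
    parent = node_loss(Y, tau)
    best = (None, 0.0)
    for c in np.unique(x)[:-1]:
        left, right = Y[x <= c], Y[x > c]
        gain = parent - node_loss(left, tau) - node_loss(right, tau)
        if gain > best[1]:
            best = (c, gain)
    return best

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 300)
Y = np.column_stack([np.where(x > 0.5, 2.0, 0.0) + rng.normal(size=300),
                     np.where(x > 0.5, -1.0, 1.0) + rng.normal(size=300)])
print("best split and loss reduction:", best_split(x, Y, tau=0.5))
```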

Principal selected response reduction in multivariate regression (다변량회귀에서 주선택 반응변수 차원축소)

  • Yoo, Jae Keun
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.4
    • /
    • pp.659-669
    • /
    • 2021
  • Multivariate regression often appears in longitudinal or functional data analysis. Since multivariate regression involves multi-dimensional response variables, it is more strongly affected by the so-called curse of dimensionality than univariate regression. To overcome this issue, Yoo (2018) and Yoo (2019a) proposed three model-based response dimension reduction methodologies. According to various numerical studies in Yoo (2019a), the default method suggested in Yoo (2019a) is least sensitive to the simulated models, but it is not the best one. To address this issue, this paper proposes a selection algorithm that compares the other two methods with the default one. This approach is called principal selected response reduction. Various simulation studies show that the proposed method provides more accurate estimation results than the default method of Yoo (2019a), which confirms the practical and empirical usefulness of the proposed method over the default one.
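
The sketch below is emphatically not the model-based reduction of Yoo (2018, 2019a) nor the proposed selection algorithm; it only illustrates, with simulated data and scikit-learn, why reducing a multi-dimensional response before regression can help when the response varies along a low-dimensional direction:

```python
# Generic response-reduction illustration (PCA on Y); not the paper's method.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n, p, r = 200, 5, 10
X = rng.normal(size=(n, p))

# The 10-dimensional response actually varies along a single direction.
signal = X @ rng.normal(size=p)
direction = rng.normal(size=r)
Y = np.outer(signal, direction) + 0.5 * rng.normal(size=(n, r))

# Reduce the response to one component, then regress it on X.
pca = PCA(n_components=1).fit(Y)
y_reduced = pca.transform(Y).ravel()
fit = LinearRegression().fit(X, y_reduced)
print("explained response variance:", pca.explained_variance_ratio_[0].round(3))
print("R^2 of the reduced-response regression:", fit.score(X, y_reduced).round(3))
```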