• Title/Summary/Keyword: Statistical Data

On statistical Computing via EM Algorithm in Logistic Linear Models Involving Non-ignorable Missing data

  • Jun, Yu-Na;Qian, Guoqi;Park, Jeong-Soo
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.181-186 / 2005
  • Data sets obtained from surveys or medical trials often include missing observations. Such data sets are commonly analyzed using only the complete cases, but this can introduce large biases or lose efficiency. In this paper, we consider a method for estimating parameters in logistic linear models with a non-ignorable missing-data mechanism. A binomial response and a normal explanatory model for the missing data are used. We fit the model using the EM algorithm. The E-step uses the Metropolis-Hastings algorithm to generate a sample for the missing data, together with a Monte Carlo technique, and the M-step uses Newton-Raphson to maximize the likelihood function. Asymptotic variances of the MLEs are derived, and the parameter estimates and their standard errors are compared.


Principles of Multivariate Data Visualization

  • Huh, Moon Yul;Cha, Woon Ock
    • Communications for Statistical Applications and Methods / v.11 no.3 / pp.465-474 / 2004
  • Data visualization applies automation and discovery processes to data sets in an effort to uncover the underlying information in the data. It provides rich visual depictions of the data. It has distinct advantages over traditional data analysis techniques, such as exploring the structure of data sets that are large in both the number of observations and the number of variables, by allowing rich interaction between the data and the end user. We discuss the principles of data visualization and evaluate the characteristics of various visualization tools according to these principles.

Application Scheme of Hybrid Data Mining for Fused Data in Statistical Survey (통계조사에서의 퓨전된 자료에 대한 하이브리드 데이터마이닝의 적용 방안)

  • Park, Hee-Chang;Cho, Kwang-Hyun
    • The Korean Journal of Applied Statistics / v.21 no.3 / pp.399-411 / 2008
  • Today, statistical surveys are carried out in various forms to support the decision-making and administration of organizations. Different items are used in a survey depending on the purpose of the study. Currently, Gyeongnam province conducts a social index survey of its residents every year. However, analysis of this survey is limited because different surveys are conducted in three-year cycles. Data fusion is a solution to this problem. Data fusion is generally defined as the use of techniques that combine data from multiple sources in order to raise the quality of information. However, data fusion is not an end in itself. Therefore, efficient analysis of the fused data is also important. In this study, we propose a methodology for applying neural networks with latent variables to fused statistical survey data.

Nonlinear Regression with Censored Data

  • Shin, D.W.;Bai, D.S.
    • Journal of the Korean Statistical Society / v.12 no.1 / pp.46-56 / 1983
  • An algorithm based on the EM procedure is proposed for finding maximum likelihood estimators in nonlinear regression with censored data, and the asymptotic properties of the estimator are investigated in detail. Some numerical examples are also given.

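To illustrate the kind of EM iteration the abstract above refers to, here is a minimal sketch for a much simpler special case than the paper's: a normal sample with a constant mean (no regression function) and right-censored observations. The function name `em_censored_normal` and the toy data are hypothetical, and this is not the authors' algorithm. The E-step replaces each censored value by its conditional first and second moments given that it exceeds the censoring point; the M-step recomputes the mean and standard deviation from those moments.

```python
from statistics import NormalDist, fmean, pstdev


def em_censored_normal(values, censored, max_iter=500, tol=1e-10):
    """EM estimates of (mu, sigma) for a right-censored normal sample.

    values[i] is the observed value, or the censoring point when
    censored[i] is True (the true value is only known to exceed it).
    Assumes the naive starting sigma is positive whenever any value is censored.
    """
    std_normal = NormalDist()
    mu, sigma = fmean(values), pstdev(values)  # naive starting values
    for _ in range(max_iter):
        first, second = [], []  # E[Y_i] and E[Y_i^2] under current parameters
        for y, is_censored in zip(values, censored):
            if is_censored:
                # E-step: moments of a normal variable truncated below at y.
                z = (y - mu) / sigma
                hazard = std_normal.pdf(z) / (1.0 - std_normal.cdf(z))
                first.append(mu + sigma * hazard)
                second.append(mu**2 + sigma**2 + sigma * (y + mu) * hazard)
            else:
                first.append(y)
                second.append(y * y)
        # M-step: closed-form normal MLE from the expected sufficient statistics.
        new_mu = fmean(first)
        new_sigma = max(fmean(second) - new_mu**2, 0.0) ** 0.5
        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


# Toy data: the last two observations are censored at 4.0.
mu_hat, sigma_hat = em_censored_normal(
    [1.0, 2.0, 3.0, 4.0, 4.0], [False, False, False, True, True]
)
```

In a nonlinear regression setting, the constant `mu` would be replaced by a regression function of the covariates, and the M-step would generally require a numerical optimizer rather than a closed form.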

Rainstorm Tracking Using Statistical Analysis Method (통계적 기법을 이용한 국지성집중호우의 이동경로 분석)

  • Kim Sooyoung;Nam Woo-Sung;Heo Jun-Haeng
    • Proceedings of the Korea Water Resources Association Conference / 2005.05b / pp.194-198 / 2005
  • Although rainstorms cause local damage on a large scale, it is difficult to predict their movement exactly. To reduce rainstorm damage, it is necessary to analyze the path of a rainstorm using various statistical methods. In addition, an efficient time interval of rainfall observation for analyzing rainstorm movement can be derived by applying statistical methods to rainfall data. In this study, rainstorm tracking using statistical methods is performed for various types of rainfall data. For tracking, the methods of temporal distribution, inclined plane equations, and cross correlation were applied to various types of data, including electromagnetic rainfall gauge data and AWS data. The speed and direction obtained from each method were compared with those of the real rainfall movement. In addition, the effective time interval of rainfall observation for analyzing rainstorm movement was investigated for selected time intervals of 10, 20, 30, 40, 50, and 60 minutes. As a result, the absolute relative errors of the inclined plane equations method are smaller than those of the other methods for electromagnetic rainfall gauge data, and the absolute relative errors of the cross correlation method are smaller than those of the other methods for AWS data. The absolute relative errors for time intervals of 30 minutes or less are smaller than those for the other time intervals.

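The cross correlation method named in the abstract above can be sketched as follows, under assumptions not stated in the source: two gauges a known distance apart record rainfall at a fixed interval, and the storm's travel time between them is taken as the lag that maximizes the Pearson correlation between the two series. The function names, the 12 km spacing, and the toy series are hypothetical, and the paper's actual procedure may differ.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_lag(first_gauge, second_gauge, max_lag):
    """Return (lag, correlation) for the lag in 0..max_lag with the highest correlation.

    A lag of k means the second gauge sees the first gauge's signal k samples later.
    """
    best = (0, pearson(first_gauge, second_gauge))
    for lag in range(1, max_lag + 1):
        r = pearson(first_gauge[:-lag], second_gauge[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


def storm_speed_kmh(distance_km, lag, interval_min):
    """Speed implied by a travel time of lag * interval_min; None if the lag is zero."""
    if lag == 0:
        return None
    return distance_km / (lag * interval_min / 60.0)


# Toy 10-minute rainfall series (mm); gauge B sees the same burst two samples later.
gauge_a = [0, 1, 5, 9, 4, 1, 0, 0, 0, 0]
gauge_b = [0, 0, 0, 1, 5, 9, 4, 1, 0, 0]
lag, r = best_lag(gauge_a, gauge_b, max_lag=4)
speed = storm_speed_kmh(distance_km=12.0, lag=lag, interval_min=10)
```

With a coarser observation interval the lag can only be resolved in larger steps, which is one plausible reason the choice of time interval affects tracking error.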

Statistical Metadata for Users: A Case Study on the Level of Metadata Provision on Statistical Agency Websites (웹 이용자를 위한 통계 메타데이터: 통계정보 제공사이트의 메타데이터 제공 수준 평가 사례 연구)

  • Oh, Jung-Sun
    • Journal of the Korean Society for Information Management / v.24 no.2 / pp.161-179 / 2007
  • As increasingly diverse kinds of information materials become available on the Internet, it becomes a challenge to define an adequate level of metadata provision for each type of material in the context of digital libraries. This study explores issues of metadata provision for a particular type of material, statistical tables. Statistical data always involves numbers and numeric values, which should be interpreted with an understanding of the underlying concepts and constructs. Because of these unique data characteristics, metadata in the statistical domain is essential not only for finding and discovering relevant data, but also for understanding and using the data found. However, statistical metadata research has put more emphasis on the question of what metadata is necessary for processing the data and less on what metadata should be presented to users. In this study, a case study was conducted to gauge the status of metadata provision for statistical tables on the Internet. The websites of two federal statistical agencies in the United States were selected, and a content analysis method was used for that purpose. The results, which show insufficient and inconsistent provision of metadata, demonstrate the need for more discussion of statistical metadata from the ordinary web user's perspective.

PREDICTION OF DAILY MAXIMUM X-RAY FLUX USING MULTILINEAR REGRESSION AND AUTOREGRESSIVE TIME-SERIES METHODS

  • Lee, J.Y.;Moon, Y.J.;Kim, K.S.;Park, Y.D.;Fletcher, A.B.
    • Journal of The Korean Astronomical Society / v.40 no.4 / pp.99-106 / 2007
  • Statistical analyses were performed to investigate the relative success and accuracy of daily maximum X-ray flux (MXF) predictions, using both multilinear regression and autoregressive time-series prediction methods. As input data for this work, we used 14 solar activity parameters recorded over the prior 2-year period (1989-1990) during the solar maximum of cycle 22. We applied the multilinear regression method to the following three groups: all 14 variables (G1), the 2 so-called 'cause' variables (sunspot complexity and sunspot group area) showing the highest correlations with MXF (G2), and the 2 'effect' variables (previous-day MXF and the number of flares stronger than C4 class) showing the highest correlations with MXF (G3). For the three-day-ahead forecast, we applied the autoregressive time-series method to the MXF data (GT). We compared the statistical results of these groups for 1991 data, using several statistical measures obtained from a $2 \times 2$ contingency table for forecasted versus observed events. As a result, we found that the statistical results of G1 and G3 are nearly the same as each other and that the 'effect' variables (G3) are more reliable predictors than the 'cause' variables. We also found that while the statistical results of GT are a little worse than those of G1 for relatively weak flares, they are comparable to each other for strong flares. In general, all statistical measures show good predictions from all groups, provided that the flares are weaker than about M5 class; stronger flares rapidly become difficult to predict well, which is probably due to statistical inaccuracies arising from their rarity. Our statistical results for all flares except the X-class flares were confirmed by Yates' $\chi^2$ statistical significance tests at the 99% confidence level. Based on our model testing, we recommend a practical strategy for solar X-ray flare predictions.
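As a rough illustration of the verification step described above, the sketch below computes a few measures from a 2x2 forecast-versus-observed contingency table, together with the Yates-corrected chi-square statistic. The abstract does not name the measures used, so probability of detection, false alarm ratio, and critical success index are assumptions here; the function name and the counts in the example are hypothetical. The 99% threshold of 6.635 is the chi-square critical value for one degree of freedom.

```python
def contingency_scores(hits, false_alarms, misses, correct_negatives):
    """Forecast-verification measures for a 2x2 table of forecast vs. observed events.

    Assumes every marginal total is non-zero.
    """
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    n = a + b + c + d
    pod = a / (a + c)      # probability of detection: observed events that were forecast
    far = b / (a + b)      # false alarm ratio: forecast events that did not occur
    csi = a / (a + b + c)  # critical success index
    # Yates' continuity correction subtracts n/2 from |ad - bc| (floored at zero).
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return {
        "pod": pod,
        "far": far,
        "csi": csi,
        "chi2_yates": chi2,
        # Chi-square critical value for 1 degree of freedom at the 99% level.
        "significant_99": chi2 > 6.635,
    }


# Hypothetical counts: 30 hits, 10 false alarms, 5 misses, 55 correct negatives.
scores = contingency_scores(30, 10, 5, 55)
```

When an event class is rare, as with the strongest flares, some cells of the table hold very few counts and the chi-square approximation becomes unreliable, which is consistent with the abstract's remark about statistical inaccuracies arising from rarity.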

A Statistical Analysis of Professional Baseball Team Data: The Case of the Lotte Giants

  • Cho, Young-Seuk;Han, Jun-Tae;Park, Chan-Keun;Heo, Tae-Young
    • The Korean Journal of Applied Statistics / v.23 no.6 / pp.1191-1199 / 2010
  • Knowing what factors into a player's ability to affect the outcome of a sports game is crucial. This knowledge helps determine the relative contribution of each team member and set appropriate annual salaries. This study uses statistical analysis to investigate how much the outcome of a professional baseball game is influenced by the records of individual players. We used the Lotte Giants' data on 252 games played between 2007 and 2008, which included environmental data (home or away games and opponents) as well as pitchers' and batters' data. Using SAS Enterprise Miner, we performed a logistic regression analysis and a decision tree analysis on the data. The results obtained through the two analytic methods are compared and discussed.

A case study of competing risk analysis in the presence of missing data

  • Limei Zhou;Peter C. Austin;Husam Abdel-Qadir
    • Communications for Statistical Applications and Methods / v.30 no.1 / pp.1-19 / 2023
  • Observational data with missing or incomplete values are common in biomedical research. Multiple imputation is an effective approach to handling missing data, with the ability to decrease bias while increasing statistical power and efficiency. In recent years, propensity score (PS) matching has been increasingly used in observational studies to estimate treatment effects, as it can reduce confounding due to measured baseline covariates. In this paper, we describe in detail approaches to competing risk analysis in the setting of incomplete observational data when using PS matching. First, we used multiple imputation to impute several missing variables simultaneously, then conducted PS matching to match statin-exposed patients with unexposed patients. Afterwards, we assessed the effect of statin exposure on the risk of heart failure-related hospitalizations or emergency visits by estimating both relative and absolute effects. Collectively, we provide a general methodological framework for assessing treatment effects in incomplete observational data. In addition, we present a practical approach to producing an overall cumulative incidence function (CIF) based on estimates from multiple imputed and PS-matched samples.
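The abstract above does not say how estimates from the imputed data sets are combined. A common choice for a scalar estimate (for example, a log hazard ratio) is Rubin's rules, sketched below as an assumption rather than as the authors' procedure; the function name and the numbers are hypothetical. The pooled estimate is the mean across imputations, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m).

```python
from statistics import fmean, variance


def pool_rubin(estimates, variances):
    """Pool m >= 2 per-imputation estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled = fmean(estimates)       # pooled point estimate
    within = fmean(variances)       # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from three imputed, PS-matched data sets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Pooling a whole curve such as a CIF usually applies the same idea pointwise at each time, possibly after a transformation; whether the paper does this is not stated in the abstract.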

Applications of response dimension reduction in large p-small n problems

  • Minjee Kim;Jae Keun Yoo
    • Communications for Statistical Applications and Methods / v.31 no.2 / pp.191-202 / 2024
  • The goal of this paper is to show how multivariate regression analysis with high-dimensional responses is facilitated by response dimension reduction. Multivariate regression, characterized by multi-dimensional response variables, is increasingly prevalent across diverse fields such as repeated measures, longitudinal studies, and functional data analysis. One of the key challenges in analyzing such data is managing the response dimensions, which can complicate the analysis due to an exponential increase in the number of parameters. Although response dimension reduction methods have been developed, there is no practically useful illustration for various types of data, such as so-called large p-small n data. This paper aims to fill this gap by showcasing how response dimension reduction can enhance the analysis of high-dimensional response data, thereby providing significant assistance to statistical practitioners and contributing to advancements in multiple scientific domains.