• Title/Summary/Keyword: Statistical Data

Detection of Differentially Expressed Genes by Clustering Genes Using Class-Wise Averaged Data in Microarray Data

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods / v.14 no.3 / pp.687-698 / 2007
  • A normal mixture model that incorporates dependence between classes is proposed for detecting differentially expressed genes. Gene clustering approaches suffer from the high column dimension of the microarray expression data matrix, which leads to over-fitting. Various methods have been proposed to solve this problem. In this paper, simple averaging of the data within each class is proposed to overcome the problems caused by high dimensionality when the normal mixture model is fitted. Experiments on a simulated data set and a real data set show that the approach is usable in practice.
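
The class-wise averaging step described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the toy matrix, and the class labels are all hypothetical, and the subsequent normal mixture fit is not shown.

```python
from collections import defaultdict
from statistics import fmean

def class_wise_average(expression, labels):
    """Reduce each gene's profile to one mean per class.

    expression: list of rows, one row per gene, one column per sample.
    labels: class label of each column (sample).
    Returns (classes, averaged), where averaged[g][j] is the mean of
    gene g over the samples that belong to classes[j].
    """
    columns_by_class = defaultdict(list)
    for col, label in enumerate(labels):
        columns_by_class[label].append(col)
    classes = sorted(columns_by_class)
    averaged = [
        [fmean([row[c] for c in columns_by_class[k]]) for k in classes]
        for row in expression
    ]
    return classes, averaged

# Hypothetical toy data: 3 genes, 4 samples in two classes.
expression = [
    [1.0, 3.0, 10.0, 12.0],
    [5.0, 5.0, 5.0, 5.0],
    [2.0, 4.0, 1.0, 3.0],
]
labels = ["control", "control", "treated", "treated"]
classes, averaged = class_wise_average(expression, labels)
```

Each gene is now described by as many values as there are classes rather than samples, which is the dimension reduction the abstract relies on before clustering.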

Binary classification on compositional data

  • Joo, Jae Yun;Lee, Seokho
    • Communications for Statistical Applications and Methods / v.28 no.1 / pp.89-97 / 2021
  • Because of their boundedness and sum constraint, compositional data are often transformed by a logratio transformation, and the transformed data are then fed into traditional binary classification or discriminant analysis. However, directly applying traditional multivariate approaches to the transformed data may be problematic, because the class distributions are not Gaussian and the Bayes decision boundary is not polynomial on the transformed space. In this study, we propose applying flexible classification approaches to the transformed data for compositional data classification. Empirical studies using synthetic and real examples demonstrate that flexible approaches outperform traditional multivariate classification or discriminant analysis.
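
As an illustration of a logratio transformation, the sketch below implements the centered logratio (clr). The abstract does not say which logratio variant the authors use, so the choice of clr and the function name are assumptions made only for this example.

```python
import math

def clr(composition):
    """Centered logratio: log of each part minus the mean of the logs,
    i.e. the log of each part divided by the geometric mean."""
    if any(part <= 0 for part in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical three-part composition that sums to 1.
z = clr([0.2, 0.3, 0.5])
```

The transformed coordinates sum to zero and do not change when the composition is rescaled, so the sum constraint no longer restricts the values a classifier sees.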

PREDICTION OF 23RD SOLAR CYCLE USING THE STATISTICAL AND PRECURSOR METHOD (통계 및 프리커서 방법을 이용한 제23주기 태양활동예보)

  • JANG SE JIN;KIM KAP-SUNG
    • Publications of The Korean Astronomical Society / v.14 no.2 / pp.91-102 / 1999
  • We have made intensive calculations of the maximum relative sunspot number and the date of solar maximum of the 23rd solar cycle, using statistical and precursor methods for predicting the solar activity cycle. According to our processing of solar data by the statistical method, the solar maximum falls between February and July 2000, when the smoothed sunspot number will reach $114.3 \sim 122.8$, while the precursor method gives a rather dispersed maximum sunspot number of $118 \sim 17$. From a full analysis of the solar cycle patterns in these data, we find that prediction by the statistical method using the smoothed relative sunspot number is more accurate than prediction by any method using the 10.7 cm radio flux or the geomagnetic aa and Ap indices. In fact, the current ascending pattern of the 23rd solar cycle supports our predicted values. The results predicted by the precursor method for the $Ap_{avg}$ and $aa_{31-36}$ indices are similar to those from the statistical method. Therefore, these indices can be used as new precursors for predicting the 23rd or the next solar cycle.

Patent and Statistics, What's the Connection? (특허와 통계학, 그 연결은?)

  • Jun, Sung-Hae;Uhm, Dai-Ho
    • Communications for Statistical Applications and Methods / v.17 no.2 / pp.205-222 / 2010
  • A patent is an intellectual property right granted to an inventor or the inventor's assignee for a limited period under international law. Patents matter not only for the invention of new machines: worldwide competition in using and creating technology is also based on them. Most business models are good examples of patented technology, but a statistical analysis model can be another. In this paper we study and analyze the patents on statistical analysis and data mining models that are currently applied for and registered, and we suggest a statistical tool for analyzing and categorizing patent data. For this study, all patents in Korea and the U.S. are listed and searched in order to sample only the cases concerning statistics.

Application of Bayesian Statistical Analysis to Multisource Data Integration

  • Hong, Sa-Hyun;Moon, Wooil-M.
    • Proceedings of the KSRS Conference / 2002.10a / pp.394-399 / 2002
  • In this paper, multisource data classification methods based on the Bayesian formula are considered. In this decision fusion scheme, the individual data sources are handled separately by statistical classification algorithms, and a Bayesian fusion method is then applied to integrate the information from the available data sources. The method combines the decisions of the individual experts, with each expert's weight representing the reliability of its source. In previous work, the reliability measure used in the statistical approach was common to all pixels. In this experiment, the weight factors are allowed to differ from pixel to pixel in order to improve the accuracy of the integrated classification. Although most implementations of Bayesian classification assume fixed a priori probabilities, we use adaptive a priori probabilities, iteratively calculating the local a priori probabilities so as to maximize the a posteriori probabilities. The effectiveness of the proposed method is first demonstrated on simulations with artificial data and then evaluated on real-world data sets. The results show that the Bayesian statistical fusion scheme performs well on multispectral data classification.
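
One common way to combine per-source class posteriors with reliability weights is a logarithmic opinion pool. The abstract does not give the authors' fusion formula, so the sketch below is an assumed stand-in; the function name, posteriors, and weights are hypothetical.

```python
import math

def fuse_posteriors(source_posteriors, weights):
    """Logarithmic opinion pool for one pixel.

    source_posteriors: one list of class probabilities per data source
                       (all entries must be strictly positive).
    weights: one reliability weight per data source.
    Returns the renormalized product of the posteriors raised to the weights.
    """
    n_classes = len(source_posteriors[0])
    log_scores = [0.0] * n_classes
    for posterior, weight in zip(source_posteriors, weights):
        for c, p in enumerate(posterior):
            log_scores[c] += weight * math.log(p)
    peak = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical pixel seen by two sources; the second is trusted half as much.
fused = fuse_posteriors([[0.7, 0.3], [0.4, 0.6]], weights=[1.0, 0.5])
```

Pixel-specific reliability, as described in the abstract, would correspond to passing a different `weights` list for each pixel.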

A study on the behavior of cosmetic customers (화장품구매 자료를 통한 고객 구매행태 분석)

  • Cho, Dae-Hyeon;Kim, Byung-Soo;Seok, Kyung-Ha;Lee, Jong-Un;Kim, Jong-Sung;Kim, Sun-Hwa
    • Journal of the Korean Data and Information Science Society / v.20 no.4 / pp.615-627 / 2009
  • In micro-marketing promotion, it is important to know how customers behave. In this study we are interested in forecasting customers' repurchases from their behavior. By analyzing cosmetics transaction data, we derive variables that play an important role in understanding customer behavior and in modeling repurchase. As modeling tools we use a decision tree, logistic regression, and a neural network model. We choose the decision tree as the final model because it yields the smallest RASE (root average squared error) and the highest correct classification rate.
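
The two model-selection criteria named in this abstract can be computed as in the sketch below. The 0.5 cutoff, the function names, and the toy predictions are assumptions for illustration; the abstract does not state how predicted probabilities were turned into classes.

```python
import math

def rase(actual, predicted_prob):
    """Root average squared error between 0/1 outcomes and predicted probabilities."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted_prob)]
    return math.sqrt(sum(squared) / len(squared))

def correct_classification_rate(actual, predicted_prob, threshold=0.5):
    """Share of cases whose thresholded prediction matches the 0/1 outcome."""
    hits = sum((p >= threshold) == bool(a) for a, p in zip(actual, predicted_prob))
    return hits / len(actual)

# Hypothetical repurchase outcomes (1 = repurchased) and model scores.
actual = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.1]
error = rase(actual, scores)
rate = correct_classification_rate(actual, scores)
```

A lower RASE and a higher correct classification rate both favor a model, which is how the abstract justifies choosing the decision tree.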

Incremental Multi-classification by Least Squares Support Vector Machine

  • Oh, Kwang-Sik;Shim, Joo-Yong;Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society / v.14 no.4 / pp.965-974 / 2003
  • In this paper we propose an incremental method for classifying multi-class data sets with LS-SVM. By encoding the output variable of the training data set appropriately, we obtain new output vectors for the training data. Online LS-SVM is then applied to each newly encoded output vector. The proposed method reduces the computational cost and allows training to be performed incrementally. Using an incremental formulation of the inverse matrix, the current information and the new input data are combined to build a new inverse matrix for estimating the optimal bias and Lagrange multipliers, so the computational difficulties of large-scale matrix inversion can be avoided. The performance of the proposed method is shown through numerical studies and compared with that of an artificial neural network.
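
The idea of building a new inverse from the current inverse and one new data point can be illustrated with the standard block (Schur complement) update for a symmetric matrix. This is a generic linear-algebra sketch, assumed here for illustration; the abstract does not specify the exact update the authors use, and the function name is hypothetical.

```python
def grow_inverse(a_inv, b, c):
    """Inverse of [[A, b], [b^T, c]] given the inverse of symmetric A.

    a_inv: current inverse, as a list of rows.
    b: new off-diagonal column (list of length n).
    c: new diagonal entry.
    """
    n = len(b)
    # u = A^{-1} b
    u = [sum(a_inv[i][j] * b[j] for j in range(n)) for i in range(n)]
    # Schur complement s = c - b^T A^{-1} b
    s = c - sum(b[i] * u[i] for i in range(n))
    if s == 0:
        raise ValueError("extended matrix is singular")
    top = [
        [a_inv[i][j] + u[i] * u[j] / s for j in range(n)] + [-u[i] / s]
        for i in range(n)
    ]
    bottom = [-u[j] / s for j in range(n)] + [1.0 / s]
    return top + [bottom]

# Start from A = [[2]] and extend it to [[2, 1], [1, 3]].
new_inv = grow_inverse([[0.5]], b=[1.0], c=3.0)
```

Each update costs on the order of n squared operations, compared with roughly n cubed for inverting the enlarged matrix from scratch.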

Statistical Analysis of K-League Data using Poisson Model

  • Kim, Yang-Jin
    • The Korean Journal of Applied Statistics / v.25 no.5 / pp.775-783 / 2012
  • Several statistical models for bivariate Poisson data are suggested and used to analyze 2011 K-League data. The study has two aims. The first is to estimate the potential attacking and defensive abilities of each team. In particular, a bivariate Poisson model with diagonal inflation is used to estimate draws, and a joint model is applied to estimate the association between the Poisson distribution and the probability of a draw. The second is to investigate the factors affecting the scoring times of goals, for which a regression technique for recurrent event data is applied. Some related future work is suggested.
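
A bivariate Poisson model with diagonal inflation can be sketched as below, using the usual trivariate-reduction form of the bivariate Poisson and a mixture that adds extra mass to drawn scores. The parameter values and the Poisson distribution placed on the diagonal are hypothetical choices for illustration, not values or choices taken from the paper.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X = x, Y = y) for X = W1 + W3, Y = W2 + W3 with independent
    Poisson W1, W2, W3 of rates lam1, lam2, lam3."""
    total = 0.0
    for k in range(min(x, y) + 1):
        total += (
            lam1 ** (x - k) / math.factorial(x - k)
            * lam2 ** (y - k) / math.factorial(y - k)
            * lam3 ** k / math.factorial(k)
        )
    return math.exp(-(lam1 + lam2 + lam3)) * total

def diagonal_inflated_pmf(x, y, lam1, lam2, lam3, p, draw_pmf):
    """Mixture: weight 1 - p on the bivariate Poisson, weight p on a
    distribution draw_pmf that lives only on the diagonal x == y."""
    value = (1.0 - p) * bivariate_poisson_pmf(x, y, lam1, lam2, lam3)
    if x == y:
        value += p * draw_pmf(x)
    return value

# Hypothetical rates for home goals, away goals, and the shared component.
rates = (1.4, 1.1, 0.2)
draw_plain = sum(bivariate_poisson_pmf(k, k, *rates) for k in range(31))
draw_inflated = sum(
    diagonal_inflated_pmf(k, k, *rates, 0.1, lambda n: poisson_pmf(n, 1.0))
    for k in range(31)
)
```

The inflated model assigns more probability to draws than the plain bivariate Poisson with the same rates, which is the reason for adding the diagonal component.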

Restricted maximum likelihood estimation of a censored random effects panel regression model

  • Lee, Minah;Lee, Seung-Chun
    • Communications for Statistical Applications and Methods / v.26 no.4 / pp.371-383 / 2019
  • Panel data sets have been developed in various areas, and many recent studies have analyzed panel, or longitudinal, data sets. Maximum likelihood (ML) may be the most common statistical method for analyzing panel data models; however, inference based on the ML estimate has an inflated Type I error, because the ML method tends to give a downwardly biased estimate of the variance components when the sample size is small. The underestimation can be severe when the data are incomplete. This paper proposes the restricted maximum likelihood (REML) method for a random effects panel data model with a censored dependent variable. The likelihood function of the model is complex in that it includes a multidimensional integral. Many authors have proposed integral approximation methods for computing the likelihood function; however, it is well known that such methods are inadequate for high-dimensional integrals in practice. This paper introduces the use of the moments of a truncated multivariate normal random vector to calculate the multidimensional integral. In addition, a proper asymptotic standard error of the REML estimate is given.
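
The downward bias of ML variance estimates can be seen in the simplest possible setting, an i.i.d. normal sample with unknown mean, where ML divides the sum of squares by n and REML divides by n - 1. This toy case is far simpler than the censored panel model of the paper and is offered only as an illustration; the sample values are hypothetical.

```python
from statistics import pvariance, variance

# Hypothetical small sample.
sample = [4.1, 5.3, 3.8, 6.0, 5.2]
n = len(sample)

ml_estimate = pvariance(sample)   # sum of squares / n
reml_estimate = variance(sample)  # sum of squares / (n - 1)
ratio = ml_estimate / reml_estimate  # equals (n - 1) / n
```

With n = 5 the ML estimate is 20 percent smaller than the REML estimate, and the gap shrinks as n grows, which matches the abstract's remark that the problem is most serious for small samples.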

Application of Multivariate Statistical Analysis Technique in Landfill Investigation (매립물 특성 조사를 위한 다변량 통계분석 기법의 응용)

  • Kwon, Byung-Doo;Kim, Cha-Soup
    • Journal of the Korean Earth Science Society / v.18 no.6 / pp.515-521 / 1997
  • To investigate the nature of the waste materials in the Nanjido Landfill, we conducted a multivariate statistical analysis of a geophysical data set comprising magnetic, gravity, Landsat TM thermal band, and surface depression measurement data. Because these data sets respond differently to depth, we transformed the observed total-field magnetic data and gravity data into residual reduced-to-pole (RTP) magnetic anomalies and three-dimensional density anomalies, respectively, and used only the information about the shallow upper part of the landfill in the subsequent processing. For the statistical analysis at the depression measurement points, the magnetic, density, and Landsat data values at these points were determined by interpolation. The multivariate statistical analysis technique uses a clustering algorithm to classify the data set, and we measured the dissimilarity between objects by Euclidean distance; standardization was therefore applied before the distance calculation to eliminate scaling effects caused by the different measurement units of each data set. A hierarchical grouping technique was used to construct the dendrogram. The optimum number of statistical groups (clusters), classified on the basis of geophysical and geotechnical characteristics, appeared to be six on the resulting dendrogram. The results of this study suggest that the dimensions and nature of multicomponent waste landfills can be identified by applying the multivariate statistical analysis technique to integrated geophysical data sets.
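
The effect of standardizing before computing Euclidean distances can be sketched as follows. The two-variable toy measurements and the function name are hypothetical; the sketch assumes z-score standardization with the population standard deviation, which the abstract does not specify.

```python
import math
from statistics import fmean, pstdev

def standardize(rows):
    """Scale every column to zero mean and unit standard deviation.
    Assumes no column is constant (a zero deviation would divide by zero)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical points: column 0 in large units, column 1 in small units.
points = [[100.0, 0.1], [300.0, 0.2], [200.0, 0.9]]
scaled = standardize(points)

raw_01 = math.dist(points[0], points[1])
raw_02 = math.dist(points[0], points[2])
std_01 = math.dist(scaled[0], scaled[1])
std_02 = math.dist(scaled[0], scaled[2])
```

On the raw values the first column dominates and point 2 looks closest to point 0; after standardization both variables contribute comparably and point 1 becomes the nearer neighbor, so the clustering no longer depends on the measurement units.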
