• Title/Summary/Keyword: Pearson divergence

Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • Sugiyama, Masashi;Liu, Song;du Plessis, Marthinus Christoffel;Yamanaka, Masao;Yamada, Makoto;Suzuki, Taiji;Kanamori, Takafumi
    • Journal of Computing Science and Engineering / v.7 no.2 / pp.99-111 / 2013
  • Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
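
As a rough illustration of the divergence measures named in the abstract above, the following sketch evaluates the Kullback-Leibler and Pearson divergences for two small discrete distributions whose probabilities are already known. It is not the direct, sample-based approximation that the paper reviews. The function names, the example vectors `p` and `q`, and the 1/2 scaling of the Pearson divergence are assumptions made here for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pearson_divergence(p, q):
    """PE(p || q) = 0.5 * sum_i q_i * (p_i / q_i - 1)^2; assumes every q_i > 0."""
    return 0.5 * sum(qi * (pi / qi - 1.0) ** 2 for pi, qi in zip(p, q))

# Hypothetical example distributions (illustrative values, not from the paper).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print("KL(p || q) =", kl_divergence(p, q))
print("PE(p || q) =", pearson_divergence(p, q))
```

Both quantities are zero when the two distributions coincide. The Pearson divergence involves only the squared ratio `p_i / q_i` and no logarithm.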

Improving the Performance of Document Clustering with Distributional Similarities (분포유사도를 이용한 문헌클러스터링의 성능향상에 대한 연구)

  • Lee, Jae-Yun
    • Journal of the Korean Society for Information Management / v.24 no.4 / pp.267-283 / 2007
  • In this study, measures of distributional similarity such as KL-divergence are applied to cluster documents instead of the traditional cosine measure, which is the most prevalent vector similarity measure for document clustering. Three variations of KL-divergence are investigated: Jensen-Shannon divergence, symmetric skew divergence, and minimum skew divergence. In order to verify the contribution of distributional similarities to document clustering, two experiments are designed and carried out on three test collections. In the first experiment, the clustering performances of the three divergence measures are compared to that of the cosine measure. The results showed that minimum skew divergence outperformed the other divergence measures as well as the cosine measure. In the second experiment, second-order distributional similarities are calculated with the Pearson correlation coefficient from the first-order similarity matrices. From the results of the second experiment, second-order distributional similarities were found to improve the overall performance of document clustering. These results suggest that minimum skew divergence must be selected as the document vector similarity measure when considering both time and accuracy, and that second-order similarity is a good choice when considering clustering accuracy only.
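
To make the divergence variants in the abstract above more concrete, here is a minimal sketch of the Jensen-Shannon divergence and of an alpha-skew divergence between two term distributions. The skew divergence is written in the commonly cited form KL(r || alpha*q + (1-alpha)*r), which is an assumption here. The abstract does not define the symmetric and minimum skew variants, so they are not reproduced. The function names and the vectors `doc_a` and `doc_b` are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q); raises ZeroDivisionError if some q_i == 0 while p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Average KL divergence of p and q to their midpoint m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def skew_divergence(q, r, alpha=0.99):
    """Assumed form: KL(r || alpha*q + (1-alpha)*r), finite for alpha < 1."""
    mix = [alpha * qi + (1 - alpha) * ri for qi, ri in zip(q, r)]
    return kl(r, mix)

# Hypothetical term distributions of two documents with non-overlapping terms.
doc_a = [0.7, 0.3, 0.0]
doc_b = [0.0, 0.4, 0.6]

print("JS   =", jensen_shannon(doc_a, doc_b))
print("skew =", skew_divergence(doc_a, doc_b))
```

Plain KL-divergence is undefined for `doc_a` and `doc_b` because each has a zero where the other has mass. Both variants avoid this by comparing against a mixture of the two distributions.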

Empirical Comparisons of Disparity Measures for Partial Association Models in Three Dimensional Contingency Tables

  • Jeong, D.B.;Hong, C.S.;Yoon, S.H.
    • Communications for Statistical Applications and Methods / v.10 no.1 / pp.135-144 / 2003
  • This work is concerned with the comparison of recently developed disparity measures for the partial association model in three-dimensional categorical data. Data are generated by simulating each term in the log-linear model equation based on the partial association model, which is the method proposed in this paper. These alternative Monte Carlo methods are explored to study the behavior of disparity measures such as the power divergence statistic I(λ), the Pearson chi-square statistic X$^2$, the likelihood ratio statistic G$^2$, the blended weight chi-square statistic BWCS(λ), the blended weight Hellinger distance statistic BWHD(λ), and the negative exponential disparity statistic NED(λ) for moderate sample sizes. We find that the power divergence statistic I(2/3) and the blended weight Hellinger distance family BWHD(1/9) are the best tests with respect to size and power.
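
The relationship between several of the statistics listed in the abstract above can be sketched with the Cressie-Read power divergence family, assuming the usual form 2/(λ(λ+1)) Σ O_i[(O_i/E_i)^λ − 1]. In this form λ = 1 gives the Pearson chi-square X$^2$ and the limit λ → 0 gives the likelihood ratio G$^2$. The function name, the example counts, and the handling of the limiting cases are assumptions for illustration. The BWCS, BWHD, and NED statistics are not covered.

```python
import math

def power_divergence(observed, expected, lam):
    """Cressie-Read power divergence statistic for matching observed/expected counts."""
    pairs = list(zip(observed, expected))
    if lam == 0:   # limiting case: likelihood ratio statistic G^2
        return 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
    if lam == -1:  # limiting case; requires strictly positive observed counts
        return 2.0 * sum(e * math.log(e / o) for o, e in pairs)
    scale = 2.0 / (lam * (lam + 1.0))
    return scale * sum(o * ((o / e) ** lam - 1.0) for o, e in pairs)

# Hypothetical cell counts with equal totals (60 observations in 3 cells).
observed = [10, 20, 30]
expected = [20, 20, 20]

pearson_x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print("X^2    =", pearson_x2)
print("I(1)   =", power_divergence(observed, expected, 1.0))
print("G^2    =", power_divergence(observed, expected, 0.0))
print("I(2/3) =", power_divergence(observed, expected, 2.0 / 3.0))
```

The identity between I(1) and X$^2$ relies on the observed and expected totals being equal, as they are for fitted log-linear models.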

Empirical Comparisons of Disparity Measures for Three Dimensional Log-Linear Models

  • Park, Y.S.;Hong, C.S.;Jeong, D.B.
    • Journal of the Korean Data and Information Science Society / v.17 no.2 / pp.543-557 / 2006
  • This paper is concerned with the applicability of the chi-square approximation to six disparity statistics: the Pearson chi-square, the generalized likelihood ratio, the power divergence, the blended weight chi-square, the blended weight Hellinger distance, and the negative exponential disparity statistic. Three-dimensional contingency tables of small and moderate sample sizes are generated to be fitted to all possible hierarchical log-linear models: the completely independent model, the conditionally independent model, the partial association models, and the model with one variable independent of the other two. For models with direct solutions of expected cell counts, point estimates and confidence intervals of the 90 and 95 percentage points of the six statistics are explored. For models without direct solutions, the empirical significance levels and the empirical powers of the six statistics to test the significance of the three-factor interaction are computed and compared.

Generalized Measure of Departure From Global Symmetry for Square Contingency Tables with Ordered Categories

  • Tomizawa, Sadao;Saitoh, Kayo
    • Journal of the Korean Statistical Society / v.27 no.3 / pp.289-303 / 1998
  • For square contingency tables with ordered categories, Tomizawa (1995) considered two kinds of measures to represent the degree of departure from global symmetry, which means that the probability that an observation will fall in one of the cells in the upper-right triangle of the square table is equal to the probability that the observation falls in one of the cells in the lower-left triangle. This paper proposes a generalization of those measures. The proposed measure is expressed by using Cressie and Read's (1984) power divergence or Patil and Taillie's (1982) diversity index. Special cases of the proposed measure include Tomizawa's measures. The proposed measure would be useful for comparing the degree of departure from global symmetry in several tables.

A Monte Carlo Comparison of the Small Sample Behavior of Disparity Measures (소표본에서 차이측도 통계량의 비교연구)

  • 홍종선;정동빈;박용석
    • The Korean Journal of Applied Statistics / v.16 no.2 / pp.455-467 / 2003
  • There has been a long debate on the applicability of the chi-square approximation to statistics based on small sample sizes. Extending the comparison results among the Pearson chi-square Χ$^2$, generalized likelihood ratio G$^2$, and power divergence I(2/3) statistics suggested by Rudas (1986), recently developed disparity statistics (BWHD(1/9), BWCS(1/3), NED(4/3)) are compared and analyzed in this paper. Through Monte Carlo studies of the independence model of two-dimensional contingency tables, and of the conditional model and one-variable independence model of three-dimensional tables, simulated 90 and 95 percentage points and approximate 95% confidence intervals for the true percentage points are obtained. It is found that the Χ$^2$, I(2/3), and BWHD(1/9) test statistics behave very similarly and seem to be more applicable to small sample sizes than the others.

The Effect of Conflict with the Apparel Manufacturer on Satisfaction of the Franchised Agency in the Apparel Industry

  • Jung, Chan-Jean;Kim, Soo-Jin;Ju, Seong-Rae
    • The International Journal of Costume Culture / v.3 no.1 / pp.41-52 / 2000
  • The purposes of this study are (1) to identify types and levels of channel conflicts between an apparel manufacturer and a franchised agency, (2) to investigate the effect of economic dependence on conflicts, and (3) to examine the effect of conflicts on satisfaction from a franchised agency's perspective in the distribution channel of the Korean apparel industry. For this study, questionnaires were administered to the owners or managers of 300 franchised agencies. Employing a sample of 209, data were analyzed using means, factor analysis, Pearson correlation, and multiple regression analysis. Major findings are as follows: 1) Types of conflicts between apparel manufacturers and franchised agencies are identified as goal divergence, difference in perception, ineffective communication, and lack of role clarity. The highest level of conflict is for lack of role clarity, followed by goal divergence, difference in perception, and ineffective communication. 2) Economic dependence leads to channel conflicts in part. Greater levels of economic dependence foster greater conflicts such as lack of role clarity and lower conflicts such as ineffective communication. 3) With respect to the effect of conflict on satisfaction, the greater the level of conflict, the lower the degree of satisfaction with role performance and with business decisions, as well as overall satisfaction.

A Study for Spatial Distribution of Principal Pollutants in Daegu Area Using Air Pollution Monitoring Network Data (도시대기측정망 자료를 이용한 대구지역 대기오염물질의 공간분포에 관한 연구)

  • Ju, Jae-Hee;Hwang, In-Jo
    • Journal of Korean Society for Atmospheric Environment / v.27 no.5 / pp.545-557 / 2011
  • The objective of this study was to estimate the trends of each pollutant using air pollution monitoring network data from January 2005 to December 2008 in the Daegu area. Also, the spatial characteristics of each pollutant were determined using Pearson correlation coefficients and the COD (coefficient of divergence). In this study, the trends of hourly, monthly, seasonal, and total average concentrations of each pollutant for the 10 sites were analyzed. The Ihyeon site showed the highest concentrations of $SO_2$, $NO_2$, and PM10. In the case of $O_3$, the Jisan site showed the highest concentration among the sites. Also, the industrial area presented the highest concentrations of $SO_2$, CO, and PM10. On the other hand, $NO_2$ was highest in the commercial area. The IDW (inverse distance weighting) method was used to estimate the characteristics of the spatial distribution. The results identify the spatial distribution of each pollutant. Also, the Pearson correlation coefficients and COD values describe the spatial variability among the monitoring sites. The COD of each pollutant showed very low values for all of the site pairs. On the other hand, the Pearson correlation coefficients showed high values for all of the site pairs. Finally, analysis of spatial variability can be used to characterize the spatial uniformity and similarity of the concentrations of each pollutant.
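
The two site-comparison measures in the abstract above can be sketched as follows. The COD is assumed to take the form commonly used in air-quality studies, sqrt((1/n) Σ ((x_i − y_i)/(x_i + y_i))^2) over paired concentrations at two sites; the abstract itself does not state the formula. The function names and the concentration series are hypothetical.

```python
import math

def coefficient_of_divergence(x, y):
    """Assumed COD form: 0 for identical series, approaching 1 for very different ones."""
    n = len(x)
    return math.sqrt(sum(((a - b) / (a + b)) ** 2 for a, b in zip(x, y)) / n)

def pearson_correlation(x, y):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired concentrations at two monitoring sites (made-up values).
site_a = [10.0, 20.0, 30.0, 40.0]
site_b = [12.0, 22.0, 33.0, 41.0]

print("COD =", coefficient_of_divergence(site_a, site_b))
print("r   =", pearson_correlation(site_a, site_b))
```

The two measures capture different things. A series and its exact double are perfectly correlated, yet under this formula their COD is 1/3, because COD reflects differences in level while correlation reflects only co-variation.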