• Title/Abstract/Keywords: Pearson divergence

8 search results (processing time: 0.022 s)

Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • Sugiyama, Masashi;Liu, Song;du Plessis, Marthinus Christoffel;Yamanaka, Masao;Yamada, Makoto;Suzuki, Taiji;Kanamori, Takafumi
    • Journal of Computing Science and Engineering / Vol. 7, No. 2 / pp. 99-111 / 2013
  • Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
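The abstract above contrasts the Kullback-Leibler divergence with the Pearson divergence. As a minimal sketch of the two quantities themselves, the following compares them on a pair of discrete distributions whose probabilities are known. This is not the sample-based direct approximation the paper reviews; the function names, the example distributions `p` and `q`, and the 1/2 scaling of the Pearson divergence (one common convention) are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pearson_divergence(p, q):
    """PE(p || q) = 1/2 * sum_i q_i * (p_i / q_i - 1)^2 (assumed 1/2 convention).

    Every q_i must be positive.
    """
    return 0.5 * sum(qi * (pi / qi - 1.0) ** 2 for pi, qi in zip(p, q))


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print("KL(p || q):", kl_divergence(p, q))
print("PE(p || q):", pearson_divergence(p, q))
```

The Pearson divergence is a quadratic function of the density ratio p/q and involves no logarithm, which is one way to see why the abstract describes it as numerically more stable than the Kullback-Leibler divergence.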

Improving the Performance of Document Clustering with Distributional Similarities (분포유사도를 이용한 문헌클러스터링의 성능향상에 대한 연구)

  • 이재윤
    • Journal of the Korean Society for Information Management (정보관리학회지) / Vol. 24, No. 4 / pp. 267-283 / 2007
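  • This study explored whether distributional similarity measures can replace the traditional cosine similarity formula in document clustering. We devised a way to apply three formulas derived from the KL divergence, the representative distributional similarity measure, to document vectors: the Jensen-Shannon divergence, the symmetric skew divergence, and the minimum skew divergence. To verify document clustering performance with distributional similarity, two experiments were designed and run on three test collections. In the first document clustering experiment, the minimum skew divergence clearly outperformed not only cosine similarity but also the other divergence formulas. In the second experiment, the Pearson correlation coefficient was used to derive second-order distributional similarities from the first-order similarity matrix, and document clustering was performed on them. The results showed that second-order distributional similarity generally gave better document clustering performance. If both processing time and classification performance are considered in document clustering, the minimum skew divergence formula proposed in this study is preferable; if only classification performance is considered, the second-order distributional similarity approach is preferable.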
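Of the three divergences named in this abstract, the Jensen-Shannon divergence has a standard textbook definition, so it can be sketched directly. The snippet below treats each document vector as a probability distribution over terms by normalizing its term weights. The term counts and function names are hypothetical, and the symmetric skew and minimum skew divergences are not shown because their exact formulas are not given in the abstract.

```python
import math


def normalize(weights):
    """Turn non-negative term weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m.

    Because m_i > 0 whenever p_i > 0 or q_i > 0, zeros in the document
    vectors cause no division by zero, unlike the plain KL divergence.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical term-frequency vectors for two documents over four terms.
doc_a = normalize([3, 0, 1, 2])
doc_b = normalize([1, 2, 0, 3])

print("JS(doc_a, doc_b):", js_divergence(doc_a, doc_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared terms), so it can be used as a dissimilarity in place of one minus the cosine similarity.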

Empirical Comparisons of Disparity Measures for Partial Association Models in Three Dimensional Contingency Tables

  • Jeong, D.B.;Hong, C.S.;Yoon, S.H.
    • Communications for Statistical Applications and Methods / Vol. 10, No. 1 / pp. 135-144 / 2003
  • This work is concerned with a comparison of the recently developed disparity measures for the partial association model in three dimensional categorical data. Data are generated by simulating each term in the log-linear model equation based on the partial association model, which is the method proposed in this paper. These alternative Monte Carlo methods are explored to study the behavior of disparity measures such as the power divergence statistic I(λ), the Pearson chi-square statistic X$^2$, the likelihood ratio statistic G$^2$, the blended weight chi-square statistic BWCS(λ), the blended weight Hellinger distance statistic BWHD(λ), and the negative exponential disparity statistic NED(λ) for moderate sample sizes. We find that the power divergence statistic I(2/3) and the blended weight Hellinger distance family BWHD(1/9) are the best tests with respect to size and power.
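The power divergence statistic I(λ) mentioned above contains both X$^2$ and G$^2$ as special cases. The sketch below assumes the usual Cressie-Read form, 2/(λ(λ+1)) Σ O[(O/E)^λ − 1], which the abstract itself does not spell out. The observed and expected counts are hypothetical, and the blended weight and negative exponential statistics are not shown.

```python
import math


def power_divergence(observed, expected, lam):
    """Cressie-Read power divergence statistic I(lam) (assumed standard form).

    lam = 1 gives the Pearson chi-square X^2 when the totals of observed
    and expected counts agree; lam = 0 is handled as the limit G^2.
    For negative lam every observed count must be positive.
    """
    if lam == 0:
        # Limit lam -> 0: likelihood ratio statistic G^2.
        return 2.0 * sum(
            o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
        )
    if lam == -1:
        raise ValueError("lam = -1 needs a separate limiting form, not shown here")
    scale = 2.0 / (lam * (lam + 1.0))
    return scale * sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))


# Hypothetical cell counts with equal totals (100 observations each).
observed = [18, 22, 20, 40]
expected = [25.0, 25.0, 25.0, 25.0]

print("X^2    =", power_divergence(observed, expected, 1))
print("G^2    =", power_divergence(observed, expected, 0))
print("I(2/3) =", power_divergence(observed, expected, 2 / 3))
```

Each value would be compared against a chi-square critical value with the degrees of freedom of the fitted model; I(2/3) is the member of the family singled out in the abstract.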

Empirical Comparisons of Disparity Measures for Three Dimensional Log-Linear Models

  • Park, Y.S.;Hong, C.S.;Jeong, D.B.
    • Journal of the Korean Data and Information Science Society / Vol. 17, No. 2 / pp. 543-557 / 2006
  • This paper is concerned with the applicability of the chi-square approximation to six disparity statistics: the Pearson chi-square, the generalized likelihood ratio, the power divergence, the blended weight chi-square, the blended weight Hellinger distance, and the negative exponential disparity statistic. Three dimensional contingency tables of small and moderate sample sizes are generated and fitted to all possible hierarchical log-linear models: the completely independent model, the conditionally independent model, the partial association models, and the model with one variable independent of the other two. For models with direct solutions of expected cell counts, point estimates and confidence intervals of the 90th and 95th percentage points of the six statistics are explored. For models without direct solutions, the empirical significance levels and the empirical powers of the six statistics for testing the significance of the three-factor interaction are computed and compared.

Generalized Measure of Departure From Global Symmetry for Square Contingency Tables with Ordered Categories

  • Tomizawa, Sadao;Saitoh, Kayo
    • Journal of the Korean Statistical Society / Vol. 27, No. 3 / pp. 289-303 / 1998
  • For square contingency tables with ordered categories, Tomizawa (1995) considered two kinds of measures to represent the degree of departure from global symmetry, which means that the probability that an observation will fall in one of the cells in the upper-right triangle of the square table is equal to the probability that the observation falls in one of the cells in the lower-left triangle. This paper proposes a generalization of those measures. The proposed measure is expressed by using Cressie and Read's (1984) power divergence or Patil and Taillie's (1982) diversity index. Special cases of the proposed measure include Tomizawa's measures. The proposed measure would be useful for comparing the degree of departure from global symmetry in several tables.

A Monte Carlo Comparison of the Small Sample Behavior of Disparity Measures (소표본에서 차이측도 통계량의 비교연구)

  • 홍종선;정동빈;박용석
    • The Korean Journal of Applied Statistics (응용통계연구) / Vol. 16, No. 2 / pp. 455-467 / 2003
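  • Many studies have examined whether the chi-square approximation is applicable to goodness-of-fit test statistics for small-sample contingency table data. Extending the work of Rudas (1986), which compared three test statistics in small samples (the Pearson chi-square X$^2$, the generalized likelihood ratio G$^2$, and the power divergence I(2/3) statistics), we include the recently proposed disparity measures (the BWHD(1/9), BWCS(1/3), and NED(4/3) test statistics) in a comparative analysis. Through simulations for two-dimensional contingency tables under the independence model and three-dimensional contingency tables under the conditional independence model and the one-variable independence model, we examine the simulated 90th and 95th percentiles and their corresponding 95% confidence intervals and compare them with the actual percentiles. The X$^2$, I(2/3), and BWHD(1/9) test statistics showed similar results, and we found that the chi-square approximation can be applied to these statistics at smaller sample sizes than to the previously proposed test statistics.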

The Effect of Conflict with the Apparel Manufacturer on Satisfaction of the Franchised Agency in the Apparel Industry

  • Jung, Chan-Jean;Kim, Soo-Jin;Ju, Seong-Rae
    • The International Journal of Costume Culture / Vol. 3, No. 1 / pp. 41-52 / 2000
  • The purposes of this study are (1) to identify types and levels of channel conflicts between an apparel manufacturer and a franchised agency, (2) to investigate the effect of economic dependence on conflicts, and (3) to examine the effect of conflicts on satisfaction from a franchised agency's perspective in the distribution channel of the Korean apparel industry. For this study, questionnaires were administered to the owners or managers of 300 franchised agencies. Using a sample of 209, data were analyzed by means, factor analysis, Pearson correlation, and multiple regression analysis. Major findings are as follows: 1) Types of conflicts between apparel manufacturers and franchised agencies are identified as goal divergence, difference in perception, ineffective communication, and lack of role clarity. The highest level of conflict is lack of role clarity, followed by goal divergence, difference in perception, and ineffective communication. 2) Economic dependence leads to channel conflicts in part. Greater levels of economic dependence foster greater conflicts such as lack of role clarity and lower conflicts such as ineffective communication. 3) With respect to the effect of conflict on satisfaction, the greater the level of conflict, the lower the degree of satisfaction with role performance and with business decisions, and the lower the overall satisfaction.

A Study for Spatial Distribution of Principal Pollutants in Daegu Area Using Air Pollution Monitoring Network Data (도시대기측정망 자료를 이용한 대구지역 대기오염물질의 공간분포에 관한 연구)

  • 주재희;황인조
    • Journal of Korean Society for Atmospheric Environment (한국대기환경학회지) / Vol. 27, No. 5 / pp. 545-557 / 2011
  • The objective of this study was to estimate the trends of each pollutant using air pollution monitoring network data from January 2005 to December 2008 in the Daegu area. The spatial characteristics of each pollutant were also determined using Pearson correlation coefficients and the COD (coefficient of divergence). In this study, the trends of hourly, monthly, seasonal, and total average concentrations of each pollutant at the 10 sites were analyzed. The Ihyeon site showed the highest concentrations of $SO_2$, $NO_2$, and PM10. In the case of $O_3$, the Jisan site showed the highest concentration among the sites. The industrial area showed the highest concentrations of $SO_2$, CO, and PM10, whereas $NO_2$ was highest in the commercial area. The IDW (inverse distance weighting) method was used to estimate the characteristics of the spatial distribution, and the results identify the spatial distribution of each pollutant. The Pearson correlation coefficients and COD values describe the spatial variability among the monitoring sites. The COD of each pollutant was very low for all site pairs, whereas the Pearson correlation coefficients were high for all site pairs. Finally, analysis of spatial variability can be used to characterize the spatial uniformity and similarity of the concentrations of each pollutant.
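The abstract pairs the COD with the Pearson correlation coefficient to compare monitoring sites. The sketch below assumes a commonly used COD definition, the root mean square of (x − y)/(x + y) over paired observations, which the abstract does not state. The two concentration series are hypothetical and are not data from the study.

```python
import math


def coefficient_of_divergence(x, y):
    """COD = sqrt(mean(((x_i - y_i) / (x_i + y_i))^2)) (assumed definition).

    0 means identical concentrations at both sites; values near 1 mean
    very different concentrations. Inputs must be positive.
    """
    p = len(x)
    return math.sqrt(sum(((a - b) / (a + b)) ** 2 for a, b in zip(x, y)) / p)


def pearson_correlation(x, y):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical concentrations at two sites; site_b runs 10% above site_a.
site_a = [40.0, 55.0, 62.0, 48.0]
site_b = [44.0, 60.5, 68.2, 52.8]

print("COD:", coefficient_of_divergence(site_a, site_b))
print("Pearson r:", pearson_correlation(site_a, site_b))
```

In this toy case the COD is small and the correlation is high, the same pattern the abstract reports for all site pairs: the two sites differ little in level and vary together over time.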