DOI: http://dx.doi.org/10.5626/JCSE.2013.7.2.99

Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning  

Sugiyama, Masashi (Department of Computer Science, Tokyo Institute of Technology)
Liu, Song (Department of Computer Science, Tokyo Institute of Technology)
du Plessis, Marthinus Christoffel (Department of Computer Science, Tokyo Institute of Technology)
Yamanaka, Masao (Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology)
Yamada, Makoto (NTT Communication Science Laboratories, NTT Corporation)
Suzuki, Taiji (Department of Mathematical Informatics, The University of Tokyo)
Kanamori, Takafumi (Department of Computer Science and Mathematical Informatics, Nagoya University)
Publication Information
Journal of Computing Science and Engineering / v.7, no.2, 2013, pp. 99-111
Abstract
Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than the naive two-step approach of first estimating the probability distributions and then computing the divergence between the estimates. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because they can be approximated in a computationally efficient way, with high numerical stability and superior robustness against outliers.
Keywords
Machine learning; Probability distributions; Kullback-Leibler divergence; Pearson divergence; $L^2$-distance