DOI: http://dx.doi.org/10.5626/JCSE.2013.7.2.99

Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning  

Sugiyama, Masashi (Department of Computer Science, Tokyo Institute of Technology)
Liu, Song (Department of Computer Science, Tokyo Institute of Technology)
du Plessis, Marthinus Christoffel (Department of Computer Science, Tokyo Institute of Technology)
Yamanaka, Masao (Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology)
Yamada, Makoto (NTT Communication Science Laboratories, NTT Corporation)
Suzuki, Taiji (Department of Mathematical Informatics, The University of Tokyo)
Kanamori, Takafumi (Department of Computer Science and Mathematical Informatics, Nagoya University)
Publication Information
Journal of Computing Science and Engineering / v.7, no.2, 2013, pp. 99-111
Abstract
Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than the naive two-step approach of first estimating the probability distributions and then computing the divergence between the estimates. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because they can be approximated in a computationally efficient way, with high numerical stability and superior robustness against outliers.
Keywords
Machine learning; Probability distributions; Kullback-Leibler divergence; Pearson divergence; $L^2$-distance