
Deriving a New Divergence Measure from Extended Cross-Entropy Error Function

  • Oh, Sang-Hoon (Division of Information Communication Engineering Mokwon University) ;
  • Wakuya, Hiroshi (Graduate School of Science and Engineering Saga University) ;
  • Park, Sun-Gyu (Division of Architecture Mokwon University) ;
  • Noh, Hwang-Woo (Department of Visual Design Hanbat National University) ;
  • Yoo, Jae-Soo (School of Information and Communication Engineering Chungbuk National University) ;
  • Min, Byung-Won (Division of Information Communication Engineering Mokwon University) ;
  • Oh, Yong-Sun (Division of Information Communication Engineering Mokwon University)
  • Received : 2015.04.13
  • Accepted : 2015.06.01
  • Published : 2015.06.28

Abstract

Relative entropy is a divergence measure between two probability density functions of a random variable. Assuming that the random variable has only two alphabets, the relative entropy becomes the cross-entropy error function, which can accelerate the training convergence of multi-layer perceptron neural networks. Also, the n-th order extension of cross-entropy (nCE) error function exhibits improved performance in terms of learning convergence and generalization capability. In this paper, we derive a new divergence measure between two probability density functions from the nCE error function, and we compare the new divergence measure with the relative entropy through the use of three-dimensional plots.


1. INTRODUCTION

Multi-layer perceptron (MLP) neural networks can approximate any function given a sufficient number of hidden nodes [1]-[3], and this has broadened the application of MLPs to fields such as pattern recognition, speech recognition, time series prediction, and bioinformatics. MLPs are usually trained with the error back-propagation (EBP) algorithm, which minimizes the mean-squared error (MSE) function between the outputs of the MLP and their desired values [4]. However, the EBP algorithm suffers from slow learning convergence and poor generalization performance [5], [6], which are due to the incorrect saturation of output nodes and overspecialization to training samples [6].
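As a concrete illustration of this standard training setup, the following minimal Python sketch trains a tiny MLP with EBP to minimize the MSE function on the XOR problem. The network size, learning rate, and number of epochs are illustrative assumptions rather than values from the paper, and convergence on this toy problem depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """Unipolar sigmoid activation in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Tiny MLP: 2 inputs -> 4 hidden -> 1 output, trained by EBP on XOR with the MSE function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(20000):
    # Forward pass.
    H = sigmoid(X @ W1 + b1)          # hidden activations
    Y = sigmoid(H @ W2 + b2)          # output activations
    # Backward pass: gradients of the MSE function 0.5 * sum((T - Y)**2).
    dY = (Y - T) * Y * (1 - Y)        # output-layer error signal
    dH = (dY @ W2.T) * H * (1 - H)    # hidden-layer error signal
    # Weight updates (batch gradient descent).
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 3))   # outputs should approach the XOR targets
```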

Sigmoidal functions are usually adopted as the activation functions of nodes in an MLP. A sigmoidal activation function can be divided into a central linear region and two outer saturated regions. When an output node of an MLP lies in the saturated region of the sigmoidal activation function opposite to its desired value, we say the output node is “incorrectly saturated.” Incorrect saturation makes the weight updates small, and consequently learning convergence becomes slow. Also, when an MLP is trained too long on its training samples, it becomes overspecialized to them, and its generalization performance on untrained test samples deteriorates.

The cross-entropy (CE) error function accelerates the EBP algorithm by decreasing the incorrect saturation of output nodes [5]. Furthermore, the n-th order extension of cross-entropy (nCE) error function attains accelerated learning convergence and improved generalization capability by decreasing incorrect saturation as well as preventing overspecialization to training samples [6].
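The effect of incorrect saturation on the two error functions can be seen directly from their output-node error signals. The sketch below (our illustration of the standard unipolar formulas, not code from [5] or [6]) compares the MSE error signal, which contains the sigmoid derivative y(1-y) and nearly vanishes at an incorrectly saturated output, with the CE error signal, in which the sigmoid derivative cancels.

```python
import numpy as np

def sigmoid(net):
    """Unipolar sigmoid activation in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def delta_mse(t, y):
    """Output-node error signal for the MSE function:
    -dE/dnet = (t - y) * y * (1 - y); shrinks when y saturates."""
    return (t - y) * y * (1.0 - y)

def delta_ce(t, y):
    """Output-node error signal for the CE function:
    the sigmoid derivative cancels, leaving (t - y)."""
    return t - y

# Incorrectly saturated output: desired value 1, but the net input is very negative.
t = 1.0
y = sigmoid(-8.0)            # y is close to 0, opposite to the target
print(f"y = {y:.5f}")
print(f"MSE error signal: {delta_mse(t, y):.6f}")   # nearly zero -> slow learning
print(f"CE  error signal: {delta_ce(t, y):.6f}")    # close to 1 -> strong update
```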

Information theory has played a significant role in the neural network community. An information-theoretic view provides many learning rules for improved performance of neural networks, such as minimum class-entropy, entropy minimization, and feature extraction using information-theoretic learning [7]-[11]. Also, information theory can serve as a basis for constructing neural networks [12]. An upper bound on the probability of error was derived based on Renyi’s entropy [13]. Maximizing the information content of hidden nodes has also been developed for better performance of MLPs [14], [15]. In this paper, we focus on the relationship between the relative entropy and the CE error function.

Relative entropy is a divergence measure between two probability density functions [16]. Assuming that a random variable has only two alphabets, the relative entropy becomes the cross-entropy (CE) error function, which can accelerate the learning convergence of MLPs. Since the nCE error function is an extension of the CE error function, there should be a divergence measure corresponding to the nCE error function, just as there is for CE. In this sense, this paper derives a new divergence measure from the nCE error function. In Section 2, the relationship between the relative entropy and CE is introduced. Section 3 derives a new divergence measure from the nCE error function and compares it with the relative entropy. Finally, Section 4 concludes this paper.

 

2. RELATIVE ENTROPY AND CROSS-ENTROPY

Consider a random variable x whose probability density function (p.d.f.) is p(x). When the p.d.f. of x is estimated by q(x), we need to measure how accurate the estimation is. For this purpose, the relative entropy is defined by

$$D(p\|q)=\sum_{x} p(x)\log\frac{p(x)}{q(x)} \qquad (1)$$

as a divergence measure between p(x) and q(x) [16]. Let us assume that the random variable x has only two alphabets, 0 and 1, in which the probabilities are

$$p(x=1)=p, \qquad p(x=0)=1-p. \qquad (2)$$

Also, the estimated probabilities under q(x) are

$$q(x=1)=q, \qquad q(x=0)=1-q. \qquad (3)$$

Then,

$$D(p\|q)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}=-H(p)+\bigl[-p\log q-(1-p)\log(1-q)\bigr]. \qquad (4)$$
Here,

$$H(p)=-p\log p-(1-p)\log(1-p) \qquad (5)$$
is the entropy of a random variable x with two alphabets and

$$-p\log q-(1-p)\log(1-q) \qquad (6)$$

is the cross-entropy. If we assume that ‘q’ corresponds to the real output value ‘y’ of an MLP output node and ‘p’ corresponds to its desired value ‘t’, we can define the cross-entropy error function as

$$E_{CE}=-t\log y-(1-t)\log(1-y). \qquad (7)$$
Thus, the cross-entropy error function is a specific case of the relative entropy under the assumption that the random variable has only two alphabets [15].
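The decomposition above can be checked numerically. The following short Python sketch (an illustrative check; natural logarithms and a single output node are assumed) evaluates the two-alphabet relative entropy, entropy, and cross-entropy, and verifies that D(p||q) equals the cross-entropy minus H(p) as in Eq. (4); with p and q renamed t and y it also gives the CE error function of Eq. (7).

```python
import numpy as np

def entropy(p):
    """Entropy H(p) of a two-alphabet random variable, Eq. (5)."""
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def cross_entropy(p, q):
    """Cross-entropy between the true and estimated distributions, Eq. (6)."""
    return -p * np.log(q) - (1.0 - p) * np.log(1.0 - q)

def relative_entropy(p, q):
    """Two-alphabet relative entropy D(p||q), the expansion in Eq. (4)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

p, q = 0.8, 0.3
print(f"D(p||q)              = {relative_entropy(p, q):.6f}")
print(f"cross-entropy - H(p) = {cross_entropy(p, q) - entropy(p):.6f}")
# The CE error function, Eq. (7), is the cross-entropy with p -> t and q -> y.
t, y = p, q
print(f"E_CE(t, y)           = {cross_entropy(t, y):.6f}")
```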

We can use the unipolar [0, 1] mode or the bipolar [-1, +1] mode to describe node values of MLPs. Since ‘t’ and ‘y’ correspond to ‘p’ and ‘q’, respectively, they lie in the range [0, 1]. Thus, the relationship between the relative entropy and the cross-entropy error function is based on the unipolar mode of node values.

 

3. NEW DIVERGENCE MEASURE FROM THE n-th ORDER EXTENSION OF CROSS-ENTROPY

The n-th order extension of cross-entropy (nCE) error function was proposed based on the bipolar mode of node values as [6]

where n is a natural number. In order to derive a new divergence measure from the nCE error function based on the relationship between the relative entropy and the CE error function, we need a unipolar-mode formulation of the nCE error function. That is derived as

We will derive new divergence measures from Eq. (9) with n=2 and 4.
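The bipolar and unipolar modes are related by the standard affine change of variables; it is assumed here that this mapping underlies the unipolar reformulation in Eq. (9):

```latex
% Assumed change of variables between bipolar node values in [-1, +1]
% and unipolar node values in [0, 1]:
\[
  \bar{t} = 2t - 1, \qquad \bar{y} = 2y - 1,
  \qquad t,\, y \in [0,\,1], \quad \bar{t},\, \bar{y} \in [-1,\,+1].
\]
```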

When n=2, the nCE error function given by Eq. (9) becomes

where

and

By substituting Eqs. (11) and (12) into Eq. (10),

In order to derive a new divergence measure corresponding to nCE (n=2), t and y are replaced with p and q, respectively. This is the reverse of the procedure used to derive Eq. (7) from Eq. (6), where ‘p’ and ‘q’ were replaced with ‘t’ and ‘y’, respectively. Then, we obtain

Thus, by analogy with the last expression in Eq. (4), the new divergence measure is obtained as

where

When n=4, the nCE error function given by Eq. (9) is

where

and

Substituting Eqs. (18), (19), (20), (21), and (22) into Eq. (17),

By replacing t and y with p and q, respectively, we obtain

Thus, by analogy with the last expression in Eq. (4), the new divergence measure is obtained as

where

In order to compare the new divergence measures given by Eqs. (15) and (25) with the relative entropy given by Eq. (4), we plot them over the range in which p and q lie in [0, 1]. Fig. 1 shows the three-dimensional plot of the relative entropy D(p||q). The x and y axes correspond to p and q, respectively, and the z axis corresponds to D(p||q). D(p||q) attains its minimum of zero when p=q and increases as p moves away from q. Since D(p||q) is a divergence measure, it is not symmetric.

Fig. 1. The three-dimensional plot of relative entropy D(p||q) with two alphabets
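A surface such as Fig. 1 can be reproduced with a few lines of code. The sketch below (our reconstruction, not the authors' plotting script) evaluates the two-alphabet relative entropy of Eq. (4) on a grid over p, q in [0, 1] and renders it as a three-dimensional surface; the surfaces of F(p||q;n=2) and F(p||q;n=4) in Figs. 2 and 3 would be produced in the same way by substituting Eqs. (15) and (25) for the evaluated function.

```python
import numpy as np
import matplotlib.pyplot as plt

def relative_entropy(p, q):
    """Two-alphabet relative entropy D(p||q) of Eq. (4)."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

# Grid over the open interval to avoid log(0) at the boundaries.
eps = 1e-3
p, q = np.meshgrid(np.linspace(eps, 1.0 - eps, 200),
                   np.linspace(eps, 1.0 - eps, 200), indexing="ij")
D = relative_entropy(p, q)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(p, q, D, cmap="viridis")
ax.set_xlabel("p")
ax.set_ylabel("q")
ax.set_zlabel("D(p||q)")
plt.show()
```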

Fig. 2 shows the three-dimensional plot of the new divergence measure F(p||q;n=2) given by Eq. (15). F(p||q;n=2) attains its minimum of zero when p=q and increases as p moves away from q, as D(p||q) does. Furthermore, F(p||q;n=2) is flatter than D(p||q). Likewise, the three-dimensional plot of F(p||q;n=4) shown in Fig. 3 attains its minimum of zero when p=q and is flatter than that of F(p||q;n=2) shown in Fig. 2. Thus, increasing the order n of the new divergence measure makes it flatter.

Fig. 2. The three-dimensional plot of the new divergence measure with two alphabets when n=2, F(p||q;n=2)

Fig. 3. The three-dimensional plot of the new divergence measure with two alphabets when n=4, F(p||q;n=4)

When MLPs are applied to pattern classification, the optimal outputs of an MLP trained with various error functions were derived in [6] and [18]; we plot them in Fig. 4. The optimal output of an MLP trained with the CE error function is a first-order function of the a posteriori probability that a certain input sample belongs to a specific class. When the nCE error function with n=2 is used for training, as shown in Fig. 4, the optimal output is flatter than in the CE case, and the nCE error function with n=4 yields an optimal output that is flatter still. The two-dimensional contour plots of the CE and nCE error functions show the same property [17]. Thus, we can argue that the flatness property of the divergence measures derived from CE and nCE coincides with the two-dimensional contour plots of the CE and nCE error functions in [17] and with the optimal outputs in [6] and [18].

Fig. 4. Optimal outputs of MLPs. Here, Q(x) denotes the a posteriori probability that a certain input x belongs to a specific class
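As a brief sketch of why the CE-based optimal output in Fig. 4 is a first-order function of the a posteriori probability, consider the expected CE error at a fixed input x, written here in the unipolar notation of Eq. (7) (a standard argument consistent with the derivations in [6] and [18]):

```latex
% Expected CE error for a single output node at a fixed input x;
% Q(x) is the a posteriori probability that x belongs to the target class.
\[
  \mathrm{E}\!\left[E_{CE}\mid x\right]
    = -\,Q(x)\log y \;-\; \bigl(1-Q(x)\bigr)\log(1-y).
\]
% Setting the derivative with respect to y to zero:
\[
  \frac{\partial}{\partial y}\,\mathrm{E}\!\left[E_{CE}\mid x\right]
    = -\frac{Q(x)}{y} + \frac{1-Q(x)}{1-y} = 0
  \quad\Longrightarrow\quad
  y^{*} = Q(x).
\]
```

Thus the CE-optimal output follows Q(x) directly, whereas the nCE-optimal outputs in Fig. 4 are flatter functions of Q(x).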

 

4. CONCLUSIONS

In this paper, we introduce the relationship between the relative entropy and the CE error function. When a random variable has only two alphabets, the relative entropy becomes the cross-entropy. Based on this relationship, we derive a new divergence measure from the nCE error function. Comparing the three-dimensional plots of the relative entropy and the new divergence measure with n=2 and n=4, we can argue that increasing the order n of the new divergence measure has the effect of flattening it. This property coincides with the previous results comparing the optimal outputs and contour plots of the CE and nCE error functions.

References

  1. K. Hornik, M. Stinchcombe, and H. White, “Multilayer Feed-forward Networks are Universal Approximators,” Neural Networks, vol. 2, 1989, pp. 359-366. https://doi.org/10.1016/0893-6080(89)90020-8
  2. K. Hornik, “Approximation Capabilities of Multilayer Feedforward Networks,” Neural Networks, vol. 4, 1991, pp. 251-257. https://doi.org/10.1016/0893-6080(91)90009-T
  3. S. Suzuki, “Constructive Function Approximation by Three-Layer Artificial Neural Networks,” Neural Networks, vol. 11, 1998, pp. 1049-1058. https://doi.org/10.1016/S0893-6080(98)00068-9
  4. D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986.
  5. A. van Ooyen and B. Nienhuis, “Improving the Convergence of the Backpropagation Algorithm,” Neural Networks, vol. 5, 1992, pp. 465-471. https://doi.org/10.1016/0893-6080(92)90008-7
  6. S.-H. Oh, “Improving the Error Back-Propagation Algorithm with a Modified Error Function,” IEEE Trans. Neural Networks, vol. 8, 1997, pp. 799-803. https://doi.org/10.1109/72.572117
  7. A. El-Jaroudi and J. Makhoul, "A New Error Criterion for Posterior Probability Estimation with Neural Nets," Proc. IJCNN'90, vol. III, Jun. 1990, pp. 185-192.
  8. M. Bichsel and P. Seitz, “Minimum Class Entropy: A Maximum Information Approach to Layered Networks,” Neural Networks, vol. 2, 1989, pp. 133-141. https://doi.org/10.1016/0893-6080(89)90030-0
  9. S. Ridella, S. Rovetta, and R. Zunino, “Representation and Generalization Properties of Class-Entropy Networks,” IEEE Trans. Neural Networks, vol. 10, 1999, pp. 31-47. https://doi.org/10.1109/72.737491
  10. D. Erdogmus and J. C. Principe, "Entropy Minimization Algorithm for Multilayer Perceptrons," Proc. IJCNN'01, vol. 4, 2001, pp. 3003-3008.
  11. K. E. Hild II, D. Erdogmus, K. Torkkola, and J. C. Principe, “Feature Extraction Using Information-Theoretic Learning,” IEEE Trans. PAMI, vol. 28, no. 9, 2006, pp. 1385-1392. https://doi.org/10.1109/TPAMI.2006.186
  12. S.-J. Lee, M.-T. Jone, and H.-L. Tsai, “Constructing Neural Networks for Multiclass-Discretization Based on Information Theory,” IEEE Trans. Sys., Man, and Cyb.- Part B, vol. 29, 1999, pp. 445-453. https://doi.org/10.1109/3477.764881
  13. D. Erdogmus and J. C. Principe, "Information Transfer Through Classifiers and Its Relation to Probability of Error," Proc. IJCNN'01, vol. 1, 2001, pp. 50-54.
  14. R. Kamimura and S. Nakanishi, “Hidden Information Maximization for Feature Detection and Rule Discovery,” Network: Computation in Neural Systems, vol. 6, 1995, pp. 577-602. https://doi.org/10.1088/0954-898X_6_4_004
  15. K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information," Proc. IJCNN'01, vol. 4, 2001, pp. 2756-2761.
  16. T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
  17. S.-H. Oh, “Contour Plots of Objective Functions for FeedForward Neural Networks,” Int. Journal of Contents, vol. 8, no. 4, Dec. 2012, pp. 30-35. https://doi.org/10.5392/IJoC.2012.8.4.030
  18. S.-H. Oh, “Statistical Analyses of Various Error Functions For Pattern Classifiers,” CCIS, vol. 206, 2011, pp. 129-133.