1. Introduction
With the rapid development and widespread use of computer and communication technologies, huge volumes of image data are being generated. Object image classification is a fundamental problem in computer vision and image understanding. Its goal is to assign unlabeled images to pre-defined classes according to their semantic content. Over the last decades, methods based on the bag of visual words (BoVW) model [1-5] have achieved good classification performance on many image data sets. As shown in Fig. 1, the BoVW pipeline consists of four major steps: 1) descriptor extraction; 2) feature coding; 3) spatial pooling; and 4) support vector machine (SVM) classification, which assigns an image to its semantic category. A typical setup uses gradient-based local image descriptors such as the scale-invariant feature transform (SIFT) [6], PCA-SIFT [7], and SURF [8]. These descriptors are largely invariant to various image degradations, such as geometric and photometric transformations, which is essential for image categorization. The experimental results in [9] indicate that SIFT is stable in most situations, while SURF is the fastest. Because the conditions in object classification experiments are complicated, the stability of the image features is especially important. We therefore choose the SIFT feature in this paper.
Fig. 1. The BoVW framework for object category
There are several ways to encode the local features (SIFT) and generate feature codes with respect to a visual dictionary, such as vector quantization (VQ) [1-5], sparse coding [10,11], and Fisher kernels [12]. Vector quantization coding treats an image as a collection of unordered appearance descriptors extracted from local patches, quantizes them into discrete “visual words”, and then computes a compact histogram representation for image classification. However, visual words suffer from synonymy and ambiguity [13-15], and the compact histogram representation incurs serious quantization error. In addition, because of image background noise and the limitations of clustering algorithms [16,17], some visual words generated from SIFT features carry little useful image content and reduce the semantic discriminative power of the visual dictionary. These noise words are similar to stop words in text documents, such as “the”, “and”, and “is”. In this context, we call them “visual stop-words”.
Sparse coding computes a spatial-pyramid image representation from sparse codes of SIFT features, instead of the K-means vector quantization used in the traditional BoVW model. Yang et al. [10] proposed the ScSPM method, in which sparse coding replaces vector quantization to obtain nonlinear codes. Wang et al. [11] presented Locality-constrained Linear Coding (LLC), which can be seen as a fast implementation of Local Coordinate Coding (LCC) that uses the locality constraint to project each descriptor into its local coordinate system. Moreover, He et al. [18] proposed spatial pyramid pooling in deep convolutional networks. Furthermore, unlike the original BoVW model, which performs spatial pooling by computing histograms, the sparse coding approaches typically use max spatial pooling [10,11], which is more robust to local spatial translations and more biologically plausible.
The Fisher kernel [12] is a powerful tool for transforming a variable-size set of independent samples into a fixed-size vector representation, assuming that the samples follow a parametric generative model estimated on a training set. The resulting description vector is the gradient of the samples' likelihood with respect to the parameters of this distribution, scaled by the inverse square root of the Fisher information matrix. Jegou et al. [19] proposed the VLAD representation, derived from both the original BoVW model and the Fisher kernel, which aggregates SIFT descriptors into a compact representation. Although several feature coding approaches and spatial pooling schemes exist, the main aim of this article is to address the problems of the traditional vector-quantization-based BoVW model, namely the synonymy and ambiguity of visual words and the presence of “visual stop-words”.
2. Related Work
To overcome the negative effects of the synonymy and ambiguity of visual words, researchers have explored a number of approaches. Philbin et al. [20] presented a BoVW model that builds the visual vocabulary histogram by soft-assignment: each SIFT feature point is assigned to its several nearest visual words, and each word is weighted according to its distance. Gemert et al. [15] established a visual word uncertainty model in which kernel functions perform a soft mapping between local features and visual words. This model efficiently decreases the quantization error and further verifies the effectiveness of soft-assignment in addressing the synonymy and ambiguity of visual words. Li et al. [21] constructed histograms using a contextual information strategy to improve the mapping accuracy, which also reduces, to some extent, the quantization error caused by word synonymy and ambiguity. Weinshall et al. [22] proposed a soft-assignment method based on the Latent Dirichlet Allocation model (LDA+SA). Considering the ambiguity of visual words, Danilo et al. [23] introduced a fuzzy clustering algorithm to perform the soft-assignment of visual words and achieved good results.
Compared with the conventional hard-assignment BoVW model (BoVW+HA) [14], all of the above methods mitigate the synonymy and ambiguity of visual words, reduce the quantization error, and enhance the semantic expressiveness of the histograms. Their weakness is that they all measure the semantic distance between words in the low-level feature space. Because the two metric spaces are inconsistent, visual words that are close in feature space are not necessarily close in semantic space. Furthermore, all these methods [20-23] assign the same number of visual words to every SIFT feature, which introduces new noise and redundant information because local features without ambiguity are forced to map to multiple visual words. Therefore, if the semantic relevance of visual words is measured accurately and the number of soft-assignment words is selected adaptively according to the category of each SIFT feature, both the synonymy and ambiguity of visual words and the serious quantization error can be reduced significantly.
Moreover, removing “visual stop-words” does not cause a significant loss of content but can improve the classification accuracy significantly. Considering the relationship between the information carried by a word and its frequency of appearance, Sivic et al. [1] proposed eliminating “visual stop-words” based on term frequency. Yuan et al. [24] proposed a solution using the “visual phrase” technique, based on an improved frequent itemset mining algorithm and a likelihood ratio test; however, this method considers only the co-occurrence information among visual words and ignores their spatial information. Chen et al. [25] presented a highly discriminative visual phrase (DVP) method that filters noise efficiently and overcomes the loss of feature information in traditional visual phrase construction methods [26]. Roman et al. [27] proposed a methodology for automatically estimating the optimal number of visual words that can be removed from a visual dictionary. This method relies on a special definition of the entropy of each visual word, considered as a random variable, and a new definition of the overlap of class models computed with a normalized Bhattacharyya coefficient.
However, these methods all ignore the interrelationships between visual words and the semantic concepts of different images. As a result, visual words that occur less frequently but are highly discriminative are easily mistaken for “visual stop-words”.
To identify the semantic relevance among visual words more accurately, select the number of soft-assignment words adaptively for different local features, and eliminate “visual stop-words” effectively, this paper proposes a novel bag of visual words method based on PLSA and chi-square model for object category. The main contribution is to mine image semantic topics using the PLSA model and K-L divergence, so that the semantic distance between words is measured accurately. Meanwhile, analyzing the ambiguity of SIFT features allows the soft-assignment between features and homoionyms to be performed more accurately. On this basis, “visual stop-words” are eliminated with the chi-square model. The method described in this paper can therefore effectively address the synonymy and ambiguity of visual words and improve image classification accuracy by enhancing the discriminative ability of the visual dictionary.
3. Bag of visual words method based on PLSA and chi-square model for object category
For the training image dataset I = {I1,I2,...,Ik}, the method proposed in [6] is used to extract SIFT features, and the approximate K-means algorithm (AKM) is adopted to generate the visual dictionary. The entire process of the bag of visual words method based on PLSA and chi-square model for object category is shown in Fig. 2. First, PLSA is used to analyze the semantic co-occurrence probability of visual words and infer the latent image semantic topics, which yields the conditional probability of a specific word w given the unobserved latent topic z. Second, Bayesian estimation is used to infer the latent topic distribution induced by each individual word, P(z|w), and K-L divergence is used to measure the semantic distance between visual words and obtain the homoionyms, that is, words with similar semantics. Then, the mapping between SIFT features and semantically similar words is completed by adaptive soft-assignment. Finally, the chi-square model is introduced to analyze the correlations between visual words and the image categories, a number of “visual stop-words” with weak correlation are eliminated to reconstruct the visual vocabulary histograms, and object classification is completed with an SVM classifier.
Fig. 2. The flow of the bag of visual words method based on PLSA and chi-square model for object category
3.1 Visual semantic concept expression and measurement
The “semantic gap” is a widespread problem in computer vision, and it can be explained by the inconsistency between distances measured in feature space and in semantic space. Therefore, traditional methods [15,22], which measure the semantic distance in Euclidean space, cannot accurately reflect the actual semantic relevance between visual words. The method described in [21] represents semantic concepts by the conditional probability distribution of a specific word w given the image category and thereby achieves satisfactory classification accuracy. However, this method requires that different image categories do not contain the same semantic concept, a precondition that is difficult to guarantee in real-world applications. With the PLSA model, in contrast, the conditional probability distribution of a specific word w given the unobserved latent topic z can be obtained, and the semantic concept implied by the word can be expressed more accurately. The following describes how visual words are expressed with the PLSA model.
1 Visual semantic concept expression based on PLSA model
PLSA, proposed by Hofmann et al. [28], is a generative topic model for latent semantic analysis that is widely used in machine learning and information retrieval. For the training image set I = {I1,I2,...,Ik} and the visual words W = {w1,w2,...,wn} generated by AKM clustering, we obtain the co-occurrence frequency matrix N = [n(wi,Ij)] of images and visual words, where n(wi,Ij) is the number of times wi appears in image Ij. The joint probability of (w,I) can be calculated as Equation (1):
$$P(w, I) = \sum_{z \in Z} P(z)\, P(w \mid z)\, P(I \mid z) \tag{1}$$
where Z denotes the set of all topics in the latent semantic space. According to the maximum likelihood principle, P(z), P(w|z) and P(I|z) can be obtained by maximizing the log-likelihood in Equation (2) with the EM algorithm:
$$L = \sum_{i=1}^{n} \sum_{j=1}^{k} n(w_i, I_j) \log P(w_i, I_j) \tag{2}$$
Then, Bayesian estimation is used to infer the occurrence probability P(w) of word w and the latent topic distribution P(z|w) induced by an individual word as:
$$P(w) = \sum_{z \in Z} P(w \mid z)\, P(z) \tag{3}$$
$$P(z \mid w) = \frac{P(w \mid z)\, P(z)}{P(w)} \tag{4}$$
It should be noted that the number of topics in existing PLSA models is mostly a fixed value set manually from experience [29]. The topic model is trained and yields an image semantic representation over this fixed topic set. Setting the number of topics manually overlooks the fact that content ranges from simple to complex across image categories. In view of this, we use a density-based adaptive method for selecting the number of topics in the PLSA model [30]. When building the topic model for the semantic content of different image categories, this method obtains better topic analysis results because it automatically sets the number of semantic topics according to the complexity of the image content.
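As an illustration of how Equations (1)-(4) are typically evaluated, the following minimal Python sketch fits a symmetric PLSA model by EM on a small word-by-image count matrix and then applies Bayes' rule to obtain P(w) and P(z|w). The function names `plsa` and `topic_posterior`, the random initialization, the fixed iteration count, and the fixed number of topics are assumptions made for this example; the adaptive topic-number selection of [30] is not included, and this is not the authors' implementation.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit symmetric PLSA by EM.

    counts[i][j] = n(w_i, I_j). Returns (p_z, p_w_z, p_i_z) with
    p_w_z[z][i] = P(w_i|z) and p_i_z[z][j] = P(I_j|z).
    """
    rng = random.Random(seed)
    n_words, n_images = len(counts), len(counts[0])

    def random_dist(size):
        vals = [rng.random() + 1e-3 for _ in range(size)]
        total = sum(vals)
        return [v / total for v in vals]

    # Assumed initialization: uniform P(z), random P(w|z) and P(I|z).
    p_z = [1.0 / n_topics] * n_topics
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]
    p_i_z = [random_dist(n_images) for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_w = [[0.0] * n_words for _ in range(n_topics)]
        new_i = [[0.0] * n_images for _ in range(n_topics)]
        for i in range(n_words):
            for j in range(n_images):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(z | w_i, I_j), proportional to
                # the summand of Equation (1).
                joint = [p_z[z] * p_w_z[z][i] * p_i_z[z][j]
                         for z in range(n_topics)]
                norm = sum(joint) or 1e-300
                for z in range(n_topics):
                    r = counts[i][j] * joint[z] / norm
                    new_z[z] += r
                    new_w[z][i] += r
                    new_i[z][j] += r
        # M-step: renormalize the expected counts.
        total = sum(new_z)
        for z in range(n_topics):
            denom = new_z[z] or 1e-300
            p_w_z[z] = [v / denom for v in new_w[z]]
            p_i_z[z] = [v / denom for v in new_i[z]]
            p_z[z] = new_z[z] / total
    return p_z, p_w_z, p_i_z


def topic_posterior(p_z, p_w_z):
    """Bayes' rule: P(w) = sum_z P(w|z) P(z) and P(z|w) = P(w|z) P(z) / P(w)."""
    n_topics, n_words = len(p_z), len(p_w_z[0])
    p_w = [sum(p_z[z] * p_w_z[z][i] for z in range(n_topics))
           for i in range(n_words)]
    p_z_w = [[p_z[z] * p_w_z[z][i] / p_w[i] if p_w[i] > 0 else 1.0 / n_topics
              for z in range(n_topics)]
             for i in range(n_words)]
    return p_w, p_z_w
```

Calling `plsa(counts, n_topics)` followed by `topic_posterior(p_z, p_w_z)` gives one row of P(z|w) per visual word, which is the quantity the semantic distance in the next subsection operates on.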
2 Semantic distance measurement with K-L divergence
The same image may contain more than one latent semantic topic, and different semantic topics contribute differently to expressing the semantic content of images. Consequently, these semantic topics need to be weighted adaptively. Inspired by [25], we use K-L divergence [21] to measure the semantic distance between visual words, and the conditional entropy H is used to measure the discrimination of each topic. H can be represented as,
$$H(I \mid z) = -\sum_{j=1}^{k} P(I_j \mid z) \log P(I_j \mid z) \tag{5}$$
Equation (5) shows that a higher value of H means that the latent topic z has less discriminative capability. A Gaussian function is then used to normalize H(I|z∈Z) and generate the discrimination weight ω(z) as,
However, K-L divergence is asymmetric, so it cannot guarantee d(wi,wj) = d(wj,wi). Hence, the semantic distance d(wi,wj) between the visual words wi and wj is calculated in symmetrized form as,
According to Equations (3)-(7), we can calculate the semantic distance between visual words and obtain the semantically related homoionyms.
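The following Python sketch shows one way the entropy weighting and the symmetric K-L distance could be combined. It rests on several assumptions that are not taken from the paper: the entropy is the Shannon entropy of P(I|z), the Gaussian weight is exp(-H²/(2σ²)) with a hypothetical bandwidth `sigma`, and the distance is the topic-weighted average of the two K-L directions. All function names are illustrative.

```python
import math


def topic_entropy(p_i_z_row):
    """Shannon entropy of one topic's image distribution P(I|z)."""
    return -sum(p * math.log(p) for p in p_i_z_row if p > 0)


def topic_weights(p_i_z, sigma=1.0):
    """Assumed Gaussian mapping: low entropy gives a weight near 1."""
    return [math.exp(-topic_entropy(row) ** 2 / (2 * sigma ** 2))
            for row in p_i_z]


def semantic_distance(p_z_wi, p_z_wj, weights, eps=1e-12):
    """Topic-weighted symmetric K-L divergence between P(z|w_i) and P(z|w_j).

    0.5 * (KL(p||q) + KL(q||p)) equals 0.5 * sum((p - q) * log(p / q)),
    so every weighted term is non-negative.
    """
    d = 0.0
    for p, q, w in zip(p_z_wi, p_z_wj, weights):
        p, q = max(p, eps), max(q, eps)  # avoid log(0)
        d += 0.5 * w * (p - q) * math.log(p / q)
    return d


def homoionyms(word, p_z_w, weights, m):
    """Indices of the m words semantically closest to `word`, excluding itself."""
    dists = [(semantic_distance(p_z_w[word], p_z_w[k], weights), k)
             for k in range(len(p_z_w)) if k != word]
    return [k for _, k in sorted(dists)[:m]]
```

Under these assumptions the distance is symmetric and zero for identical topic distributions, which is the property the symmetrization is meant to provide.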
3.2 Adaptive soft-assignment
After obtaining the homoionyms through the PLSA model and K-L divergence, we construct the visual vocabulary distribution histogram with adaptive soft-assignment. To do so, we first analyze the fuzziness that arises when SIFT features are mapped to visual words, as illustrated in Fig. 3. In Fig. 3, small dots represent SIFT features, ovals represent visual words, and the diamond and the square denote two SIFT features with different fuzziness. If the diamond feature is close only to visual word w1 and far from the other visual words, we can assume that its semantic content is expressed by w1 alone; such a feature point has little or no fuzziness and is defined as a first-class feature. If the square feature lies at similar distances from visual words w2 and w3 (or from more words), we can assume that its semantic content should be expressed jointly by w2 and w3 or more visual words. This kind of feature point is fuzzier and is defined as a second-class feature.
Fig. 3. The sketch map of SIFT feature fuzzification
Suppose that the visual dictionary is W = {w1,w2,...,wn}, where n denotes the size of the visual dictionary. The adaptive soft-assignment strategy maps each class of SIFT feature to an appropriate number of words by calculating the distance between each SIFT feature and the homoionyms obtained with the method in section 3.1, thereby distinguishing the different kinds of SIFT features. The entire process can be described as follows.
Step 1: For image I = {r1,r2,...,ri,...,rT}, where T denotes the number of SIFT features in image I, we first find the visual word in the visual dictionary W that is closest to SIFT feature ri.
Step 2: According to Equation (7) in section 3.1, we obtain the m semantically closest visual words (indexed by j, 1 ≤ j ≤ m) to that word in the visual dictionary. We then calculate the Euclidean distance between the SIFT feature and each of the m words, and sort the distances from smallest to largest as d = {d1,d2,...,dj,...,dm}, where dj is the j-th smallest distance between the words and the feature point.
Step 3: Based on the principle Nadp = argmax{di ≤ α·d1}, (i = 1,2,...,m), the number of visual words Nadp assigned to ri is determined adaptively, and each assigned word is weighted. Here α is the “adaptive soft-assignment factor” that controls the number of assigned words; usually α ≥ 1. Repeating this process for every feature constructs the visual vocabulary distribution histograms with adaptive soft-assignment.
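Steps 1-3 can be sketched in Python as follows. Two points are assumptions rather than details from the paper: the nearest word itself is kept in the candidate set alongside its homoionyms, and the per-word weight is a normalized inverse distance, used here only as a stand-in. The names `adaptive_soft_assign`, `build_histogram`, and the `neighbors` lookup (the homoionym indices of each word) are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adaptive_soft_assign(feature, words, neighbors, alpha=2.2):
    """Return {word_index: weight} for one SIFT descriptor.

    words: list of visual-word centres.
    neighbors[k]: indices of the words semantically closest to word k.
    """
    # Step 1: nearest visual word in feature space.
    nearest = min(range(len(words)), key=lambda k: euclidean(feature, words[k]))
    # Step 2: Euclidean distances to the candidates, sorted ascending.
    # Assumption: the nearest word stays in the candidate set.
    candidates = [nearest] + [k for k in neighbors[nearest] if k != nearest]
    ranked = sorted((euclidean(feature, words[k]), k) for k in candidates)
    # Step 3: keep every candidate with d_i <= alpha * d_1.
    d1 = ranked[0][0]
    kept = [(d, k) for d, k in ranked if d <= alpha * d1]
    # Assumed weighting (not from the paper): normalized inverse distance.
    raw = [1.0 / (d + 1e-12) for d, _ in kept]
    total = sum(raw)
    return {k: r / total for (_, k), r in zip(kept, raw)}


def build_histogram(features, words, neighbors, alpha=2.2):
    """Accumulate the soft-assignment weights of all features of one image."""
    hist = [0.0] * len(words)
    for f in features:
        for k, w in adaptive_soft_assign(f, words, neighbors, alpha).items():
            hist[k] += w
    return hist
```

With this rule a first-class feature, whose nearest word is much closer than any homoionym, contributes to a single bin, while a second-class feature spreads its unit mass over several bins.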
3.3 “Visual stop-words” elimination
The chi-square model [31] is often used to test the independence of two random variables. Here, we use it to statistically analyze the correlations between visual words and each category of images, and thus to discover and eliminate visual stop-words. A smaller chi-square value means a weaker correlation between the visual word and the image categories, indicating weaker discrimination, and vice versa. By combining the chi-square model with term frequency, the “visual stop-words” can be eliminated more effectively. We assume that the appearance frequency of visual word w is independent of every image category Ij, Ij ∈ I (1 ≤ j ≤ k), where I = {I1,I2,...,Ik} is the training image set. The relationship between visual word w and the image categories in training set I is summarized in Table 1.
Table 1. The statistical relationships between visual word w and each image category
where N is the total number of images in I, n1j denotes the number of images in category Ij that contain word w, n2j is the number of images in category Ij that do not contain w, n+j denotes the total number of images in category Ij, and ni+ (i = 1,2) are the total numbers of images in the training set I that contain w and that do not contain w, respectively. The chi-square value between visual word w and each image category can then be calculated as,
The chi-square value indicates the degree of statistical correlation between w and each image category. Moreover, to account for the influence of word frequency, the chi-square value is given a corresponding weight as,
where tf(w) denotes the term frequency of w. Equation (9) accounts for both the word frequency of w and the statistical correlation between w and each image category. Therefore, a percentage S of the visual dictionary can be removed as “visual stop-words”, taking words in order of increasing chi-square value. The corresponding dimensions are eliminated when constructing the visual vocabulary histograms.
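A minimal Python sketch of the selection step is given below. It assumes the standard Pearson chi-square statistic over the 2 × k table described above (rows "contains w" and "does not contain w", one column per category). Because the exact form of the term-frequency weighting is not restated here, the sketch accepts optional caller-supplied per-word `weights` as a placeholder instead of guessing it. The function names are hypothetical.

```python
def chi_square(contain, totals):
    """Pearson chi-square statistic of a 2 x k table for one visual word.

    contain[j]: images of category j that contain the word (n_1j).
    totals[j]: all images of category j (n_+j).
    """
    n = sum(totals)
    rows = [contain, [t - c for c, t in zip(contain, totals)]]
    stat = 0.0
    for row in rows:
        row_sum = sum(row)  # n_i+
        for obs, col_sum in zip(row, totals):
            expected = row_sum * col_sum / n  # count expected under independence
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    return stat


def stop_word_indices(contain_per_word, totals, s=0.15, weights=None):
    """Indices of the fraction s of words with the smallest (weighted) score."""
    scores = [chi_square(c, totals) for c in contain_per_word]
    if weights is not None:
        # Placeholder for the term-frequency weighting; its form is assumed
        # to be a per-word multiplier supplied by the caller.
        scores = [x * w for x, w in zip(scores, weights)]
    n_remove = int(round(s * len(scores)))
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    return sorted(order[:n_remove])
```

A word that appears in the same proportion of images in every category scores zero and is removed first, while a word concentrated in one category scores high and is retained.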
4. Experiments
4.1 Experimental dataset setup and evaluation
In this experiment, we use the standard test image collections Caltech-256 [32] and PASCAL VOC 2007 [33] to evaluate object classification performance. The Caltech-256 dataset holds 29,780 images in 256 categories, with much higher intra-class variability and object location variability than Caltech-101. Each category contains at least 80 images. PASCAL VOC 2007 contains 9,963 images in 20 categories; 5,011 images are used for training, and the rest for testing. First, we run experiments on the whole Caltech-256 dataset to evaluate the effectiveness of PLSA, adaptive soft-assignment, and the chi-square model in turn. In each category, 50 images are chosen to form the training set used to generate the visual dictionary, and the remaining images form the testing set. The visual dictionary size is 15K. The SVM classifier is LIBSVM [34] with a radial basis function (RBF) kernel. To obtain reliable results, all object classification experiments are run 10 times and averaged to produce the final average precision. The experiments are run on a desktop with a four-core 3.1 GHz CPU and 4 GB of RAM. The performance criteria for object classification are recall rate, accuracy rate, the confusion matrix based on recall rate, and Average Precision (AP). The related definitions are as follows,
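For concreteness, the Python sketch below computes a confusion matrix and per-class rates under the usual definitions, which are assumed here: recall for a class is the fraction of its test images that are classified correctly, and the accuracy (precision) rate is the fraction of images predicted as that class that truly belong to it. The function names are illustrative, and the AP reported in the experiments is not computed by this sketch.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] = number of test images of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_recall(cm):
    """Correctly classified images of a class / all images of that class."""
    return [row[c] / sum(row) if sum(row) else 0.0 for c, row in enumerate(cm)]


def per_class_precision(cm):
    """Correctly classified images of a class / all images predicted as it."""
    cols = [sum(row[c] for row in cm) for c in range(len(cm))]
    return [cm[c][c] / cols[c] if cols[c] else 0.0 for c in range(len(cm))]
```

Dividing each row of the confusion matrix by its row sum gives the recall-based confusion matrix, whose diagonal equals `per_class_recall(cm)`.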
4.2 Experimental results
First, to evaluate how effectively the PLSA-based soft-assignment method (PLSA+SA) overcomes the synonymy and ambiguity of visual words, we compare it with the traditional soft-assignment (SA) method [35] and the hard-assignment (HA) method [14]. Fig. 4 depicts the relationship between the average precision of the different methods and the number of soft-assignment words. Fig. 4 shows that the average precision of the SA method and of the PLSA+SA method presented in this article is higher than that of the HA method. Because hard-assignment assigns each SIFT feature to its single nearest word, its AP is constant at 62.7% and does not change with the number of soft-assignment words. In contrast, the AP scores of the SA and PLSA+SA methods first increase with the number of soft-assignment words; once the number exceeds a certain value, however, the average precision decreases. The reason is that too few soft-assignments are not adequate to express the content of feature points, while too many lead to excessive assignment and introduce new redundant information. Moreover, the PLSA+SA method analyzes the similarity between words in terms of their semantic concepts and then assigns the corresponding feature points to a number of visual words with similar semantic concepts. Therefore, the PLSA+SA method proposed in this paper better overcomes the quantization error and other problems caused by the synonymy and ambiguity of visual words, and its average precision is superior to that of the traditional SA method.
Fig. 4. Comparison of the AP values of different methods
Note that in the experiment of Fig. 4, we assign the same number of words to each SIFT feature point without considering the differences between SIFT features. This inevitably maps some unambiguous local features to multiple visual words and introduces new noise and redundant information. As described in section 3.2, this problem can be overcome by analyzing the ambiguity category of each SIFT feature and then applying adaptive soft-assignment. Hence, after obtaining the homoionyms with the PLSA model, we verify the effectiveness of adaptive soft-assignment and analyze how it changes with the adaptive soft-assignment factor α by running object classification experiments with the traditional soft-assignment method (PLSA+SA) and the adaptive soft-assignment method (PLSA+ASA). The value m of the adaptive soft-assignment method in section 3.2 is set to 20; for the PLSA+SA method, the number of soft-assignment words is set to 6, which gives an average precision of 71.89%. The resulting AP values are shown in Fig. 5. Fig. 5 shows that as the factor α grows, SIFT features of different fuzziness categories are assigned more accurately to a number of homoionyms, and the average classification accuracy of the PLSA+ASA method increases. At α = 2.2, the AP score reaches its maximum of 75.47%, which is superior to that of the PLSA+SA method. However, when α increases beyond a certain point, the AP score tends to decrease, because too large an α also causes the over-assignment problem that usually occurs in the traditional soft-assignment method.
Fig. 5. The influence of factor α on AP
Next, we evaluate the impact of the percentage S on the AP and compare our method with PLSA+ASA without “visual stop-words” elimination and with PLSA+ASA combined with the method of [27]. The AP values of the different methods are shown in Fig. 6. Fig. 6 shows that eliminating a certain percentage of visual words, either with the chi-square model or with the method of [27], improves the average precision of object classification. Both methods reach their maximum AP when the percentage of removed visual words is S = 15%, and our method, PLSA+ASA+CSM, performs better than PLSA+ASA combined with the method of [27]. However, when the percentage of removed visual words is too large, some strongly representative words are inevitably eliminated, which greatly reduces the classification performance.
Fig. 6. The influence of the percentage S on AP
Furthermore, Fig. 7 shows the confusion matrix of 15 categories in Caltech-256 obtained by the PLSA+ASA method without visual stop-word elimination, and Fig. 8 depicts the confusion matrix of PLSA+ASA+CSM with a visual word elimination percentage of S = 15%. The results in both figures are based on the whole Caltech-256 dataset. From Fig. 7 and Fig. 8, we conclude that the “visual stop-words” elimination method in this paper effectively improves the recall rate of object classification overall.
Fig. 7. The confusion matrix without visual stop-word elimination
Fig. 8. The confusion matrix with visual word elimination percentage S = 15%
Finally, we run experiments on the PASCAL VOC 2007 dataset to further evaluate the effectiveness of our method on a large image dataset, with parameter values α = 2.2, m = 20, and S = 15% and a dictionary size of 10K. Using the same classifier, we compare the average precision of our method (PLSA+ASA+CSM) with that of the hard-assignment BoVW model (HA) [14], the soft-assignment BoVW model (SA) [35], the contextual-information BoVW model [21], and the LDA-based soft-assignment method (LDA+SA) [22]. The average precisions of the different methods are shown in Table 2. Table 2 shows that the SA and contextual-BoVW methods both introduce strategies to overcome the quantization error caused by the synonymy and ambiguity of visual words, and hence their object classification accuracy is clearly better than that of the HA method. Meanwhile, by incorporating the LDA model, the LDA+SA method expresses the image content more accurately, and its average precision is further improved. Compared with these previous methods, the method proposed in this paper achieves the highest average precision.
Table 2. The object classification results of different methods on the PASCAL VOC 2007 database
5. Conclusion
We have proposed a novel bag of visual words method based on PLSA and chi-square model for object category. First, to address the serious quantization error caused by the synonymy and ambiguity of visual words when constructing the visual vocabulary histograms, we use the PLSA model to obtain the probability distributions of semantic topics over visual words and then measure the semantic distance between visual words through K-L divergence, thereby obtaining the homoionyms in semantic space. Second, according to the fuzziness categories of SIFT features, an adaptive soft-assignment strategy maps each SIFT feature to an adaptively chosen number of homoionyms, which reduces the quantization error efficiently. Third, the chi-square model is adopted to analyze the correlation between each visual word and the image categories, and on that basis the “visual stop-words” induced by the limits of clustering algorithms or by image background noise are eliminated to reconstruct the histograms. Finally, object classification is implemented with an SVM classifier. The experimental results show that our method can overcome, to some degree, the synonymy and ambiguity of visual words as well as the quantization error problem. Moreover, the method effectively eliminates “visual stop-words” from the visual dictionary, which improves the object classification performance substantially. It should be noted that, although our method analyzes the distance between visual words at the semantic level, it cannot measure the semantic distance between a SIFT feature and a visual word. In addition, there are several other image vector representations, such as sparse coding and Fisher kernel coding. Therefore, how to bring the distance in feature space closer to the real semantic distance through distance metric learning, and how to construct a more efficient image vector representation, are important research questions for future work.