Stochastic Non-linear Hashing for Near-Duplicate Video Retrieval using Deep Feature applicable to Large-scale Datasets

  • Byun, Sung-Woo (Department of Computer Science, Graduate School, SangMyung University) ;
  • Lee, Seok-Pil (Department of Computer Science, Graduate School, SangMyung University)
  • Received : 2018.09.26
  • Accepted : 2019.06.29
  • Published : 2019.08.31

Abstract

With the development of video-related applications, the amount of media content available online has increased dramatically. A substantial portion of Internet videos are near-duplicate videos (NDVs), so near-duplicate video retrieval (NDVR) is important for eliminating near-duplicates from web video search results. This paper proposes a novel NDVR system that supports large-scale retrieval with efficient and accurate performance. To this end, we extract keyframes from each video at regular intervals and then extract both commonly used features (LBP and HSV) and a new image feature from each keyframe. A recent study introduced this new image feature, which provides more robust information than existing features even under geometric changes and complex editing of images. We convert the vector set consisting of the extracted features into binary codes through a set of hash functions, making similarity comparison more efficient because similar videos are more likely to map into the same buckets. Lastly, we compute similarities to search for NDVs. We examine the effectiveness of the proposed NDVR system and compare it against previous NDVR systems using the public video collection CC_WEB_VIDEO. The proposed system's performance is very promising compared to previous NDVR systems.

1. Introduction

With the development of video-related applications such as video-sharing websites, video broadcasting, and advertising services, the amount of available media content has increased dramatically. Among this huge amount of media content, there is a substantial portion of duplicate and near-duplicate videos (NDVs) that have gone through video editing and redistribution [1-5]. Such NDVs strongly affect video-related applications, and removing them requires near-duplicate video retrieval (NDVR). The goal of NDVR is the accurate and efficient retrieval of NDVs, defined as identical or approximately identical videos that are almost exact duplicates of each other. NDVR is therefore utilized in various fields, with applications such as copyright protection, video monitoring, and recommendation systems. For instance, we can use NDVR in search engines, ensuring that users can enjoy a range of videos distinct from each other rather than face an endless list of the same or similar clips. Another example is using NDVR to reduce the risk that video products face of being compromised by unauthorized copying, editing, and redistribution; NDV detection is therefore important for copyright protection.

NDVR involves searching for videos identical or approximately identical to existing videos [6] and consists of three main parts: 1) keyframe extraction, 2) feature extraction, and 3) similarity computation. Keyframe extraction selects multiple representative images from each video at regular intervals. Feature extraction then generates numerical characteristics from each keyframe using domain knowledge; well-extracted features increase the retrieval algorithm's effectiveness. Similarity computation calculates similarities between videos using the extracted features and finally retrieves near-duplicate videos based on those similarities. An NDVR system computes similarities and retrieves relevant videos by exhaustively comparing the features extracted from all pairwise keyframes. Because no single feature type is sufficiently robust to capture all variations in the information, previous studies have proposed video representation methods that combine multiple feature types. Many studies have used the HSV histogram [2,3,7,8] as the global feature type, which captures properties such as contrast changes and sensitivity to brightness, and the local binary pattern (LBP) as the local feature type [7,9,10]. Even though comparing all available keyframe pairs and using complex, high-dimensional features can offer accurate retrieval results, in practice this raises time complexity problems.

This paper proposes a novel NDVR system that supports large-scale retrieval by extending the hash functions to non-linear functions through dimension conversion. In addition, this system contributes efficient and accurate retrieval performance. To this end, we extract keyframes from each video at regular intervals. Then, we extract both commonly used features (LBP and HSV) and a new image feature from each keyframe. The new image feature, introduced by recent research, can provide more robust information than existing features even under geometric changes and complex editing of images because it involves object localization without bounding boxes. We convert the vector set consisting of the extracted features into binary codes through a set of hash functions so that similarity comparison becomes efficient, as similar videos are more likely to map into the same buckets. Lastly, we compute similarities to search for NDVs. We examine and compare the NDVR system's effectiveness against previous NDVR systems using the public video collection CC_WEB_VIDEO.

The remainder of this paper is organized as follows: Section 2 presents related work on NDVR systems, Sections 3 and 4 explain the basic NDVR system and the proposed method, respectively, Section 5 shows the experimental results, and Section 6 concludes this work.

2. Related Works

Most NDVR approaches carry out retrieval by extracting features from video content. One common strategy is to select keyframes from videos through uniform sampling and then extract low-level features to characterize each keyframe. The most common features are color information such as RGB and HSV histograms [2,3,7,8,11], often referred to in previous research as global features. However, because the characteristics of videos can be changed by major variations such as histogram normalization and color variation, global features are suited to retrieving videos that are almost identical to the query video with only minor variations [7]. Compared to global features, local features are more robust to complex editing and geometric changes, and they generally perform better on videos with complex scenes. The local binary pattern (LBP) is commonly utilized as a local feature [12]. Other features in this category include the Difference of Gaussians (DoG) [13], the scale-invariant feature transform (SIFT) [14], and a combination of principal component analysis (PCA) and SIFT referred to as PCA-SIFT [15]. Among recent studies using such image features, several have investigated areas such as image understanding and finding salient regions in an image [16,17]. Meanwhile, Bolei Zhou et al. [18] used a convolutional neural network (CNN) to propose deep features with localizability applicable to images. We refer to this feature as class activation maps (CAM). The CAM for a particular category indicates the discriminative image regions used by the CNN to identify that category, as shown in Fig. 1.

Fig. 1. Examples of CAM

Comparing complex, high-dimensional features can provide accurate retrieval results but is very time-consuming. The hashing technique enables large-scale retrieval through rapid pairwise similarity comparison between videos. Classic hashing technologies include locality sensitive hashing (LSH), which converts video data into binary codes through a set of random projections. Using LSH means that similar objects are more likely to map into the same buckets. In addition, previous studies have developed various extensions of LSH. For example, J. Song et al. proposed a learning-based hashing method that jointly learns pseudo class labels and the hash codes for given objects based on a discriminant embedding framework driven by linear discriminant analysis [19]. Other examples are spectral hashing (SPH) [10], self-taught hashing (STH) [9], semi-supervised hashing (SSH) [20], and supervised hashing with kernels [21]; they construct hash codes using different distance measures.
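To make the random-projection idea concrete, the following is a minimal sketch; the dimensions, seed, and NumPy implementation are illustrative choices of ours, not details of the cited systems.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 1024, 64                   # feature dimension and code length (arbitrary)
R = rng.standard_normal((d, s))   # s random projection directions

def lsh_code(x: np.ndarray) -> np.ndarray:
    """Binary code: which side of each random hyperplane the vector falls on."""
    return (x @ R >= 0).astype(np.uint8)

# Nearby feature vectors fall on the same side of most random hyperplanes,
# so their codes agree in most bits and they land in the same buckets.
```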

3. Near-duplicate video retrieval

3.1 Definition of near-duplicate video

Previous research defines NDVs as follows: near-duplicate web videos are identical or approximately identical videos close to exact duplicates of each other, but different in file format, encoding parameters, photometric variations (color or lighting changes), editing operations (caption, logo, and border insertion), length, and certain modifications (added or removed frames). A user would clearly identify the videos as "essentially the same" [6]. NDVs are two videos that look the same or approximately the same; the two do not have to be pixel-identical for us to consider them duplicates. Whether two videos are duplicates depends entirely on the type of their differences and the purpose of the comparison. Originally, exact-duplicate videos and near-duplicate videos had different definitions, but this paper includes exact duplicates in its definition of near-duplicate videos.

3.2 Structure of near-duplicate video retrieval

Generally, we use NDVR to search for identical or approximately identical videos and output a ranked list of videos relevant to a specific user-provided query. We search for NDVs using a retrieval system constructed as follows:

Step 1: Keyframe extraction

Video data contains both image information and other significant information such as audio data, and it has a concurrent, temporal, complex, and informal structure. We generally summarize such large data by extracting keyframes from the video. The keyframe extraction method extracts keyframes from each video at regular intervals; assuming that we extract $n$ keyframes from a video, later processing steps focus on the information provided by those $n$ extracted keyframes.

Step 2: Feature extraction

Feature extraction is a process that generates numerical characteristics based on domain knowledge about the data. An important factor for feature extraction is that it requires compact and reliable features, because it deals with big data such as video. Therefore, previous studies have proposed various approaches using different features. Global features, which reflect the characteristics of an image as a whole, are suitable for identifying copies under formatting modifications such as frame resolution changes and format conversion. Unlike global features, local features are extracted by segmenting an image into regions and computing a set of color, texture, and shape features for each region. We consider such local features robust and tolerant to geometric and photometric variation. However, there are too many local points for an efficient, exhaustive comparison, even between two frames. The notation for the extracted feature vectors is as follows: let $x = \{x_1, x_2, \ldots, x_d\}$ be the feature vector for one feature type. Assuming that $n$ keyframes are extracted from a video, each feature type forms a matrix of size $n \times d$, where $d$ is the length of each feature vector. For example, we use $x_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$ to denote the first feature type's vector for the $i$th keyframe.

Step 3: Hash code generation

Many previous studies have used hashing when retrieving huge data such as video. Hashing converts an input vector into a fixed-length binary string through hash functions. Generally, a longer hash code provides better performance but is also more time-consuming. The most classic and general hashing approach is random projection; this generates a binary code (hash code) by projecting the extracted features onto random lines designated as an auxiliary space, and this hash code makes similar videos more likely to map into the same buckets. This step uses a set of $s$ hash functions $\{h_1, h_2, \ldots, h_s\}$, each of which takes extracted features as input and returns a binary number. Finally, the set of hash functions generates a hash code matrix of size $n \times s$, where $n$ is the number of keyframes.

Step 4: Similarity computation

We generate a unique hash code matrix for each video; we use the Hamming distance between generated hash codes to assess the similarity between videos. This returns a list of videos that possess the highest similarity to the query video.

4. Proposed NDVR system

Fig. 2. Flowchart of the proposed NDVR system

Fig. 2 shows a flowchart of the proposed NDVR system. The system consists of keyframe extraction, feature extraction, hashing, and similarity computation. The keyframe extraction method extracts $n$ keyframes from each video at regular intervals, as mentioned in Section 3, and then features are extracted from the $n$ keyframes.

4.1 Keyframe extraction

As mentioned in Section 3.2, the keyframe extraction method extracts keyframes from videos at regular intervals. In this research, we set the interval such that the method extracts a keyframe every 10 seconds.
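As a concrete illustration, the following is a minimal sketch of this uniform sampling using OpenCV; the use of cv2 and the helper name extract_keyframes are our assumptions, since the paper does not state its extraction tooling.

```python
import cv2

def extract_keyframes(video_path: str, interval_sec: float = 10.0):
    """Grab one frame every `interval_sec` seconds (uniform sampling)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps * interval_sec)))  # frames between keyframes
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            keyframes.append(frame)
        idx += 1
    cap.release()
    return keyframes
```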

4.2 Features

1) HSV

We calculate a color histogram for each keyframe in the video; this is a representative global feature and reflects the global statistics, or summaries, of low-level color in videos. We represent each feature as $HSV_i = \{HSV_{i1}, HSV_{i2}, \ldots, HSV_{im}\}$, which includes hue, saturation, and value. The histogram is normalized to an overall scale as follows:

$$\overline{HSV}_{ij} = \frac{HSV_{ij}}{M} \qquad (1)$$

where $M$ is the largest value in the histogram and $HSV_{ij}$ is the $j$th value of the color histogram at keyframe $i$.
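A minimal sketch of this feature follows, assuming OpenCV histograms; the 256-bins-per-channel choice is our assumption, picked to match the 768-dimensional HSV feature reported in the experiments (3 x 256 = 768).

```python
import cv2
import numpy as np

def hsv_feature(frame: np.ndarray) -> np.ndarray:
    """Concatenated H, S, V histograms normalized by the largest bin, as in Eq. (1)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    ranges = [(0, 180), (0, 256), (0, 256)]  # 8-bit hue spans 0-179 in OpenCV
    hists = [cv2.calcHist([hsv], [c], None, [256], list(ranges[c])).ravel()
             for c in range(3)]
    h = np.concatenate(hists)
    return h / h.max()                       # divide by M, the largest bin value
```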


2) Local binary pattern

LBP represents the texture of images, and studies commonly utilize it as a local feature because it tends to be more robust to complex editing and to photometric and geometric changes than global features. We extract LBP features by comparing the brightness of the eight pixels adjacent to the center pixel.

Fig. 3. How to extract an LBP feature

As shown in Fig. 3, if the brightness value of a surrounding pixel is greater than or equal to that of the central pixel, the corresponding bit is set to 1; if smaller, it is set to 0. Then, we convert the generated binary number 01110011 to the decimal number 115. Fig. 4 shows an example LBP feature.

Fig. 4. An example LBP feature (a) original image (b) visualization of LBP (c) the histogram of LBP

The original LBP application uses this histogram as a texture model for the corresponding image region (e.g., the texture of grass, forest, land, sky, or an object). Although LBP was originally developed to classify image textures, many studies have used it in recognition and detection problems because it can express complex pattern changes.
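The following is a minimal sketch of the 3x3 LBP computation described above; the bit ordering is one conventional choice, and the helper name lbp_image is ours.

```python
import numpy as np

def lbp_image(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP: each neighbor >= center contributes one bit (cf. Fig. 3)."""
    # 8 neighbor offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = gray[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= (neigh >= center).astype(np.uint8) << np.uint8(7 - bit)
    return out

# The LBP feature of a keyframe is then the 256-bin histogram of `out`,
# matching the 256 LBP dimensions used in the experiments.
```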

3) Class activation maps

Apart from HSV and LBP, we can use many other feature extraction methods to characterize keyframes. Bolei Zhou et al. recently proposed deep features for discriminative localization; we refer to this feature as class activation maps (CAM) [18]. A class activation map for a particular category indicates the discriminative image regions used by the convolutional neural network (CNN) to identify that category, as shown in Fig. 1. CAM makes it easy to identify the discriminative image regions in a single forward pass for a wide variety of tasks, even ones the network was not originally trained for. For example, in Fig. 5, even if we train a network using Fig. 5 (a), it can identify similar image regions in Fig. 5 (b).

Fig. 5. Localizing class-specific image regions

This research uses a deep CNN composed of five convolutional layers, five pooling layers, and one fully connected layer. We apply a Rectified Linear Unit (ReLU) activation function to the five convolutional layers and a softmax function to the fully connected layer. The first convolutional layer has 32 5x5 filters and uses 'same' padding; the layer's output is halved spatially by the pooling layer. Consider as an example an input image of 224x224x3 components; the resulting output of the first convolutional and pooling layers would be 112x112x32 components. The remaining convolutional layers also use 5x5 filters with 'same' padding. We identify and categorize the video information using the ground truth provided by the dataset. Then, we train the designed CNN using the videos as input. Fig. 6 (a) shows the whole architecture and (b) shows examples of CAM features extracted from the network; in the example figures, we visualize the values that correspond to the features as heat maps.

Fig. 6. (a) The network’s whole architecture (b) Example CAM features extracted from network (a)  
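For concreteness, the following Keras sketch reflects the description above. The paper fixes only the first layer (32 filters of 5x5, 'same' padding, ReLU, pooling that halves the resolution); the filter counts of the remaining layers are assumptions, and the global average pooling layer is the ingredient CAM requires per Zhou et al. [18].

```python
from tensorflow.keras import layers, models

def build_cam_cnn(num_classes: int):
    """5 conv + 5 pooling layers, global average pooling, softmax classifier."""
    m = models.Sequential()
    m.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu",
                        input_shape=(224, 224, 3)))
    m.add(layers.MaxPooling2D(2))            # 224 -> 112, as in the text
    for filters in [64, 128, 256, 512]:      # assumed filter counts
        m.add(layers.Conv2D(filters, (5, 5), padding="same", activation="relu"))
        m.add(layers.MaxPooling2D(2))
    # Global average pooling lets the softmax weights act as per-class weights
    # over the last conv feature maps, which is what CAM visualizes [18].
    m.add(layers.GlobalAveragePooling2D())
    m.add(layers.Dense(num_classes, activation="softmax"))
    return m

# CAM for class c: weighted sum of the last conv layer's feature maps using
# the Dense-layer weights of class c, upsampled to the input resolution.
```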

4.3 Hashing

Hashing has drawn attention in large-scale data retrieval. In recent research, Yanbin Hao et al. proposed a stochastic multiview hashing algorithm to facilitate the construction of a large-scale NDVR system [7]. They learned hash functions restricted to linear functions by maximizing a mixture of generalized retrieval precision and recall scores, and then converted multiple features into binary hash code strings. In this study, we extend the hash functions to non-linear functions through dimension conversion.

Given multiple feature vectors $X$ from a set of $n$ keyframes, involving all feature types, the vector $X_i = \{x_{i1}, x_{i2}, \ldots, x_{id}\}$ stores the features of the $i$th keyframe. We convert these feature vectors into binary hash codes of size $s$ through the hash functions, expressed as $h_i = \{h_{i1}, h_{i2}, \ldots, h_{is}\}$ where $h_{ik} \in \{0, 1\}$. We generate hash codes by constructing $s$ hash functions $h_{ik} = f_k(X_i)$. These functions are as follows:

$$z_i = X_i w^{(1)} + b^{(1)} \qquad (2)$$

$$\hat{z}_i = \sigma(z_i) = \frac{1}{1 + \exp(-z_i)} \qquad (3)$$

$$\hat{h}_i = \sigma\left(\hat{z}_i\, w^{(2)} + b^{(2)}\right) \qquad (4)$$

$$h_{ik} = \begin{cases} 1, & \hat{h}_{ik} \ge 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

Equation (2) projects each feature vector onto $m$ one-dimensional directions using linear functions, where the size of $w^{(1)}$ is $d \times m$ and the size of $b^{(1)}$ is $1 \times m$. Then, the sigmoid in Equation (3) makes each output value approximate 0 or 1. The projected vectors are mapped into another space to extend the hashing to non-linear functions, which we call dimension conversion and which is equivalent to a neural network; we express this process as Equation (4). Equation (5) then makes the hash code 0 or 1 through thresholding. Generally, we refer to $\hat{h}_{ik}$ as the relaxed hash code. In NDVR, one classical way to generate the hash code for a video is to process the relaxed hash codes of its representative keyframes by first averaging and then thresholding them [21].

(6)

Equation (6) shows the generated hash code vector for a video $v$, in which $ind_v$ is the set of keyframe indices belonging to the video and $|ind_v|$ is its cardinality. Finally, we generate the $V \times s$ hash code matrix for all $V$ videos.
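Under the reconstructed Equations (2)-(6), the hashing step can be sketched as follows; the weight shapes and the 0.5 threshold are assumptions consistent with the sigmoid output range.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relaxed_code(X, W1, b1, W2, b2):
    """Two-layer non-linear hashing, Eqs. (2)-(4): project, squash, project again."""
    z1 = sigmoid(X @ W1 + b1)      # Eqs. (2)-(3): linear map (d -> m), then sigmoid
    return sigmoid(z1 @ W2 + b2)   # Eq. (4): dimension conversion to s relaxed bits

def video_hash(X_keyframes, W1, b1, W2, b2, thr=0.5):
    """Eqs. (5)-(6): average the keyframes' relaxed codes, then threshold."""
    h_relaxed = relaxed_code(X_keyframes, W1, b1, W2, b2)    # n x s
    return (h_relaxed.mean(axis=0) >= thr).astype(np.uint8)  # s-bit video code
```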

When producing a list of output videos close to a query video, accurately measuring the similarity between videos is very significant. Therefore, this study focuses on how to compute optimal hash codes from the feature vectors so as to preserve correct similarity information between videos. When ground truth information about the relevance between videos is available, it is helpful to construct probabilities by rewarding actually related videos with a score of 1 and non-related or unknown ones with a score of 0. We refer to these probabilities as $p$ and express them as:

$$p_{i|j} = \frac{r_{ij}}{\sum_{k \ne j} r_{kj}}, \qquad r_{ij} = \begin{cases} 1, & \text{videos } i \text{ and } j \text{ are related} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

Assuming that each $p_{i|j}$ represents the probability of the $i$th video given query video $j$, a natural way of learning the hash functions is to re-compute such probabilities in the space of the hash functions and minimize the difference between the two sets of probabilities. Therefore, the probability $q$ is:

$$q_{i|j} = \frac{\exp\left(-\|\hat{h}_i - \hat{h}_j\|^2\right)}{\sum_{k \ne j} \exp\left(-\|\hat{h}_k - \hat{h}_j\|^2\right)} \qquad (8)$$

By learning the hash functions using the available ground truth information, we increase the probability that $x_i$ and $x_j$ extracted from near-duplicate videos map together. We can assess the hashing's quality by examining how well the probabilities $p$ and $q$ match. We measure the difference between the two conditional probabilities $p$ and $q$ using the following KL-divergence:

$$O = \sum_j KL\left(P_j \,\|\, Q_j\right) = \sum_j \sum_i p_{i|j} \log \frac{p_{i|j}}{q_{i|j}} \qquad (9)$$

As the cross-entropy value decreases, the probability of similar objects mapping into the same group increases. The hash functions in Equations (2), (3), (4), and (5) consist of weight and bias parameters; therefore, we can convert the optimization of the hash functions into a minimization problem over the composite KL-divergence value. We solve this problem by employing a gradient descent algorithm, and we can compute the gradient using the following compound-function derivation.

(10)

(11)

As has been established in many machine-learning studies, placing the bias on the 0th weight vector means that there is no need to determine the gradient of the bias separately. Evidently, the targeted gradients depend on the components $\partial O / \partial z_i$, $\partial z_i / \partial w^{(2)}$, $\partial z_i / \partial h_i$, and $\partial z_i / \partial w^{(1)}$. These components can be calculated as follows.

(12)

(13) (14) (15)

Substituting (12), (13), (14), and (15) into (10) and (11) permits the derivation of the complete gradient formulations.

Generating binary codes through the learned hash functions means that similar videos are more likely to map into the same buckets. In addition, calculating similarity using binary codes decreases the retrieval time, because we only compute the Hamming distance, which uses bit operations. This avoids costly pairwise keyframe comparisons and can effectively improve retrieval efficiency. In terms of video retrieval, computing a hash code has a low cost of around $O(3d)$, where $d$ is the length of the input vector. This phase involves very simple operations such as linear combination, sigmoid, and thresholding compared to the retrieval phase, so it does not affect the retrieval time. In the retrieval phase, the bit-count operations of the Hamming distance calculation lead to a very fast online NDVR system; therefore, NDVs can be found by a linear search in $O(n)$ [22]. This paper demonstrates the resulting efficiency in the results section.
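A minimal sketch of this retrieval-phase comparison: packing each s-bit code (here s = 900) into bytes lets XOR plus a popcount replace any per-keyframe feature comparison.

```python
import numpy as np

def pack(code_bits: np.ndarray) -> np.ndarray:
    """Pack a 0/1 hash code vector (e.g., s = 900 bits) into uint8 bytes."""
    return np.packbits(code_bits.astype(np.uint8))

def hamming(a_packed: np.ndarray, b_packed: np.ndarray) -> int:
    """Hamming distance via XOR and popcount, i.e., pure bit operations."""
    return int(np.unpackbits(np.bitwise_xor(a_packed, b_packed)).sum())

# Retrieval is a linear scan: compute hamming(query, v) for every video v
# and return the videos with the smallest distances (highest similarity).
```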

5. Experiments and results

5.1 Dataset and metric

This study tested the proposed method through experiments using a publicly available web video dataset. The CC_WEB_VIDEO dataset [6] consists of 12,790 video clips downloaded from video-sharing websites such as YouTube, Google, and Yahoo! through keyword searches, organized into 24 sets, with 398,015 keyframes in total. In previous research, two non-expert assessors were asked to watch the videos in this dataset one query at a time and label every video with a status (E: exact duplicate, S: similar video, V: different version, M: major change, L: longer version, X: dissimilar video, or -1: video does not exist) according to their judgment. Therefore, this dataset provides reliable ground truth information for all video clips. In addition, the most popular video was selected as the seed video of each query for near-duplicate video retrieval.

Retrieval performance evaluations commonly use the classic metric of mean average precision (MAP); we use both the precision-recall curve and MAP. For one query, the average precision (AP) is:

$$AP = \frac{1}{|G|} \sum_{k=1}^{|D|} P(k)\, rel(k) \qquad (16)$$

where $G$ is the ground truth set of redundant videos, $D$ is the detected one, $P(k)$ is the precision over the top $k$ detected videos, and $rel(k)$ is 1 if the $k$th detected video belongs to $G$ and 0 otherwise. MAP is the mean of the AP scores over all queries.
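A sketch of the AP and MAP computation, assuming the standard average-precision definition given above; the helper names are ours.

```python
def average_precision(ranked_ids, ground_truth):
    """AP of a ranked detection list D against a ground-truth set G (Eq. (16))."""
    G = set(ground_truth)
    hits, score = 0, 0.0
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in G:              # rel(k) = 1
            hits += 1
            score += hits / k     # precision over the top k detections
    return score / len(G)

def mean_average_precision(results, truths):
    """MAP: mean of the per-query AP values."""
    aps = [average_precision(r, g) for r, g in zip(results, truths)]
    return sum(aps) / len(aps)
```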

5.2 Experimental setup

The hash code length $s$ influences retrieval performance and efficiency, so selecting the optimal hash code length is important. We evaluate the retrieval performance for different lengths by varying the length from 500 to 1,000 with a step size of 50. We applied LBP, HSV, and CAM for this test, keeping all experimental conditions identical except the hash code length.

Table 1. MAP performance according to hash code length

Table 1 shows the changes in MAP performance. The results show quite similar retrieval performance across the 800-1,000 range even though the hash code lengths differ. In addition, regarding retrieval time, the longer the hash code, the longer the retrieval takes. Therefore, we fixed the hash code length at s = 900, balancing MAP and computation time.

5.3 Baselines

In this section, we describe the baseline algorithms against which we compare our method.


1) Spectral hashing (SH) [10]

Spectral hashing analyzes the $k$ smallest single-dimension analytical eigenfunctions of $L_p$ using a rectangular approximation along every PCA direction. It uses spectral relaxation, similar to PCA: the method finds the smallest eigenfunctions of the data whose dimensions have been reduced by PCA and finally converts these to binary codes along the $k$ smallest eigenvalues.


2) Multiple feature hashing (MFH) [23]

This system proposed a sophisticated multiview method called MFH by extending SPH. MFH learns the training videos' hash codes along with a group of hash functions that generate hash codes for videos outside the training set; it encodes the information provided by the HSV and LBP features as a neighbor graph and seeks hash functions that preserve the desired neighbor structure.

3) Stochastic Multiview hashing (SMVH) [7]

This method learns binary strings to characterize data samples by combining multiple feature types and auxiliary information through a stochastic matching procedure of neighborhood probabilistic models. It learns the mapping functions stochastically by maximizing a mixture of generalized retrieval precision and recall scores, approximated by the composite Kullback-Leibler (KL) divergence computed between two probabilistic models constructed in the original feature space and a relaxed hash code space.


4) Self-taught hashing (STH) [9]

This system relies on the hashing method STH, which shares a similar hash code training procedure with SPH but achieves out-of-sample extension through a different scheme based on a linear SVM.

5) Hierarchical fusing (HF) [6]

This system combines global and local features by first using the color histogram signature to detect NDVs with high confidence and filter out the very novel ones, and then performing a pairwise comparison based on local features to further judge the uncertain videos.

6) Unsupervised Stochastic Multiview hashing (USMVH) [7]

This system is an unsupervised version of SMVH.

The following section presents the results of the overall comparison between the proposed method and these baselines.

5.4 Results

For the experiment, we extracted 768 HSV, 256 LBP, and 1,024 CAM features from each keyframe. We computed the retrieval speed using Python 3.5 running on a server with an Intel i7 4770 CPU, 16 GB RAM, and a 64-bit Windows 7 operating system.

Table 2. Experiment results

Fig. 7. Experiment results with a Precision-Recall curve

Table 2 summarizes the MAP performance of all methods and their retrieval speeds on CC_WEB_VIDEO, and Fig. 7 shows the corresponding precision-recall curves. According to these results, the combination of LBP, HSV, and CAM features provides better performance than cases using only LBP and HSV. In addition, the proposed method (98.98%) outperforms the other hashing methods that use LBP and HSV. Table 2 also compares retrieval speed. The proposed method's set of feature vectors is larger than that of methods using only LBP and HSV, so the proposed method needs more hash functions; consequently, it has a longer retrieval time. However, the retrieval time is still dramatically lower than without hashing.

Fig. 8. AP performance comparison

Finally, we tested the average precision (AP) of the different feature sets over each of the 24 queries. For most queries, the feature set that includes LBP, HSV, and CAM performs better than the one that includes only LBP and HSV. Although there are a few individual cases, such as Q10, Q16, and Q24, in which the LBP and HSV feature set performs better than LBP, HSV, and CAM, this does not change the overall conclusion when all queries are taken into account.

6. Conclusion

This paper proposes a novel NDVR system that supports large-scale retrieval. To this end, we extracted keyframes from each video at regular intervals and then extracted both commonly used features (LBP and HSV) and a new image feature from each keyframe. We retrieved NDVs accurately by exploiting the auxiliary information the new image feature provides, such as the object localization of keyframes. The extracted features make up a vector set that we convert into simple binary strings through a set of mapping functions, making the similarity comparison efficient. Lastly, we calculated similarities to search for NDVs. We examined the NDVR system's effectiveness and compared it against previous NDVR systems using the public video collection CC_WEB_VIDEO. The proposed method addresses important accuracy issues in recent NDVR studies and contributes to performance improvement.

Acknowledgment

This research was supported by a 2019 Research Grant from Sangmyung University.

References

  1. Liu, J., Huang, Z., Cai, H., Shen, H. T., Ngo, C. W., and Wang, W., "Near-duplicate video retrieval: Current research and future trends," ACM Comput. Surv., vol. 45, no. 4, Art. no. 44, 2013.
  2. J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, "Multiple feature hashing for real-time large scale near-duplicate video retrieval," in Proc. of 19th ACM Int. Conf. Multimedia, pp. 423-432, 2011
  3. M. Cherubini, R. De Oliveira, and N. Oliver, "Understanding near-duplicate videos: A user-centric approach," in Proc. of 17th ACM Int. Conf. Multimedia, pp. 35-44, 2009.
  4. H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou, "UQLIPS: A real-time near-duplicate video clip detection system," in Proc. 33rd Int. Conf. Very Large Data Bases, pp. 1374-1377, 2007.
  5. H.-K. Tan, C.-W. Ngo, R. Hong, and T.-S. Chua, "Scalable detection of partial near-duplicate videos by visual-temporal consistency," in Proc. of 17th ACM Int. Conf. Multimedia, pp. 145-154, 2009.
  6. X. Wu, A. G. Hauptmann, and C.-W. Ngo, "Practical elimination of near-duplicates from web video search," in Proc. of 15th ACM Int. Conf. Multimedia, pp. 218-227, 2007.
  7. Yanbin Hao, Tingting Mu, Richang Hong, Meng Wang, Ning An, John Y. Goulermas, “Stochastic Multiview Hashing for Large-Scale Near-Duplicate Video Retrieval,” IEEE Transactions on Multimedia, Vol. 19, No. 1, pp. 1-14, 2016. https://doi.org/10.1109/TMM.2016.2610324
  8. L. Shang, L. Yang, F. Wang, K.-P. Chan, and X.-S. Hua, "Real-time large scale near-duplicate web video retrieval," in Proc. of 18th ACM Int. Conf. Multimedia, pp. 531-540, 2010.
  9. D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in Proc. of 33rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, pp. 18-25, 2010.
  10. Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. of Adv. Neural Inf. Process. Syst. Conf., pp. 1753-1760, 2009.
  11. J. Yuan, L.-Y. Duan, Q. Tian, S. Ranganath, and C. Xu, "Fast and robust short video clip search for copy detection," in Proc. of Adv. Multimedia Inf. Process. Conf., pp. 479-488, 2004.
  12. G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915-928, Jun. 2007. https://doi.org/10.1109/TPAMI.2007.1110
  13. D.-G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  14. D.-G. Lowe, "Object recognition from local scale-invariant features," in Proc. of Int. Conf. Comput. Vis., pp. 1150-1157, 1999.
  15. Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in Proc. of IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., pp. 506-513, Jun.-Jul. 2004.
  16. Chenggang Yan, Liang Li, Chunjie Zhang, Bingtao Liu, Yongdong Zhang, Qionghai Dai, "Cross-modality Bridging and Knowledge Transferring for Image Understanding," IEEE Transactions on Multimedia. (Early Access), pp. 1-1, 2019
  17. Chenggang Yan, Liang Li, Chunjie Zhang, Bingtao Liu, Yongdong Zhang, Qionghai Dai, "A Fast Uyghur Text Detector for Complex Background Images," IEEE Transactions on Multimedia,Vol. 20, Issue. 12, pp. 3389-3398, 2018. https://doi.org/10.1109/TMM.2018.2838320
  18. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba, "Learning Deep Features for Discriminative Localization," in Proc. of Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2921-2929, 2016.
  19. J. Song, L. Gao, Y. Yan, D. Zhang, and N. Sebe, "Supervised hashing with pseudo labels for scalable multimedia retrieval," in Proc. of 23rd ACM Int. Conf. Multimedia, pp. 827-830, 2015.
  20. R. Salakhutdinov and G. E. Hinton, "Learning a nonlinear embedding by preserving class neighbourhood structure," in Proc. of 11th Int. Conf. Artif. Intell. Statist., pp. 412-419, 2007.
  21. W. Liu, J.Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 2074-2081, 2012.
  22. A. Gionis et al., "Similarity search in high dimensions via hashing," in Proc. of 25th Int. Conf. Very Large Data Bases, pp. 518-529, 1999.
  23. J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, "Effective multiple feature hashing for large-scale near-duplicate video retrieval," IEEE Trans. Multimedia, vol. 15, no. 8, pp. 1997-2008, Dec. 2013. https://doi.org/10.1109/TMM.2013.2271746
  24. David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  25. P. Viola, M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001.
  26. Mustafa Ozuysal, Michael Calonder, Vincent Lepetit, Pascal Fua, "Fast Keypoint Recognition Using Random Ferns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 448-461, Jan. 2009. https://doi.org/10.1109/TPAMI.2009.23