I. INTRODUCTION
The traditional sliding window algorithm is simple and widely used for candidate window selection. It was proposed in [1] and subsequently became a popular method [2].
There are two popular multi-scale sliding window algorithms. One scans the image with detection windows of multiple scales, while the other resizes the image to multiple scales and scans it with a fixed-size window. In both cases, adjacent sliding windows usually overlap heavily.
Although the sliding window algorithm has been widely used in various computer vision systems, it has two significant drawbacks. First, the candidate windows are highly redundant, which degrades real-time performance. An intuitive way to reduce the number of candidate windows is to increase the sliding step length, but this may cause some pedestrians to be missed. Second, some non-pedestrian background areas, such as the sky and some complex background windows, are also judged as pedestrians by the classifier, which causes false detections.
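To make the trade-off between redundancy and step length concrete, the following minimal sketch counts the window positions of a single-scale scan. The frame size, window size, step lengths, and function name are illustrative assumptions and are not values reported in this paper.

```python
def count_windows(img_w, img_h, win_w, win_h, step):
    """Number of window positions in a single-scale sliding window scan."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // step + 1  # positions along the x axis
    ny = (img_h - win_h) // step + 1  # positions along the y axis
    return nx * ny


# Assumed example: a 640 x 480 frame scanned with a 40 x 80 window.
dense = count_windows(640, 480, 40, 80, step=4)
sparse = count_windows(640, 480, 40, 80, step=16)
print(dense, sparse)  # 15251 988
```

A larger step removes most of the windows, but a pedestrian located between two coarse positions may no longer be covered well by any window, which is the miss risk described above.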
The Caltech Pedestrian Dataset consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 approximately minute-long segments) with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated [3]. The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels.
II. RELATED WORKS
Since candidate window redundancy leads to low detection efficiency, general target detection methods have been proposed to pre-select candidate areas with a high recall rate, low computational complexity, high quality, and a short running time [4]. As research has deepened in recent years, many general target detection methods have been proposed [5], among which selective search is a classic method [6].
Selective search was proposed by J.R.R. Uijlings et al. It combines exhaustive search and image segmentation, and applies hierarchical clustering to the merging of regions [7]. The method first divides the image into several small regions, and then merges the regions belonging to the same target to localize all the targets [8]. Compared with a traditional single strategy, selective search combines multiple strategies to enhance robustness. Additionally, compared with exhaustive search, the time cost is greatly reduced because the search space is much smaller. Because of its superior general target detection performance [9], selective search became popular in many state-of-the-art object detection methods, where it is used in the extraction phase of the target candidate windows [10, 11].
The selective search algorithm consists of two models. The fast model generates approximately 2000 windows on an image, with a recall rate of 98% and a mean average best overlap (MABO) of 0.804. The quality model produces about 10,000 windows with a recall rate of 99.1% and an MABO of 0.879 [12]. However, the average processing speed of the algorithm is far from the real-time requirement for fast extraction of object candidate windows in object detection [13]. In addition, the dimension of the selective search is too high.
The BING (Binarized Normed Gradients) algorithm [14] has received extensive attention from scholars and industry because of its superior overall detection performance. On the PASCAL VOC 2007 dataset, the BING algorithm not only achieves detection accuracy similar to that of the selective search algorithm and the objectness algorithm, but also improves the detection speed by three orders of magnitude. In only 3 ms, it can extract 1000 candidate windows that may contain objects, with a recall rate of about 96%. Therefore, it is worthwhile to improve the extraction of pedestrian candidate windows based on the BING algorithm.
A general object is an object that is not tied to a specific category. The BING algorithm replaces the traditional sliding window scanning method in the field of object detection, and extracts candidate windows that cover as many objects as possible within milliseconds. The BING algorithm is computationally efficient because it uses simple gradient magnitude features and a linear SVM (Support Vector Machine) classifier. Under a fixed-size window, the gradient magnitudes of the object and the background are significantly different. The gradient distribution of the object is cluttered, while the gradient distribution of the background is uniform. The main reason for this difference is that objects usually have well-defined closed boundaries and centers [15, 16].
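As a rough illustration of why the gradient magnitude feature is cheap to compute, the sketch below derives a gradient map with a simple [-1, 0, 1] difference and scores a window with a linear template. The clipped form min(|gx| + |gy|, 255) follows the definition commonly given for normed gradients in [14]. The clamped border handling, the function names, and the small test image are assumptions made for illustration, and this is not the authors' implementation.

```python
def normed_gradient(img):
    """Gradient magnitude map of a 2-D list of gray values in 0..255."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # [-1, 0, 1] differences; border pixels reuse the edge value (assumption).
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            out[y][x] = min(abs(gx) + abs(gy), 255)  # clip to one byte
    return out


def window_score(weights, feature):
    """Linear SVM response: inner product of the template and the feature."""
    return sum(wv * fv
               for w_row, f_row in zip(weights, feature)
               for wv, fv in zip(w_row, f_row))


# A flat background patch versus a patch containing a vertical edge.
flat = [[90] * 4 for _ in range(4)]
edge = [[0, 0, 255, 255] for _ in range(4)]
flat_ng = normed_gradient(flat)
edge_ng = normed_gradient(edge)
```

The flat patch yields an all-zero gradient map, whereas the edge patch yields strong responses along the boundary, which is the kind of difference the linear template can separate.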
In Fig. 1(a), the red dashed rectangles mark the general objects, which are a ship and a person, and the green rectangles mark random background regions. As shown in Fig. 1(c), after the normed gradient (NG) features of all the rectangles are extracted, the distribution patterns of the NG features from the red dashed rectangles and from the green rectangles are significantly different. The gradient features in the red boxes are more cluttered, while the gradient features in the green boxes are more evenly distributed.
The BING algorithm is highly efficient for the following reasons:
(1) The original image is scaled to 36 different scales. Although some original information is lost, the structural outline of the objects remains intact. Therefore, matching with an "8 × 8" template does not degrade the detection effect.
(2) The gradient feature contains a small amount of data, describing the contour information of an object. The BING algorithm further simplifies the image data: it discards the last four bits of each 8-bit value and represents the value by its first four bits. This data reduction halves the number of shift operations in the subsequent bit operations.
(3) From the computer hardware perspective, the operations on image pixels are carried out as aligned shift operations, which greatly accelerates the calculation process.
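The bit reduction in item (2) can be sketched as follows. The sketch keeps only the top four bits of an 8-bit gradient value and lists them as separate bit planes. The function names and the example value are assumptions for illustration; the actual BING implementation packs such bits into machine words, which is not reproduced here.

```python
def top_bits(value, n_bits=4):
    """Approximate an 8-bit value by its n_bits most significant bits."""
    shift = 8 - n_bits
    return (value >> shift) << shift


def bit_planes(value, n_bits=4):
    """The n_bits most significant bits, from the highest bit downwards."""
    return [(value >> (7 - k)) & 1 for k in range(n_bits)]


v = 0b10110111            # 183
approx = top_bits(v)      # 0b10110000
planes = bit_planes(v)    # [1, 0, 1, 1]
# Rebuilding the approximation from the bit planes gives the same value.
rebuilt = sum(b << (7 - k) for k, b in enumerate(planes))
```

With four bits kept, the approximation error of a single value is at most 15 gray levels, while only four bit planes instead of eight have to be processed.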
Fig. 1. Gradient distribution patterns of objects and background. (a) source image, (b) normed gradients map, (c) 8 × 8 NG feature, (d) learned model.
III. PEDESTRIAN OBJECT EXTRACTION ALGORITHM BASED ON AN IMPROVED BING ALGORITHM
In the original BING paper, the training set is PASCAL VOC 2007. The input image is resized to 36 different scales to detect objects of various sizes.
To better detect various pedestrian objects in daily street scenes, this paper proposes a pedestrian object extraction algorithm based on the improved BING algorithm.
The Caltech pedestrian dataset is selected as the training set. The object detection template in BING is set to the "8 × 16" size to match the contour of a pedestrian, and the aspect ratio of the pedestrian detection window is fixed at 1:2. The specific detection sizes are set to "20 × 40", "40 × 80", "60 × 120", "80 × 160", "100 × 200", "120 × 240", "140 × 280", and "160 × 320". The Caltech subsets set00~set05 are the training set and set06~set10 are the test set. The pedestrians in the dataset are divided into three groups by size: pedestrians at a close distance are more than 80 pixels in height, pedestrians at a medium distance are 30-80 pixels, and pedestrians at a long distance are less than 30 pixels. One frame out of every 30 frames is used, which gives 4250 training images. Fig. 2 shows some training samples in the Caltech dataset.
Fig. 2. Caltech Pedestrian Dataset Examples.
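The eight detection sizes and the three distance groups described above can be written down as a short sketch. It assumes that each detection size is handled by resizing the whole frame so that a window of that size shrinks to the "8 × 16" template; this mapping, the rounding, the handling of the 30 and 80 pixel boundaries, and all names are assumptions for illustration.

```python
TEMPLATE_W, TEMPLATE_H = 8, 16

# "20 x 40", "40 x 80", ..., "160 x 320": all with a 1:2 aspect ratio.
DETECTION_SIZES = [(20 * k, 40 * k) for k in range(1, 9)]


def resized_shape(img_w, img_h, det_w, det_h):
    """Frame size after scaling a det_w x det_h window down to the template."""
    return (round(img_w * TEMPLATE_W / det_w),
            round(img_h * TEMPLATE_H / det_h))


def distance_group(height_px):
    """Distance group of a pedestrian from its height in pixels."""
    if height_px > 80:
        return "near"
    if height_px >= 30:
        return "medium"
    return "far"


# Resized frame sizes for a 640 x 480 Caltech frame, one per detection size.
shapes = [resized_shape(640, 480, w, h) for w, h in DETECTION_SIZES]
```

Under this assumption, the smallest detection size corresponds to the largest resized frame, so most of the scanning cost falls on the scales that target distant pedestrians.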
The improved BING template training process for pedestrian detection is as follows:
(1) Preparation stage for true positive and false negative sets
The 4250 images of the Caltech training dataset are used in this step. The images are resized to 8 different sizes, and an "8 × 16" box is extracted at each pixel position of each resized image. The resized images at different scales are shown in Fig. 3.
Fig. 3. Training image resized to 8 different scales.
(2) First-level SVM training
The true positives and false negatives of all scales are resized to the "8 × 16" size, and the BING features are extracted for linear SVM training.
(3) Second-level SVM training
First, the BING template trained in the first level is loaded. The training images are resized to 8 different sizes. The first-level BING template is used for general object detection at each size, and a small number of candidate windows are retained at each size using non-maximum suppression. Next, the retained windows of all scales are compared with the annotation information. Windows that overlap an annotated box by more than 50% are taken as true positives, and the others are taken as false negatives. The detection scores of the true positives and false negatives at different scales are taken as the features. One SVM is trained for each scale, that is, eight SVMs are trained in total. The final weight and offset of each scale are thus obtained.
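The labeling rule of the second-level training can be sketched as below. The text only states a 50% overlap; the sketch assumes that the overlap is measured as intersection over union, as in the PASCAL VOC criterion. The box format, the label values, and the function names are also assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_windows(windows, annotations, threshold=0.5):
    """+1 if a window overlaps any annotated box by more than threshold, else -1."""
    return [1 if any(iou(w, g) > threshold for g in annotations) else -1
            for w in windows]


# Hypothetical retained windows and one annotated pedestrian box.
windows = [(0, 0, 40, 80), (100, 100, 140, 180)]
annotations = [(2, 0, 42, 80)]
labels = label_windows(windows, annotations)  # [1, -1]
```

The resulting labels, together with the first-level scores of the same windows, form the training pairs for the per-scale SVMs.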
The detection phase is divided into two steps. First, the input image is resized to 8 different sizes, and the 8 × 16 sliding window scans the 8 resized images using the first-level BING template. Non-maximum suppression is applied according to the score, and only part of the detection windows at each scale are retained. Then, the retained windows are used to calculate the final scores, which are output from high to low. The overall processes of training and testing of the BING template are shown in Fig. 4.
Fig. 4. Flow chart of training and pedestrian detection using a BING template. (a) Training module, (b) Test module.
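The two detection steps can be illustrated with the following minimal sketch. The paper does not specify the form of non-maximum suppression, so a commonly used greedy, overlap-based variant is swapped in here. The final score is assumed to be a per-scale linear function of the first-level score, using the weight and offset learned in the second-level training. All names, thresholds, and example values are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def rank_windows(candidates, calibration):
    """candidates: (box, first_level_score, scale_index) tuples.
    calibration: one (weight, offset) pair per scale (assumed linear form)."""
    scored = []
    for box, raw, k in candidates:
        weight, offset = calibration[k]
        scored.append((weight * raw + offset, box))
    return sorted(scored, key=lambda t: t[0], reverse=True)


boxes = [(0, 0, 40, 80), (2, 0, 42, 80), (200, 100, 240, 180)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # [0, 2]: the second box overlaps the first
ranked = rank_windows(
    [(boxes[0], 0.9, 0), (boxes[2], 0.7, 1)],
    calibration=[(1.0, 0.0), (2.0, 0.0)],
)
```

In this example the per-scale calibration reorders the two surviving windows, which shows why first-level scores from different scales are not compared directly.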
After the pedestrian detection BING templates are trained, they can be used to extract candidate windows and combined with any pedestrian classifier. Fig. 5 shows the overall flow of the proposed algorithm. For an input image, the BING template is first used to extract all candidate windows that may contain pedestrians. Then, these windows are input into the SVM (Support Vector Machine) classifier for classification to obtain the final test result.
Fig. 5. Process combined with an improved BING algorithm for pedestrian detection.
IV. EXPERIMENTAL RESULTS AND ANALYSIS
The experiments are performed on the Caltech dataset to verify the advantages of the proposed method. The method is expected to significantly reduce the detection time while keeping its accuracy comparable to that of the HOG (Histogram of Oriented Gradient) algorithm.
After the two-stage SVM training is completed, the linear SVM model learned with the BING features is shown in Fig. 6, which shows that the active white pixels are concentrated on the silhouette edge of a pedestrian. The SVM weights are very similar to the HOG feature weights learned with SVM.
Fig. 6. Pedestrian detection BING template for training.
The detection effect is verified by adjusting the first-level BING template threshold to generate different numbers of candidate windows. The number of candidate windows decreases as the BING threshold increases.
Table 1 shows the detection time and missed detection rate at different BING thresholds. When the threshold is set in the range of [-0.05, 0.01], the missed detection rate remains unchanged at 68%, and the larger the BING threshold is, the faster the detection is. When the BING threshold is increased further, the detection time continues to decrease, but the missed detection rate increases greatly because a large number of candidate windows containing pedestrians are discarded. Thus, the optimal detection effect is obtained when the BING threshold is 0.01. The detection speed of this algorithm is three times faster than that of the traditional Selective Search algorithm, which gives it higher value in practical applications.
The Miss Rate formula is:
\(\mathrm{MR} = \mathrm{FN} / (\mathrm{FN} + \mathrm{TP}),\) (1)
where FN is the number of false negatives and TP is the number of true positives. In Table 1, Time(s) indicates the time to process an image.
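Equation (1) can be evaluated directly, as in the short sketch below. The counts in the example are hypothetical and are chosen only so that the result matches the 68% miss rate mentioned in the text; they are not counts reported in this paper.

```python
def miss_rate(fn, tp):
    """Miss rate MR = FN / (FN + TP) from Eq. (1)."""
    total = fn + tp
    if total == 0:
        raise ValueError("miss rate is undefined without ground-truth pedestrians")
    return fn / total


# Hypothetical counts: 68 missed and 32 detected pedestrians.
mr = miss_rate(fn=68, tp=32)
print(f"{mr:.0%}")  # 68%
```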
Table 1. Detection results at different BING thresholds.
V. CONCLUSIONS AND FUTURE WORKS
First, the development history of sliding window detection in the object detection field is introduced, and its shortcomings are summarized. Then, the widely used general object detection technologies are introduced, especially the selective search algorithm. Because the selective search algorithm spends a large amount of time on the extraction of candidate windows, an improved BING algorithm is used in this paper to remarkably accelerate the extraction while achieving detection effects similar to those of the selective search algorithm. The dedicated pedestrian dataset from Caltech is used to train the BING template, and the template size is set to "8 × 16", that is, an aspect ratio of 1:2, according to the appearance characteristics of a pedestrian. In addition, only windows with an aspect ratio of 1:2 are retained during the detection phase, that is, pedestrian objects are detected at 8 different scales. Finally, the pedestrian candidate windows extracted with the BING template are input to the SVM classifier for accurate classification. The experimental results show that the detection time in this paper is only one-third of that of the original sliding window detection, while the detection accuracy does not degrade.
The features extracted by a CNN (convolutional neural network) are commonly better than those of traditional algorithms. We are going to combine the manually designed candidate windows with a CNN to further improve the detection performance.
Acknowledgement
This work was supported partially by the National Natural Science Foundation of China (Grants No. 61763033, 61866028, 61662049, 61741312, 61881340421, 61663031, and 61866025), the Key Program Project of Research and Development (Jiangxi Provincial Department of Science and Technology) (20171ACE50024, 20161BBE50085), the Construction Project of Advantageous Science and Technology Innovation Team in Jiangxi Province (20165BCB19007), the Application Innovation Plan (Ministry of Public Security of P. R. China) (2017YYCXJXST048), and the Open Foundation of Key Laboratory of Jiangxi Province for Image Processing and Pattern Recognition (ET201680245, TX201604002).
References
- Papageorgiou C, Poggio T., "A trainable system for object detection." International Journal of Computer Vision, vol. 38, no. 1, pp. 15-33, Nov. 2000. https://doi.org/10.1023/A:1008162616689
- Dalal N, Triggs B., "Histograms of Oriented Gradients for Human Detection." Computer Vision and Pattern Recognition (CVPR), San Diego, pp. 886-893, Jun. 2005.
- Dollar P, Wojek C., "Pedestrian detection: A benchmark." Computer Vision and Pattern Recognition (CVPR), Miami, pp. 304-311, Jun. 2009.
- Alexe B, Deselaers T, Ferrari V., "Measuring the objectness of image windows." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2189-2202, 2012. https://doi.org/10.1109/TPAMI.2012.28
- Endres I, Hoiem D., "Category-independent object proposals with diverse ranking." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 222-234, 2012. https://doi.org/10.1109/TPAMI.2013.122
- Alexe B, Deselaers T, Ferrari V., "What is an object?" in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, San Francisco, pp. 73-80, Jun. 2010.
- Endres I, Hoiem D., "Category independent object proposals," in European Conference on Computer Vision. Springer Berlin Heidelberg, pp. 575-588, 2010.
- Van de Sande K E A, Uijlings J R R, Gevers T., "Segmentation as selective search for object recognition," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, Barcelona, pp. 1879-1886, Nov. 2011.
- Zhang Z, Warrell J, Torr P H S., "Proposal generation for object detection using cascaded ranking SVMs," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, Barcelona, pp. 1497-1504, Nov. 2011.
- Uijlings J R R, van de Sande K E A, Gevers T., "Selective search for object recognition," in International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, Sep. 2013. https://doi.org/10.1007/s11263-013-0620-5
- Girshick R., "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, pp. 1440-1448, Dec. 2015.
- Girshick R, Donahue J, Darrell T., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition. Columbus, pp. 580-587, Jun. 2014.
- Felzenszwalb P F, Huttenlocher D P., "Efficient graph-based image segmentation." International Journal of Computer Vision, vol. 59, no. 2, pp. 167-181, Jun. 2004. https://doi.org/10.1023/B:VISI.0000022288.19776.77
- Cheng M M, Zhang Z, Lin W Y, et al., "BING: Binarized normed gradients for objectness estimation at 300fps," in Proceedings of the IEEE conference on computer vision and pattern recognition. Columbus, pp. 3286-3293, Jun. 2014.
- Forsyth D A, Malik J, Fleck M M., "Finding pictures of objects in large collections of images," in International Workshop on Object Representation in Computer Vision. Springer Berlin Heidelberg, pp. 335-360, 1996.
- Heitz G, Koller D., "Learning spatial context: Using stuff to find things," in European Conference on Computer Vision. Springer Berlin Heidelberg, pp. 30-43, 2008.