Location-Based Saliency Maps from a Fully Connected Layer using Multi-Shapes

  • Kim, Hoseung (Visual Information Processing, Korea University) ;
  • Han, Seong-Soo (Division of Liberal Studies, Kangwon National University) ;
  • Jeong, Chang-Sung (Department of Electrical Engineering, Korea University)
  • Received : 2020.11.10
  • Accepted : 2020.12.19
  • Published : 2021.01.31

Abstract

Recently, with the development of technology, computer vision research based on the human visual system has been actively conducted. Saliency maps have been used to highlight areas that are visually interesting within the image, but they can suffer from low performance due to external factors, such as an indistinct background or light source. In this study, existing color, brightness, and contrast feature maps are subjected to multiple shape and orientation filters and then connected to a fully connected layer to determine pixel intensities within the image based on location-based weights. The proposed method demonstrates better performance in separating the background from the area of interest in terms of color and brightness in the presence of external elements and noise. Location-based weight normalization is also effective in removing pixels with high intensity that are outside of the image or in non-interest regions. Our proposed method also demonstrates that multi-filter normalization can be processed faster using parallel processing.

Keywords

1. Introduction

Recent technological developments have promoted research on computer vision based on the human visual system [1,2]. The human visual system recognizes and processes detailed and selectively prominent visual stimuli within a given scene, which can then be used to efficiently locate certain objects or areas of interest and to complete complex vision tasks, such as understanding the scene [3-5]. As such, research seeking to emulate this process is active in the fields of cognitive psychology, neuroscience, and computer vision, with a variety of potential applications, including CCTV, traffic control systems, games, and smartphone applications.

An image that highlights visually interesting areas with high pixel intensity is referred to as a saliency map. In a salient area, the brightness, color, and orientation differ from those of the background and other areas. In general, humans focus on salient areas within an image. Saliency maps were first introduced for situational awareness and image resizing. A method for automatically generating saliency maps was first published by Koch in 1985 [6]. Later, research on object tracking and clustering was actively conducted using saliency maps. Feature maps are generated using texture, color, and contrast and then combined into a single image using enhancement and normalization. However, objects and the background may be similar in color, and it can be difficult to distinguish objects from ambiguous backgrounds that contain external elements such as the sun, forests, or the ocean.

Saliency maps can be used to efficiently search for relevant content in a variety of media types, including images and videos, and to automatically extract objects of interest. However, it is difficult to extract an object from an image when there is little difference between the object and the background or when there is interference from external elements. To overcome these issues, we propose an approach that produces improved saliency maps by generating high-level feature maps with filtering through a fully connected layer. The proposed method generates a min/max contrast map in addition to the existing color map to increase the difference in contrast and color between the background and the object of interest. Compared to using only a color map and an intensity map, this strengthens the boundary between the object and the background in images where that boundary is indistinct due to the presence of external elements.

We create high-level feature maps through texture filtering, orientation filtering, and normalization and combine them in a fully connected layer. A location-based normalization process then weights each pixel according to its distance from the center of the feature maps, producing a saliency map that emphasizes the salient region near the center. In addition, because filtering, normalization, and weight calculation require iterative computation, parallel processing is used to reduce the calculation time compared to previously proposed saliency maps.

2. Related work

2.1 Saliency map

Areas that differ strongly from their surroundings in color, brightness, and orientation and that are visually interesting are referred to as salient objects. In general, a viewer focuses on salient objects within a scene [7-9]. Models that identify salient objects were originally created for use in situational awareness and image resizing. Computer vision research into the automatic tracking of salient objects was first introduced by Koch, with the saliency map proposed by Itti being a representative study. A saliency map is a grayscale image in which the salient area of the original image is expressed using pixel intensity. Many studies have used saliency maps to identify salient objects and understand human visual attention. Recently, a significant volume of research has been conducted in the fields of cognitive psychology [10,11], neuroscience [12,13], and computer vision [14-16] to improve complex visual information processing, such as understanding a scene.


Fig. 1. Architecture of saliency map

Itti explained the human visual search process using feature integration theory. This theory is based on biological structures and has since become the basis of several salient object models. The model proposed by Itti uses Gaussian pyramids to decompose an input image into eight scales based on features such as color, brightness, and orientation. Feature maps are then generated for each of these scales based on the difference and contrast between the center and the surroundings. The feature maps for each scale are combined to create normalized conspicuity maps, and these are then linearly combined to generate a saliency map. The most salient areas appear with the highest brightness, so the result expresses the degree of saliency as pixel contrast.

There are two main approaches to creating a saliency map: bottom-up, using the internal information (i.e., intrinsic cues) of the image, such as brightness, color, contrast, and texture; and top-down, using external information (i.e., extrinsic cues) obtained by learning the relationship with similar images. The bottom-up method is the more widely used of the two. A common requirement for methods that do not utilize learned information is a strong difference in color and brightness between the object and the background. This approach is used for intuitive, explicit, or implicit salient object detection when the salient area has high contrast with the background. However, the salient area may not be detected when the difference in the color, brightness, and directionality of the pixels between the background and the salient area is too small or is subject to external interference. In this paper, we use min/max contrast to maximize the difference between salient areas and the background, which improves performance for images of uniform color with little variation in brightness and contrast.

2.2 Fully connected layer

Fully connected layers, in which a single input is used to extract various features through multiple filters, are widely used in convolutional neural networks (CNNs) [17-21]. In general, multiple filters are employed to extract and learn features: because a single image has several features, it is difficult to extract and learn them using only one filter, so multiple filters are needed. A layer composed of filters that extract different features is useful for determining how these various features are composed within an input image.

An important issue when employing a fully connected layer is how to classify the features extracted by the multiple filters and which representative features to select. A variety of methods for representative feature extraction from a fully connected layer in a CNN have been suggested, including max pooling [22], average pooling [23], L2-norm pooling [24], and subsampling [25]. Of these, max pooling is used most often, while the softmax function [26,27] is generally preferred for multi-class classification over the sigmoid function, which yields only a binary (true/false) decision.
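
As a toy illustration of how these operators differ, the following NumPy sketch (the function names are ours, not from any cited work) contrasts max pooling, average pooling, and softmax on a small feature map.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Toy 2x2 pooling over a feature map whose sides are divisible by 2."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fmap, "max"))                 # strongest response in each 2x2 block
print(pool2x2(fmap, "avg"))                 # mean response in each 2x2 block
print(softmax(np.array([2.0, 1.0, 0.1])))   # class scores -> probabilities summing to 1
```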

Our proposed method employs repeated filtering and normalization through the fully connected layer. Normalization removes noise and weakly weighted values and strengthens strongly weighted values within the saliency map, producing results similar to those of the softmax function in image processing.

2.3 Location-based normalization

Evaluating areas of interest and the importance of objects is useful in computer vision tasks such as image classification and object tracking [28-31]. In general, objects and regions of interest have various vectors for different colors, contrasts, and orientations when compared to the background [32]. The most common methods for extracting and classifying vectors extract mathematically validated features and calculate their weights. However, these methods produce results that differ from those produced by humans when evaluating the importance of the areas of interest and objects because there is a high probability that the boundary between the background and the object will include noise and vectors from non-important objects. Our proposed approach thus mathematically calculates weights according to the pixel location. All pixels in the resulting image are calculated using a normalization process, with the weight increasing for pixels closer to a specific location.

2.4 Canny edge filter

The Canny edge filter is an algorithm developed by John F. Canny in 1986 to find contours within an image while removing weak gray-level edges [33]. This filter consists of several steps. First, a Gaussian filter is used to smooth the image. Second, a Sobel filter is used to calculate the magnitude of the gradient vector. Then, to obtain thin edges, every pixel that is not the local maximum along the gradient direction is set to 0 (non-maximum suppression). Finally, two thresholds are used to extract connected edges, with edges traced from the high threshold down to the low threshold. In this paper, an edge image is created from each of the images in which the color and brightness have been separated.
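
A minimal OpenCV sketch of this pipeline follows; the file path and thresholds are illustrative, and cv2.Canny performs the gradient, non-maximum-suppression, and hysteresis steps internally.

```python
import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # illustrative path
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)           # step 1: Gaussian smoothing
# steps 2-4 (Sobel gradients, non-maximum suppression, hysteresis between the
# two thresholds) are carried out inside cv2.Canny
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
cv2.imwrite("canny_edges.png", edges)
```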

2.5 Laplacian filter

A Laplacian filter uses the second derivative to determine the strength of an edge, whereas the first derivative indicates only the presence or absence of an edge [34]. Because the second derivative attenuates low frequencies and emphasizes high frequencies, it is well suited to representing edge strength. In this paper, this filter is used to reveal edges of different intensities in the images separated by color and brightness.
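
A corresponding OpenCV sketch for the Laplacian edge-strength image, again with illustrative paths and parameters:

```python
import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)    # illustrative path
blurred = cv2.GaussianBlur(gray, (3, 3), 0)              # suppress noise before differentiating
lap = cv2.Laplacian(blurred, cv2.CV_64F, ksize=3)        # second derivative (signed edge strength)
edge_strength = cv2.convertScaleAbs(lap)                 # absolute magnitude as an 8-bit image
cv2.imwrite("laplacian_edges.png", edge_strength)
```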

3. Model

Generally, color images of various resolutions are used as input. Following the structure of a conventional saliency map, the proposed model generates high-dimensional feature maps via filtering with a fully connected layer. The input image is subsampled with Gaussian filters into eight octaves, yielding image-reduction factors ranging from 1:1 to 1:256.

Each feature map is filtered through the fully connected layer and then computed using a series of linear "center–surround" operations similar to visual receptive fields. Human visual neurons are most sensitive to a small central region, while the response to the non-central area (i.e., the surround) is suppressed. This structure is particularly suitable for detecting positions that stand out from their surroundings and reflects a general cortical principle. Center–surround operations calculate the difference between a fine scale and a coarse scale: the center is a pixel at scale c ∈ {2, 3, 4}, and the surround is the corresponding pixel at scale s = c + δ with δ ∈ {3, 4}; the coarser image is interpolated to the finer scale so that the difference is computed at the same position.
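
The following sketch gives our reading of this center–surround computation; the pyramid helper, scale indices, and interpolation choice are assumptions based on the standard formulation, not code from the paper.

```python
import cv2
import numpy as np

def gaussian_pyramid(channel, levels=9):
    """Level 0 is the input; each further level is blurred and halved in size."""
    pyr = [channel.astype(np.float32)]
    for _ in range(1, levels):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, delta):
    """|center(scale c) - surround(scale c+delta)|, with the surround upsampled to scale c."""
    center = pyr[c]
    surround = cv2.resize(pyr[c + delta], (center.shape[1], center.shape[0]),
                          interpolation=cv2.INTER_LINEAR)
    return cv2.absdiff(center, surround)

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)       # illustrative path
pyr = gaussian_pyramid(gray)
feature_maps = [center_surround(pyr, c, d) for c in (2, 3, 4) for d in (3, 4)]
```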


Fig. 2. General architecture of the model

3.1 Feature extraction

Feature extraction in our proposed method uses Gaussian filtering to generate color and contrast maps that differ from those of previous saliency maps. The color feature creates four channels (R, G, B, and Y), similar to previous saliency maps. The R, G, and B channels are normalized by intensity so that hue is decoupled from brightness, and broadly tuned color channels for R, G, B, and Y are generated.

The contrast feature produces two contrast channels. A min/max contrast map is generated through histogram expansion and contraction for contrast enhancement and suppression. Center–surround operations then generate a feature map between the center fine scale c and the surrounding coarse scale s.

The first of the feature maps represents the color channels, following the "color double-opponent" system of the cortex. Human visual cells are excited by one color of a pair and inhibited by the other for the pairs red/green, green/red, blue/yellow, and yellow/blue. Therefore, an RG(c, s) feature map is generated for red/green and green/red, and a BY(c, s) feature map is generated for blue/yellow and yellow/blue.

\(\begin{array}{l} RG(c, s)=|(R(c)-G(c))-(G(s)-R(s))| \\ BY(c, s)=|(B(c)-Y(c))-(Y(s)-B(s))| \end{array}\)       (1)
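
A hedged sketch of Eq. (1) follows; the broadly tuned channel definitions are the common Itti-style formulation and are our assumption, as are the pyramid depth and the illustrative scale pair.

```python
import cv2
import numpy as np

def pyramid(channel, levels=6):
    """Gaussian pyramid: level 0 is the input, each level halves the size."""
    p = [channel]
    for _ in range(1, levels):
        p.append(cv2.pyrDown(p[-1]))
    return p

bgr = cv2.imread("input.png").astype(np.float32)             # illustrative path
b, g, r = cv2.split(bgr)
lum = (b + g + r) / 3.0 + 1e-6                                 # avoid division by zero
# broadly tuned color channels (assumed Itti-style definitions)
R = np.clip(r - (g + b) / 2.0, 0, None) / lum
G = np.clip(g - (r + b) / 2.0, 0, None) / lum
B = np.clip(b - (r + g) / 2.0, 0, None) / lum
Y = np.clip((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0, None) / lum

Rp, Gp, Bp, Yp = pyramid(R), pyramid(G), pyramid(B), pyramid(Y)
c, s = 1, 4                                                    # illustrative center/surround scales
up = lambda m: cv2.resize(m, (Rp[c].shape[1], Rp[c].shape[0]))
RG = np.abs((Rp[c] - Gp[c]) - up(Gp[s] - Rp[s]))               # Eq. (1), red/green opponency
BY = np.abs((Bp[c] - Yp[c]) - up(Yp[s] - Bp[s]))               # Eq. (1), blue/yellow opponency
```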

A second feature map is constructed by calculating the weights of the min/max contrast map. The min contrast map obscures the strong boundaries of the image but removes small, ambiguous areas, and the max contrast map enhances the boundaries between the object and the background. Therefore, the weight of the area that is strongly observed in both is calculated to generate the feature map C(c, s).

\(C(c, s)=\frac{\operatorname{Max}(c, s)+\operatorname{Min}(c, s)}{2}\)       (2)
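
One possible reading of the min/max contrast maps and Eq. (2) is sketched below, assuming that "expansion" means stretching the histogram to the full range and "contraction" means compressing it toward mid-gray; the exact construction and the center–surround step are omitted here.

```python
import cv2
import numpy as np

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # illustrative path

# max contrast: histogram expansion to the full [0, 255] range (sharpens boundaries)
lo, hi = gray.min(), gray.max()
max_contrast = (gray - lo) / max(hi - lo, 1e-6) * 255.0

# min contrast: histogram contraction toward mid-gray (suppresses small, ambiguous regions)
min_contrast = 128.0 + (gray - gray.mean()) * 0.25

# Eq. (2): average of the two maps, so boundaries strong in both remain prominent
contrast_feature = (max_contrast + min_contrast) / 2.0
```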

3.2 Improved saliency map

In general, a saliency map indicates salient (bright) areas using scalar weights at every position in the image (the field of view) and guides attention priority based on these weights. It uses bottom-up inputs, modeled in a manner similar to neural networks, combined from multiple feature maps. However, because all feature maps are combined, some feature maps contribute noise. In addition, salient areas that appear in many feature maps may still not be visible in regions with surrounding noise or in inconspicuous areas.

The fully connected layer employs Canny, Laplacian, and Gabor filters. The Canny and Laplacian filters are used to emphasize areas with connected high-frequency edges; after the center–surround process, their results are combined with equal weights. Finally, the feature map created from color and contrast is turned into a result map by using the Gabor filter to emphasize the four orientations to which humans react most sensitively. Existing saliency maps use a normalization method that combines feature maps in order to emphasize areas that are commonly noticeable and to eliminate the noise that appears in individual feature maps. This paper presents normalization based on location relative to the image center. In general, an important object exhibits large differences in size, color, brightness, and orientation, and viewers react more sensitively to objects closer to the center of the image. We therefore apply a normalization in which, after the salience of the color, brightness, and orientation has been estimated at each position, pixels near the center of the image are emphasized.

(1) Normalization emphasizes the salient area of the image.

(2) The values in the map are normalized to a fixed range [0..C] in order to determine different weights according to the distance from the center; \( C_{x y}=\frac{C(W, H)}{2}\).

(3) The map is multiplied by \(|C-D(x, y)|\), where D(x, y) is the distance of pixel (x, y) from the center (see the sketch below).
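
A hedged sketch of steps (1)–(3), where D(x, y) is taken to be the Euclidean distance of pixel (x, y) from the image center and C its largest possible value; the exact weighting profile used in the paper may differ.

```python
import numpy as np

def location_weighted(sal_map):
    """Emphasize pixels near the image center (our reading of steps (1)-(3))."""
    h, w = sal_map.shape
    cy, cx = h / 2.0, w / 2.0                              # image center
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)        # D(x, y): distance from the center
    c_max = np.sqrt(cy ** 2 + cx ** 2)                     # C: largest possible distance
    weighted = sal_map * np.abs(c_max - dist)              # step (3): multiply by |C - D(x, y)|
    rng = weighted.max() - weighted.min()
    return (weighted - weighted.min()) / max(rng, 1e-6)    # steps (1)-(2): renormalize to [0, 1]
```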

In this paper, we produce a saliency map by combining the feature maps created by the fully connected layer with repeated normalization. Filtering with a fully connected layer produces many feature maps and therefore requires significant processing time; combining the generated feature maps during normalization reduces this time. Normalization also eliminates the noise produced after filtering and emphasizes the remaining features, so the combined result map contains less noise and fewer weakly salient areas. The combination uses across-scale addition to create conspicuity maps, i.e., combined feature maps.
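
A schematic of this combination step under our assumptions: each feature map is normalized (the operator below is only a crude stand-in for an Itti-style N(.)) and the maps are resized to a common scale and summed into a conspicuity map; the paper's exact operator may differ.

```python
import cv2
import numpy as np

def normalize_map(m):
    """Rescale to [0, 1] and promote maps with one dominant peak (rough N(.) stand-in)."""
    m = m.astype(np.float32)
    m = (m - m.min()) / max(m.max() - m.min(), 1e-6)
    local_mean = cv2.blur(m, (31, 31)).mean()      # crude proxy for the mean of local maxima
    return m * (m.max() - local_mean) ** 2

def conspicuity(feature_maps, size):
    """Across-scale combination: resize each normalized map to 'size' (w, h) and sum."""
    acc = np.zeros((size[1], size[0]), dtype=np.float32)
    for fm in feature_maps:
        acc += cv2.resize(normalize_map(fm), size)
    return acc
```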

3.3 Parallel processing

Filtering through the fully connected layer requires more time for the normalization calculation because more result maps are produced than in traditional methods. In this paper, we employ parallel processing using multiple threads to overcome this problem. By processing the convolution calculations of the repeated normalization in parallel, the processing time can be reduced effectively, so high-resolution images can be processed efficiently.
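
A minimal sketch of such a multi-threaded normalization pass, assuming each feature map can be processed independently (in CPython the benefit comes mainly from NumPy/OpenCV releasing the GIL; a process pool is an alternative):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def normalize(fm):
    """Placeholder per-map normalization for the filtering + normalization pass."""
    fm = fm.astype(np.float32)
    return (fm - fm.min()) / max(fm.max() - fm.min(), 1e-6)

def normalize_all_parallel(feature_maps, workers=8):
    """Normalize every feature map concurrently; the maps do not depend on each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize, feature_maps))

# usage: feature_maps is a list of 2-D NumPy arrays produced by the filtering stage
# normalized = normalize_all_parallel(feature_maps)
```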

4. Experimental results

We employ a test environment with a 4.2 GHz Ryzen processor and 16 GB of memory. An input image of 1280×1024 is used, and the processing speeds for different image resolutions are compared. The min contrast results blur the contrast and borders of the area of interest but also tend to suppress the background and noise. The max contrast results exhibit a large difference in contrast and boundary between the areas of interest and the background. Using our proposed method, the boundaries that are common to and strong in both appear in the min/max contrast results. Fig. 3 shows the results of emphasizing the boundaries that appear strongly in the min/max contrast map.


Fig. 3. Input image (top), min contrast image (middle) and max contrast image (bottom)

We compare a conventional saliency map with the proposed saliency map. Improved performance is observed for images in which the contrast and color differences of the area of interest are small. Fig. 4 shows the input images together with the results of the original and proposed saliency maps; the proposed saliency map focuses more strongly on the object (brighter pixels) than the previous method.


Fig. 4. Input image (top), original saliency map image (middle) and proposed saliency map (bottom)

To evaluate the performance of location-based normalization, an image is binarized to show only the bright areas, which are denoted as important areas. We compare the binarized images using the minimum threshold. In Fig. 5, it can be confirmed that important objects are vague or not represented in the binarized image for the conventional saliency map. Fig. 6 presents a binarized image for the proposed location-based saliency map. It can be seen that the important objects are clearer than with the conventional saliency map.
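
The binarization used for this comparison can be reproduced with a simple fixed threshold; the value 32 below is illustrative, not the threshold used in the paper.

```python
import cv2

sal = cv2.imread("saliency_map.png", cv2.IMREAD_GRAYSCALE)    # illustrative path
_, binary = cv2.threshold(sal, 32, 255, cv2.THRESH_BINARY)     # keep only bright (salient) pixels
cv2.imwrite("binarized_map.png", binary)
```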


Fig. 5. Input image (top), original saliency map (middle) and binarized results (bottom)


Fig. 6. Input image (top), proposed saliency map (middle) and binarized results (bottom)

The conventional saliency map using the Gabor filter resulted in a delay in the processing time when generating the directional feature maps for all eight scales. The proposed method can reduce the number of feature maps by filtering and normalization using the fully connected layer. Fig. 7 displays the processing times for the proposed and conventional methods, which exhibit similar results.


Fig. 7. Processing time of entire process of our method for each image resolution

Finally, the processing time using parallel processing was evaluated. Iterative filtering and normalization lead to longer computation times. Fig. 8 shows an improvement of approximately 30% in processing time when the experimental results are generated using parallel processing on the CPU.


Fig. 8. Processing time of parallel and single computing for each image resolution

5. Conclusion

In this paper, we proposed an improved saliency map that utilizes three main techniques. The first is the creation of feature maps using a fully connected layer, which is designed to express the distinct features of each feature map at different intensities. The second is the use of min/max contrast together with color: creating a feature map using contrast removes noise and small objects and produces strong object boundaries. Third, we used location-based normalization to calculate weights according to position; the closer a pixel is to the center, the bolder and stronger it appears, so objects and areas near the image border are suppressed and the focus remains on the central object and area. Finally, by implementing a new model that eliminates unnecessary calculations, computation times similar to those of the conventional method are achieved, and by processing the iterative normalization calculations in parallel, the computation time for large images is reduced.

Acknowledgement

This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIP) (No.2020-0-02219, Development of technology for irregularly shaped waste classification based on deep learning).

References

  1. Y. Yang, M. Yang, S. Huang, Y. Que, M. Ding, and J. Sun, "Multifocus image fusion based on extreme learning machine and human visual system," IEEE access, vol. 5, pp. 6989-7000, 2017. https://doi.org/10.1109/ACCESS.2017.2696119
  2. S. J. D. Lawrence, D. G. Norris, and F. P. de Lange, "Dissociable laminar profiles of concurrent bottom-up and top-down modulation in the human visual cortex," Elife, 2019.
  3. L. Isik, E. M. Meyers, J. Z. Leibo, and T. Poggio, "The dynamics of invariant object recognition in the human visual system," Journal of Neurophysiology, vol. 111, no. 1, pp. 91-102, 2014. https://doi.org/10.1152/jn.00394.2013
  4. C. S. Konen and S. Kastner, "Two hierarchically organized neural systems for object information in human visual cortex," Nature neuroscience, vol. 11, pp. 224-231, 2008. https://doi.org/10.1038/nn2036
  5. R. F. Schwarzlose, J. D. Swisher, S. Dang, and N. Kanwisher, "The distribution of category and location information across object-selective regions in human visual cortex," National Academy of Sciences, vol. 105, no. 11, pp. 4447-4452, 2008. https://doi.org/10.1073/pnas.0800431105
  6. L. Itti and C. Koch, "Learning to detect salient objects in natural scenes using visual attention," in Proc. of Image Understanding Workshop, pp. 1201-1206, 1999.
  7. Y. Wu, N. Zheng, Z. Yuan, H. Jiang, and T. Liu, "Detection of salient objects with focused attention based on spatial and temporal coherence," Chinese Science Bulletin, vol. 56, pp. 1055-1062, 2011. https://doi.org/10.1007/s11434-010-4387-1
  8. B. Schauerte and G. A. Fink, "Focusing computational visual attention in multi-modal humanrobot interaction," in Proc. of International conference on multimodal interfaces and the workshop on machine learning for multimodal interaction, vol. 6, pp. 1-8, 2010.
  9. Y. Wang, X. Zhao, X. Hu, Y. Li, and K. Huang, "Focal boundary guided salient object detection," IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2813-2824, 2019. https://doi.org/10.1109/tip.2019.2891055
  10. H. Alipour, F. Towhidkhah, S. Jafari, A. Menon, and H. Namazi, "Complexity-based analysis of the relation between fractal visual stimuli and fractal eye movements," Fluctuation and Noise Letters, vol. 18, no. 3, 2019.
  11. M. Costa, L. Bonetti, V. Vignali, A. Bichicchi, C. Lantieri, and A. Simone, "Driver's visual attention to different categories of roadside advertising signs," Applied ergonomics, vol. 78, pp. 127-136, 2019. https://doi.org/10.1016/j.apergo.2019.03.001
  12. A. W. Toga, R. L. Goo, R. Murphy, and R. C. Collins, "Neuroscience application of interactive image analysis," Optical Engineering, vol. 23, no. 3, 1984.
  13. D. D. Cox and T. Dean, "Neural networks and neuroscience-inspired computer vision," Current Biology, vol. 24, no. 18, 2014.
  14. B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ade20k dataset," International Journal of Computer Vision, vol. 127, pp. 302-321, 2019. https://doi.org/10.1007/s11263-018-1140-0
  15. H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother, "Augmented reality meets computer vision: Efficient data generation for urban driving scenes," International Journal of Computer Vision, vol. 129, pp. 961-972, 2018.
  16. T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in Proc. of the European Conference on Computer Vision (ECCV), pp. 418-434, 2018.
  17. Z. Li, N. Teng, M. Jin, and H. Lu, "Building efficient CNN architecture for offline handwritten Chinese character recognition," International Journal on Document Analysis and Recognition (IJDAR), vol. 21, no. 4, pp. 233-240, 2018. https://doi.org/10.1007/s10032-018-0311-4
  18. J. H. Cho and C. G. Park, "Additional feature CNN based automatic target recognition in SAR image," in Proc. of 2017 Fourth Asian Conference on Defence Technology(ACDT), pp. 1-4, 2017.
  19. Y. Lavinia, H. H. Vo, and A. Verma, "Fusion based deep CNN for improved large-scale image action recognition," in Proc. of 2016 IEEE International Symposium on Multimedia (ISM), pp. 609-614, 2016.
  20. S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, and M. Slaney, "CNN architectures for large-scale audio classification," in Proc. of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 131-135, 2017.
  21. T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4580-4584, 2015.
  22. P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu, "Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling," arXiv preprint arXiv:1611.06639, 2016.
  23. A. Kasagi, T. Tabaru, and H. Tamura, "Fast algorithm using summed area tables with unified layer performing convolution and average pooling," in Proc. of 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing, pp. 1-6, 2017.
  24. M. Rezaei, H. Yang, and C. Meinel, "Deep neural network with l2-norm unit for brain lesions detection," in Proc. of International Conference on Neural Information Processing, pp. 798-807, 2017.
  25. M. Kuchnik and V. Smith, "Efficient augmentation via data subsampling," arXiv preprint arXiv:1810.05222, 2018.
  26. W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks," arXiv:1612.02295, 2016.
  27. X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li, "Soft-margin softmax for deep classification," in Proc. of International Conference on Neural Information Processing, pp. 413-421, 2017.
  28. D. Marin, Z. He, P. Vajda, P. Chatterjee, S. Tsai, F. Yang, and Y. Boykov, "Efficient segmentation: Learning downsampling near semantic boundaries," in Proc. of the IEEE International Conference on Computer Vision, pp. 2131-2141, 2019.
  29. K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: the CLEAR MOT metrics," EURASIP Journal on Image and Video Processing, 2008.
  30. A. Borji, M. M. Cheng, H. Jiang, and J. Li, "Salient object detection: A benchmark," IEEE transactions on image processing, vol. 24, no. 12, pp. 5706-5722, 2015. https://doi.org/10.1109/TIP.2015.2487833
  31. P. Kapsalas, K. Rapantzikos, A. Sofou, and Y. Avrithis, "Regions of interest for accurate object detection," in Proc. of 2008 International Workshop on Content-Based Multimedia Indexing, pp. 147-154, 2008.
  32. A. J. Fredo, R. S. Abilash, R. Femi, A. Mythili, and C. S. Kumar, "Classification of damages in composite images using Zernike moments and support vector machines," Composites Part B: Engineering, vol. 168, pp. 77-86, 2019. https://doi.org/10.1016/j.compositesb.2018.12.064
  33. J. F. Canny, "A Variational Approach to Edge Detection," AAAI-83 Proceedings, pp. 54-58, 1983.
  34. M. Sharifi, M. Fathy, and M. T. Mahmoudi, "A classified and comparative study of edge detection algorithms," in Proc. of International conference on information technology: Coding and computing, pp. 117-120, 2002.