CRFNet: Context ReFinement Network used for semantic segmentation

Taeghyun An, Jungyu Kang, Dooseop Choi, and Kyoung-Wook Min

doi:10.4218/etrij.2023-0017

ETRI Journal, Volume 45, Issue 5, pages 822-835, 2023
pISSN 1225-6463, eISSN 2233-7326

Electronics and Telecommunications Research Institute (ETRI)

Affiliation (all authors): Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute

Received: 2023.01.18
Accepted: 2023.08.09
Published: 2023.10.20

Abstract

Recent semantic segmentation frameworks usually combine low-level and high-level context information to improve performance, and some also exploit post-level context information. In this study, we present the Context ReFinement Network (CRFNet) and a training method for it that improve the semantic predictions of encoder-decoder segmentation models. Our approach builds on postprocessing methods, such as Markov and conditional random fields, that directly model the relationships between spatially neighboring pixels of a label map. CRFNet comprises two modules: a refiner, which refines the context information in the output features of a conventional semantic segmentation model, and a combiner, which combines the refined features with intermediate features from the model's decoding process to produce the final output. To train CRFNet to refine semantic predictions more accurately, we propose a sequential training scheme. Using various backbone networks (ENet, ERFNet, and HyperSeg), we extensively evaluate our model on three large-scale, real-world datasets to demonstrate the effectiveness of our approach.
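The abstract positions CRFNet relative to classical label-map postprocessing with Markov or conditional random fields. For readers unfamiliar with that baseline, the sketch below shows one textbook instance: iterated conditional modes (ICM) on a 4-connected pixel grid with a Potts smoothness term. It is not CRFNet, whose refiner and combiner are learned modules. The function names, the Potts model, the weight `beta`, and the use of negative log-probabilities as unary costs are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List

Unary = List[List[List[float]]]  # unary[y][x][c]: cost of assigning class c to pixel (y, x)
Labels = List[List[int]]

# Hypothetical choice: 4-connected neighborhood (up, down, left, right).
NEIGHBOR_OFFSETS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def argmax_labels(unary: Unary) -> Labels:
    """Per-pixel minimum-cost label, i.e., the raw network prediction."""
    return [[min(range(len(px)), key=px.__getitem__) for px in row] for row in unary]


def potts_energy(unary: Unary, labels: Labels, beta: float) -> float:
    """E(L) = sum_p U_p(L_p) + beta * (number of neighbor pairs with different labels)."""
    h, w = len(labels), len(labels[0])
    energy = sum(unary[y][x][labels[y][x]] for y in range(h) for x in range(w))
    for y in range(h):
        for x in range(w):
            # Count each neighbor pair once by looking only right and down.
            if x + 1 < w and labels[y][x] != labels[y][x + 1]:
                energy += beta
            if y + 1 < h and labels[y][x] != labels[y + 1][x]:
                energy += beta
    return energy


def icm_potts(unary: Unary, beta: float = 1.0, max_iters: int = 10) -> Labels:
    """ICM: greedily relabel each pixel to minimize its local (unary + pairwise) cost."""
    h, w = len(unary), len(unary[0])
    n_classes = len(unary[0][0])
    labels = argmax_labels(unary)  # start from the network's own prediction
    for _ in range(max_iters):
        changed = False
        for y in range(h):
            for x in range(w):
                neighbor_labels = [
                    labels[y + dy][x + dx]
                    for dy, dx in NEIGHBOR_OFFSETS
                    if 0 <= y + dy < h and 0 <= x + dx < w
                ]
                # Local cost of each class: unary term plus one penalty per disagreeing neighbor.
                costs = [
                    unary[y][x][c] + beta * sum(n != c for n in neighbor_labels)
                    for c in range(n_classes)
                ]
                best = min(range(n_classes), key=costs.__getitem__)
                # Update only on a strict decrease, so the total energy strictly drops.
                if costs[best] < costs[labels[y][x]]:
                    labels[y][x] = best
                    changed = True
        if not changed:  # reached a local minimum of the energy
            break
    return labels


# Usage: a 3x3 map of two classes with one isolated, weakly confident "class 1" pixel.
probs = [[[0.8, 0.2] for _ in range(3)] for _ in range(3)]
probs[1][1] = [0.4, 0.6]
unary = [[[-math.log(p) for p in px] for px in row] for row in probs]
print(argmax_labels(unary))        # [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(icm_potts(unary, beta=1.0))  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

ICM is greedy and reaches only a local minimum of a hand-designed energy that sees nothing but the label map. CRFNet, by contrast, uses learned refiner and combiner modules and also draws on the decoder's intermediate features.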

Acknowledgement

This research work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00002, Development of standard SW platform-based autonomous driving technology to solve social problems of mobility and safety for public transport-marginalized communities, contribution rate: 50%) and by an IITP grant funded by the Korean government (MSIT) (No. 2021-0-00891, Development of AI Service Integrated Framework for Autonomous Driving, contribution rate: 50%).
