ASPPMVSNet: A high-receptive-field multiview stereo network for dense three-dimensional reconstruction

  • Saleh, Saeed (Department of Computer Science and Engineering, Sogang University)
  • Sungjun, Lee (Immersive Media Section, Electronics and Telecommunications Research Institute)
  • Yongju, Cho (Department of Computer Science and Engineering, Sogang University)
  • Unsang, Park (Department of Computer Science and Engineering, Sogang University)
  • Received: 2021.08.31
  • Accepted: 2022.03.29
  • Published: 2022.12.10


Learning-based multiview stereo (MVS) methods for three-dimensional (3D) reconstruction generally use 3D volumes for depth inference. The quality of the reconstructed depth maps and the corresponding point clouds depends directly on the spatial resolution of the 3D volume. Because memory limits the resolution at which this volume can be encoded, these methods produce point clouds with sparse local regions. Here, we apply the atrous spatial pyramid pooling (ASPP) module in MVS methods to obtain dense feature maps that carry multiscale, long-range contextual information through large receptive fields. For a 3D volume with the same spatial resolution as that used in existing MVS methods, the richer feature maps from the ASPP module can produce dense point clouds without a high memory footprint. Furthermore, we propose a 3D loss for training MVS networks, which improves the predicted depth values by 24.44%. The ASPP module provides state-of-the-art qualitative results by constructing relatively dense point clouds, improving on the DTU MVS dataset benchmark results of previous MVS methods by 2.25%.
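
The way parallel atrous (dilated) convolutions widen the receptive field without adding weights can be illustrated with a minimal one-dimensional sketch. It is not the network described above: the dilation rates (1, 6, 12, 18), the kernel size of 3, and the helper names `effective_kernel_size`, `atrous_conv1d`, and `aspp_1d` are assumptions chosen for illustration, and the real module operates on 2D feature maps with learned weights.

```python
def effective_kernel_size(kernel_size, rate):
    """Span, in samples, covered by one atrous kernel with the given dilation rate."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, weights, rate):
    """Same-length 1D atrous convolution with zero padding (odd kernel size assumed)."""
    half = (len(weights) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(weights):
            # Taps are spaced `rate` samples apart instead of being adjacent.
            idx = i + (j - half) * rate
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def aspp_1d(signal, weights, rates):
    """Run one atrous branch per rate in parallel; returns one output channel per rate."""
    return [atrous_conv1d(signal, weights, r) for r in rates]


# Hypothetical rates for illustration; the rates used in the paper are not given here.
RATES = (1, 6, 12, 18)
spans = [effective_kernel_size(3, r) for r in RATES]

# A single impulse shows which positions each branch can "see".
impulse = [0.0] * 41
impulse[20] = 1.0
branches = aspp_1d(impulse, [1.0, 1.0, 1.0], RATES)
```

Each branch uses the same three weights, yet the branch with the largest rate covers a much wider span of the input. This is the sense in which such a module gathers multiscale, long-range context at the cost of a small kernel.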



This work was supported by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government (22ZH1210, Fundamental media contents technologies for hyper-realistic media space).

