http://dx.doi.org/10.15701/kcgs.2017.23.3.65

Natural Photography Generation with Text Guidance from Spherical Panorama Image  

Kim, Beomseok (POSTECH)
Jung, Jinwoong (POSTECH)
Hong, Eunbin (POSTECH)
Cho, Sunghyun (DGIST)
Lee, Seungyong (POSTECH)
Abstract
A 360-degree image carries information in all directions and thus often contains an overwhelming amount of content. Moreover, to examine a 360-degree image on a 2D display, a user must either click and drag the image with a mouse or project it onto a 2D panorama, which inevitably introduces severe distortions. Consequently, inspecting a 360-degree image and finding an object of interest in it can be tedious. To resolve this issue, this paper proposes a method that, given a 360-degree image and a user's description in a natural-language sentence, finds the region of interest that best matches the description and produces a natural-looking 2D image of it. Our method also considers photo composition so that the resulting image is aesthetically pleasing. The method first converts the 360-degree image to a 2D cubemap. As objects in a 360-degree image may appear distorted or split into multiple pieces in a typical cubemap, causing their detection to fail, we introduce a modified cubemap. The method then applies a Long Short-Term Memory (LSTM) network based object retrieval method to find the region of interest matching the given sentence. Finally, it produces an image that contains the detected region and has an aesthetically pleasing composition.
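The conversion step described above maps an equirectangular 360-degree image onto the six faces of a cube. The following is a minimal sketch of that standard projection, not the paper's modified cubemap: the face orientation conventions, nearest-neighbor sampling, and the function name are illustrative assumptions.

```python
import numpy as np

def equirect_to_cube_face(equi, face, size):
    """Sample one cubemap face from an equirectangular 360-degree image.

    equi: H x W (or H x W x C) array in equirectangular projection.
    face: one of '+x', '-x', '+y', '-y', '+z', '-z'.
    size: output face resolution (size x size).
    Nearest-neighbor sampling; a sketch, not the paper's implementation.
    """
    h, w = equi.shape[:2]
    # Pixel-center coordinates in [-1, 1] on the face plane.
    a = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(a, a)
    one = np.ones_like(u)
    # World-space direction of each face pixel (y up, +z forward; a common
    # convention, chosen here for illustration).
    dirs = {
        '+x': (one, -v, -u), '-x': (-one, -v, u),
        '+y': (u, one, v),   '-y': (u, -one, -v),
        '+z': (u, -v, one),  '-z': (-u, -v, -one),
    }
    x, y, z = dirs[face]
    lon = np.arctan2(x, z)                           # longitude in [-pi, pi]
    lat = np.arcsin(y / np.sqrt(x*x + y*y + z*z))    # latitude in [-pi/2, pi/2]
    # Map spherical coordinates to equirectangular pixel indices.
    px = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    py = ((0.5 - lat / np.pi) * h).astype(int).clip(0, h - 1)
    return equi[py, px]
```

In a typical cubemap produced this way, an object straddling a face boundary is split across two faces, which is exactly the failure case the modified cubemap is designed to avoid.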
Keywords
360 image; deep learning; natural language processing; LSTM; photo composition;
References
1 R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
2 R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
3 J. Dai, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
4 J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
5 R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell, "Natural language object retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4555-4564.
6 J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
7 L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, "Optimizing photo composition," in Computer Graphics Forum, vol. 29, no. 2. Wiley Online Library, 2010, pp. 469-478.
8 C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision. Springer, 2014, pp. 391-405.
9 S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
10 O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
11 K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
12 K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048-2057.
13 J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
14 S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," arXiv preprint arXiv:1702.05729, 2017.
15 M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286-3293.
16 O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
17 T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740-755.
18 S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, "ReferItGame: Referring to objects in photographs of natural scenes," in EMNLP, 2014, pp. 787-798.
19 J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, "Recognizing scene viewpoint using panoramic place representation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2695-2702.