http://dx.doi.org/10.15701/kcgs.2017.23.3.65

Natural Photography Generation with Text Guidance from Spherical Panorama Image  

Kim, Beomseok (POSTECH)
Jung, Jinwoong (POSTECH)
Hong, Eunbin (POSTECH)
Cho, Sunghyun (DGIST)
Lee, Seungyong (POSTECH)
Abstract
A 360-degree image carries information in all directions and thus often contains an overwhelming amount of content. Moreover, to examine a 360-degree image on a 2D display, a user must either click and drag the image with a mouse or project it onto a 2D panorama, which inevitably introduces severe distortions. Consequently, inspecting a 360-degree image and finding an object of interest in it can be tedious. To resolve this issue, this paper proposes a method that, given a 360-degree image and a user's description in a natural-language sentence, finds the region of interest that best matches the description and produces a natural-looking 2D image of it. Our method also considers photo composition so that the resulting image is aesthetically pleasing. The method first converts the 360-degree image to a 2D cubemap. As objects in a 360-degree image may appear distorted or split into multiple pieces in a typical cubemap, causing their detection to fail, we introduce a modified cubemap. The method then applies a Long Short-Term Memory (LSTM) network based object retrieval method to find the region of interest matching the given sentence. Finally, it produces an image that contains the detected region and has an aesthetically pleasing composition.
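The conversion step described above maps an equirectangular 360-degree image onto the six faces of a cube. The following is a minimal sketch of that standard projection, not the paper's modified cubemap: the face orientation conventions, nearest-neighbor sampling, and the function name are illustrative assumptions.

```python
import numpy as np

def equirect_to_cube_face(equi, face, size):
    """Sample one cubemap face from an equirectangular 360-degree image.

    equi: H x W (or H x W x C) array in equirectangular projection.
    face: one of '+x', '-x', '+y', '-y', '+z', '-z'.
    size: output face resolution (size x size).
    Nearest-neighbor sampling; a sketch, not the paper's implementation.
    """
    h, w = equi.shape[:2]
    # Pixel-center coordinates in [-1, 1] on the face plane.
    a = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(a, a)
    one = np.ones_like(u)
    # World-space direction of each face pixel (y up, +z forward; a common
    # convention, chosen here for illustration).
    dirs = {
        '+x': (one, -v, -u), '-x': (-one, -v, u),
        '+y': (u, one, v),   '-y': (u, -one, -v),
        '+z': (u, -v, one),  '-z': (-u, -v, -one),
    }
    x, y, z = dirs[face]
    lon = np.arctan2(x, z)                           # longitude in [-pi, pi]
    lat = np.arcsin(y / np.sqrt(x*x + y*y + z*z))    # latitude in [-pi/2, pi/2]
    # Map spherical coordinates to equirectangular pixel indices.
    px = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    py = ((0.5 - lat / np.pi) * h).astype(int).clip(0, h - 1)
    return equi[py, px]
```

In a typical cubemap produced this way, an object straddling a face boundary is split across two faces, which is exactly the failure case the modified cubemap is designed to avoid.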
Keywords
360 image; deep learning; natural language processing; LSTM; photo composition;
References
1 R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
2 R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
3 J. Dai, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint arXiv:1605.06409, 2016.
4 J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
5 R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell, "Natural language object retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4555-4564.
6 J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
7 L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, "Optimizing photo composition," in Computer Graphics Forum, vol. 29, no. 2. Wiley Online Library, 2010, pp. 469-478.
8 C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision. Springer, 2014, pp. 391-405.
9 S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
10 O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
11 K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
12 K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048-2057.
13 J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
14 S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," arXiv preprint arXiv:1702.05729, 2017.
15 M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286-3293.
16 O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
17 T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740-755.
18 S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, "ReferItGame: Referring to objects in photographs of natural scenes," in EMNLP, 2014, pp. 787-798.
19 J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, "Recognizing scene viewpoint using panoramic place representation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2695-2702.