http://dx.doi.org/10.4218/etrij.2018-0621

Image classification and captioning model considering a CAM-based disagreement loss  

Yoon, Yeo Chan (SW Content Research Laboratory, Electronics and Telecommunications Research Institute)
Park, So Young (Department of Game Design and Development, Sangmyung University)
Park, Soo Myoung (SW Content Research Laboratory, Electronics and Telecommunications Research Institute)
Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
Publication Information
ETRI Journal / v.42, no.1, 2020, pp. 67-77
Abstract
Image captioning has received significant interest in recent years, and notable results have been achieved. Most previous approaches have focused on generating visual descriptions from images, whereas only a few have exploited visual descriptions for image classification. This study demonstrates that good performance can be achieved for both description generation and image classification through end-to-end joint learning with a loss function that encourages the two tasks to reach a consensus. Given images and visual descriptions, the proposed model learns a multimodal intermediate embedding that represents both the textual and visual characteristics of an object, and sharing this embedding improves performance on both tasks. Through a novel loss function based on class activation mapping, which localizes the discriminative image regions of a model, a higher score is obtained when the captioning and classification models reach a consensus on the key parts of the object. Using the proposed model, we achieve substantially improved performance on both tasks on the Caltech-UCSD Birds and Oxford Flowers datasets.
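
The abstract's consensus idea can be illustrated with a short sketch. The following is a minimal, illustrative PyTorch-style implementation, not the authors' code: it assumes the classification branch exposes a class activation map (cls_cam) and the captioning decoder exposes a spatial attention map (cap_attn) over the same H x W grid, and it uses a symmetric KL divergence as one possible disagreement measure. All tensor names, the padding-token assumption, and the weight lam are hypothetical.

    # Sketch of a joint loss with a CAM-based disagreement term (illustrative only).
    import torch
    import torch.nn.functional as F

    def cam_disagreement_loss(cls_cam: torch.Tensor, cap_attn: torch.Tensor) -> torch.Tensor:
        """Penalize disagreement between the classifier's CAM and the captioner's
        attention map, both shaped [B, H, W]."""
        b = cls_cam.size(0)
        # Normalize each map to a spatial probability distribution.
        p = F.softmax(cls_cam.view(b, -1), dim=1)
        q = F.softmax(cap_attn.view(b, -1), dim=1)
        # Symmetric KL divergence as one possible disagreement measure.
        kl_pq = F.kl_div(q.log(), p, reduction="batchmean")  # KL(p || q)
        kl_qp = F.kl_div(p.log(), q, reduction="batchmean")  # KL(q || p)
        return 0.5 * (kl_pq + kl_qp)

    def joint_loss(class_logits, class_labels, caption_logits, caption_tokens,
                   cls_cam, cap_attn, lam: float = 0.1):
        """Classification and captioning cross-entropy plus the disagreement term."""
        loss_cls = F.cross_entropy(class_logits, class_labels)
        loss_cap = F.cross_entropy(
            caption_logits.reshape(-1, caption_logits.size(-1)),
            caption_tokens.reshape(-1),
            ignore_index=0,  # assumes token id 0 is padding
        )
        return loss_cls + loss_cap + lam * cam_disagreement_loss(cls_cam, cap_attn)

Minimizing the disagreement term pushes the two branches to attend to the same discriminative regions, which is the consensus behavior the abstract describes; the exact form of the loss in the paper may differ.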
Keywords
deep learning; image captioning; image classification