http://dx.doi.org/10.13088/jiis.2020.26.2.079

Deep Learning-based Professional Image Interpretation Using Expertise Transplant  

Kim, Taejin (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (School of Management Information Systems, Kookmin University)
Publication Information
Journal of Intelligence and Information Systems / v.26, no.2, 2020, pp. 79-104
Abstract
Recently, as deep learning has attracted attention, its use is being considered as a way to solve problems in various fields. In particular, deep learning is known to perform well when applied to unstructured data such as text, sound, and images, and many studies have proven its effectiveness. Owing to the remarkable development of deep learning for text and images, interest in image captioning technology and its applications is rapidly increasing. Image captioning is a technique that automatically generates relevant captions for a given image by handling both image comprehension and text generation simultaneously. Despite its high entry barrier, which requires analysts to process both image and text data, image captioning has established itself as one of the key fields of AI research owing to its broad applicability. Many studies have also been conducted to improve the performance of image captioning in various respects. Recent studies attempt to create advanced captions that not only describe an image accurately but also convey the information contained in the image in a more sophisticated manner. Despite these efforts, it is difficult to find research that interprets images from the perspective of domain experts rather than that of the general public. Even for the same image, the parts of interest may differ according to the professional field of the viewer. Moreover, the way of interpreting and expressing the image also differs with the level of expertise. The public tends to recognize an image from a holistic and general perspective, that is, by identifying the image's constituent objects and their relationships. By contrast, domain experts tend to recognize an image by focusing on the specific elements needed to interpret it in light of their expertise. This implies that the meaningful parts of an image differ depending on the viewer's perspective, and image captioning should reflect this phenomenon. Therefore, in this study, we propose a method for generating domain-specialized captions for an image by utilizing the expertise of experts in the corresponding domain. Specifically, after pre-training on a large amount of general data, the domain expertise is transplanted through transfer learning with a small amount of expertise data. However, naive application of transfer learning to expertise data may introduce another type of problem. Learning simultaneously from captions with various characteristics may cause a so-called 'inter-observation interference' problem, which makes it difficult to learn each characteristic's point of view purely. When learning from a vast amount of data, most of this interference cancels out and has little impact on the results. By contrast, when fine-tuning on a small amount of data, the impact of such interference can be relatively large. To solve this problem, we therefore propose a novel 'Character-Independent Transfer-learning' that performs transfer learning independently for each characteristic.
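The abstract does not give implementation details, but the core idea can be sketched as follows: a caption decoder pre-trained on general data is cloned once per caption characteristic, and each clone is fine-tuned only on the small set of expert captions exhibiting that characteristic, so no clone is exposed to interference from the other characteristics. The architecture, class and function names (CaptionDecoder, fine_tune), dimensions, and toy data below are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of characteristic-independent transfer learning for
# image captioning. All names, sizes, and data here are hypothetical.
import copy
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Toy LSTM caption decoder conditioned on an image feature vector."""
    def __init__(self, vocab_size=1000, feat_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.init_h = nn.Linear(feat_dim, hid_dim)  # image feature -> initial LSTM state
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, captions):
        h0 = self.init_h(img_feat).unsqueeze(0)     # (1, B, H)
        c0 = torch.zeros_like(h0)
        x = self.embed(captions)                    # (B, T, H)
        y, _ = self.lstm(x, (h0, c0))
        return self.out(y)                          # (B, T, V)

def fine_tune(decoder, pairs, epochs=3, lr=1e-4):
    """Fine-tune one decoder clone on (image_feature, caption) pairs
    of a single characteristic."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for img_feat, cap in pairs:
            logits = decoder(img_feat, cap[:, :-1])         # predict next token
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           cap[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return decoder

# Pre-training on a large general corpus (e.g., MSCOCO) would happen here;
# a randomly initialized model stands in for the pre-trained one.
pretrained = CaptionDecoder()

# Hypothetical expert data, grouped by caption characteristic so that each
# clone sees only one characteristic and avoids inter-observation interference.
expert_data = {
    "object_focus":   [(torch.randn(4, 256), torch.randint(0, 1000, (4, 12)))],
    "interpretation": [(torch.randn(4, 256), torch.randint(0, 1000, (4, 12)))],
}

specialists = {
    name: fine_tune(copy.deepcopy(pretrained), pairs)
    for name, pairs in expert_data.items()
}
```

Cloning the pre-trained decoder per characteristic, rather than fine-tuning one shared decoder on all expert captions at once, is what keeps each characteristic's learning signal pure when the expertise dataset is small.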
In order to confirm the feasibility of the proposed methodology, we performed experiments using a model pre-trained on the MSCOCO dataset, which comprises 120,000 images and about 600,000 general captions. Additionally, with the advice of an art therapist, about 300 'image / expertise caption' pairs were created and used for the expertise transplantation experiments. The experiments confirmed that the proposed methodology generates captions from the perspective of the transplanted expertise, whereas captions generated by learning on general data alone contain much content irrelevant to expert interpretation. In summary, this paper proposes a novel approach to specialized image interpretation, presenting a transfer-learning-based method for generating captions specialized in a specific domain. In the future, we expect that applying the proposed methodology to expertise transplantation in various fields will stimulate active research on alleviating the scarcity of expertise data and on improving image captioning performance.
Keywords
Deep Learning; Expertise Transplant; Transfer-Learning; Image Captioning; Artificial Intelligence;