http://dx.doi.org/10.13088/jiis.2022.28.2.127

Nonlinear Vector Alignment Methodology for Mapping Domain-Specific Terminology into General Space  

Kim, Junwoo (Graduate School of Business IT, Kookmin University)
Yoon, Byungho (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (Graduate School of Business IT, Kookmin University)
Publication Information
Journal of Intelligence and Information Systems, Vol.28, No.2, 2022, pp. 127-146
Abstract
Recently, as word embedding has shown excellent performance in various deep learning-based natural language processing tasks, research on the advancement and application of word, sentence, and document embedding has been actively conducted. Among these directions, cross-lingual transfer, which enables semantic exchange between different languages, is growing alongside the development of embedding models. Academic interest in vector alignment is also increasing with the expectation that it can be applied to various embedding-based analyses. In particular, vector alignment is expected to support mapping between specialized domains and the general domain: vocabulary from specialized fields such as R&D, medicine, and law could be mapped into the space of a pre-trained language model learned from a huge volume of general-purpose documents, and alignment could also provide a clue for mapping vocabulary between different specialized fields. However, the linear vector alignment that has mainly been studied in academia assumes statistical linearity and therefore tends to oversimplify the vector space. This amounts to assuming that the two vector spaces are geometrically similar, which inevitably introduces distortion during the alignment process. To overcome this limitation, we propose a deep learning-based vector alignment methodology that effectively learns the nonlinearity of the data. The proposed methodology sequentially trains a skip-connected autoencoder and a regression model to align specialized word embeddings, expressed in their own space, to the general embedding space. Finally, through inference with the two trained models, the specialized vocabulary can be aligned in the general space. To verify the performance of the proposed methodology, an experiment was performed on a total of 77,578 documents in the 'health care' field among national R&D projects conducted from 2011 to 2020. As a result, the proposed methodology showed superior performance in terms of cosine similarity compared to the existing linear vector alignment.
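
The abstract describes a two-stage, nonlinear alternative to linear vector alignment (e.g., orthogonal Procrustes): a skip-connected autoencoder and a regression model are trained sequentially, and their chained inference maps specialized-domain vectors into the general embedding space. The sketch below is one plausible PyTorch reading of that pipeline, included for illustration only; the layer sizes, the choice to train the autoencoder on the general space, the placement of the skip connection, the anchor-word construction, and the evaluation helper are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

DIM, LATENT = 300, 128  # assumed embedding and latent dimensionalities


class SkipDecoder(nn.Module):
    """Decoder with a skip connection from the latent code to the output,
    so it can be reused on its own at inference time (the exact placement
    of the skip connection is an assumption, not given in the abstract)."""
    def __init__(self, dim=DIM, latent=LATENT):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(latent, latent), nn.ReLU())
        self.out = nn.Linear(latent, dim)
        self.skip = nn.Linear(latent, dim)

    def forward(self, z):
        return self.out(self.hidden(z)) + self.skip(z)


class SkipAutoencoder(nn.Module):
    """Skip-connected autoencoder trained to reconstruct general-space vectors."""
    def __init__(self, dim=DIM, latent=LATENT):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU(),
                                     nn.Linear(latent, latent))
        self.decoder = SkipDecoder(dim, latent)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


def train_pipeline(general_vecs, specialized_vecs, epochs=200, lr=1e-3):
    """Stage 1: autoencode the general-space vectors of anchor words.
    Stage 2: regress the specialized-space vectors of the same anchor
    words onto the autoencoder's latent codes. Both stages use plain MSE."""
    mse = nn.MSELoss()

    ae = SkipAutoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = ae(general_vecs)
        loss = mse(recon, general_vecs)
        opt.zero_grad(); loss.backward(); opt.step()

    reg = nn.Sequential(nn.Linear(DIM, LATENT), nn.ReLU(),
                        nn.Linear(LATENT, LATENT))
    opt_r = torch.optim.Adam(reg.parameters(), lr=lr)
    with torch.no_grad():
        _, target_z = ae(general_vecs)
    for _ in range(epochs):
        loss = mse(reg(specialized_vecs), target_z)
        opt_r.zero_grad(); loss.backward(); opt_r.step()
    return ae, reg


def align(ae, reg, specialized_vecs):
    """Inference: chain regressor -> decoder to place specialized
    vocabulary in the general embedding space."""
    with torch.no_grad():
        return ae.decoder(reg(specialized_vecs))


def procrustes_baseline(specialized_vecs, general_vecs):
    """Linear baseline: orthogonal Procrustes rotation solved via SVD."""
    u, _, vh = torch.linalg.svd(specialized_vecs.T @ general_vecs)
    return specialized_vecs @ (u @ vh)


def mean_cosine(pred, gold):
    """Evaluation metric mentioned in the abstract: average cosine similarity."""
    return nn.functional.cosine_similarity(pred, gold, dim=1).mean().item()
```

Given matched general-space and specialized-space vectors for a set of held-out anchor words, one could compare `mean_cosine(align(ae, reg, spec_test), gen_test)` against `mean_cosine(procrustes_baseline(spec_test, gen_test), gen_test)`, mirroring the cosine-similarity comparison between nonlinear and linear alignment reported in the abstract.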
Keywords
Word Embedding; Pre-trained Language Model; Vector Alignment; Label Embedding