http://dx.doi.org/10.13088/jiis.2022.28.2.127

Nonlinear Vector Alignment Methodology for Mapping Domain-Specific Terminology into General Space  

Kim, Junwoo (Graduate School of Business IT, Kookmin University)
Yoon, Byungho (Graduate School of Business IT, Kookmin University)
Kim, Namgyu (Graduate School of Business IT, Kookmin University)
Publication Information
Journal of Intelligence and Information Systems, Vol.28, No.2, 2022, pp. 127-146
Abstract
Recently, as word embedding has shown excellent performance in various deep learning-based natural language processing tasks, research on the advancement and application of word, sentence, and document embedding has been actively conducted. Among these directions, cross-lingual transfer, which enables semantic exchange between different languages, is growing alongside the development of embedding models. Academic interest in vector alignment is also increasing with the expectation that it can be applied to various embedding-based analyses. In particular, vector alignment is expected to support mapping between specialized domains and the general domain: vocabulary from specialized fields such as R&D, medicine, and law could be mapped into the space of a pre-trained language model learned from a huge volume of general-purpose documents, and alignment could also provide a clue for mapping vocabulary between different specialized fields. However, the linear vector alignment that has mainly been studied in academia assumes statistical linearity and therefore tends to oversimplify the vector space. This amounts to assuming that the two vector spaces are geometrically similar, which inevitably introduces distortion during the alignment process. To overcome this limitation, we propose a deep learning-based vector alignment methodology that effectively learns the nonlinearity of the data. The proposed methodology sequentially trains a skip-connected autoencoder and a regression model to align specialized word embeddings, expressed in their own space, to the general embedding space. Finally, through inference with the two trained models, the specialized vocabulary can be aligned in the general space. To verify the performance of the proposed methodology, an experiment was performed on a total of 77,578 documents in the 'health care' field among national R&D projects conducted from 2011 to 2020. As a result, the proposed methodology showed superior performance in terms of cosine similarity compared to the existing linear vector alignment.
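
The abstract describes a two-stage, nonlinear alternative to linear vector alignment (e.g., orthogonal Procrustes): a skip-connected autoencoder and a regression model are trained sequentially, and their chained inference maps specialized-domain vectors into the general embedding space. The sketch below is one plausible PyTorch reading of that pipeline, included for illustration only; the layer sizes, the choice to train the autoencoder on the general space, the placement of the skip connection, the anchor-word construction, and the evaluation helper are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

DIM, LATENT = 300, 128  # assumed embedding and latent dimensionalities


class SkipDecoder(nn.Module):
    """Decoder with a skip connection from the latent code to the output,
    so it can be reused on its own at inference time (the exact placement
    of the skip connection is an assumption, not given in the abstract)."""
    def __init__(self, dim=DIM, latent=LATENT):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(latent, latent), nn.ReLU())
        self.out = nn.Linear(latent, dim)
        self.skip = nn.Linear(latent, dim)

    def forward(self, z):
        return self.out(self.hidden(z)) + self.skip(z)


class SkipAutoencoder(nn.Module):
    """Skip-connected autoencoder trained to reconstruct general-space vectors."""
    def __init__(self, dim=DIM, latent=LATENT):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU(),
                                     nn.Linear(latent, latent))
        self.decoder = SkipDecoder(dim, latent)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


def train_pipeline(general_vecs, specialized_vecs, epochs=200, lr=1e-3):
    """Stage 1: autoencode the general-space vectors of anchor words.
    Stage 2: regress the specialized-space vectors of the same anchor
    words onto the autoencoder's latent codes. Both stages use plain MSE."""
    mse = nn.MSELoss()

    ae = SkipAutoencoder()
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = ae(general_vecs)
        loss = mse(recon, general_vecs)
        opt.zero_grad(); loss.backward(); opt.step()

    reg = nn.Sequential(nn.Linear(DIM, LATENT), nn.ReLU(),
                        nn.Linear(LATENT, LATENT))
    opt_r = torch.optim.Adam(reg.parameters(), lr=lr)
    with torch.no_grad():
        _, target_z = ae(general_vecs)
    for _ in range(epochs):
        loss = mse(reg(specialized_vecs), target_z)
        opt_r.zero_grad(); loss.backward(); opt_r.step()
    return ae, reg


def align(ae, reg, specialized_vecs):
    """Inference: chain regressor -> decoder to place specialized
    vocabulary in the general embedding space."""
    with torch.no_grad():
        return ae.decoder(reg(specialized_vecs))


def procrustes_baseline(specialized_vecs, general_vecs):
    """Linear baseline: orthogonal Procrustes rotation solved via SVD."""
    u, _, vh = torch.linalg.svd(specialized_vecs.T @ general_vecs)
    return specialized_vecs @ (u @ vh)


def mean_cosine(pred, gold):
    """Evaluation metric mentioned in the abstract: average cosine similarity."""
    return nn.functional.cosine_similarity(pred, gold, dim=1).mean().item()
```

Given matched general-space and specialized-space vectors for a set of held-out anchor words, one could compare `mean_cosine(align(ae, reg, spec_test), gen_test)` against `mean_cosine(procrustes_baseline(spec_test, gen_test), gen_test)`, mirroring the cosine-similarity comparison between nonlinear and linear alignment reported in the abstract.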
Keywords
Word Embedding; Pre-trained Language Model; Vector Alignment; Label Embedding