Nonlinear Vector Alignment Methodology for Mapping Domain-Specific Terminology into General Space

  • Kim, Junwoo (Graduate School of Business IT, Kookmin University) ;
  • Yoon, Byungho (Graduate School of Business IT, Kookmin University) ;
  • Kim, Namgyu (Graduate School of Business IT, Kookmin University)
  • Received : 2022.05.24
  • Accepted : 2022.06.19
  • Published : 2022.06.30

Abstract

Recently, as word embedding has shown excellent performance in various deep learning-based natural language processing tasks, research on the advancement and application of word, sentence, and document embeddings has been actively conducted. Among these directions, cross-lingual transfer, which enables semantic exchange between different languages, is growing alongside the development of embedding models. Academic interest in vector alignment, the core technique behind such transfer, is also growing with the expectation that it can be applied to various embedding-based analyses. In particular, vector alignment is expected to enable mapping between specialized and general domains: mapping the vocabulary of specialized fields such as R&D, medicine, and law into the space of a pre-trained language model trained on a huge volume of general-purpose documents, or providing a clue for mapping vocabulary between different specialized fields. However, the linear vector alignment that has mainly been studied in academia assumes statistical linearity and therefore tends to oversimplify the vector space. It essentially treats inherently different vector spaces as geometrically similar, a limitation that inevitably causes distortion during alignment. To overcome this limitation, we propose a deep learning-based vector alignment methodology that effectively learns the nonlinearity of the data. The proposed methodology consists of the sequential training of a skip-connected autoencoder and a regression model that align specialized word embeddings, each expressed in its own space, to the general embedding space; through inference with the two trained models, the specialized vocabulary can then be aligned in the general space. To verify the performance of the proposed methodology, an experiment was performed on a total of 77,578 documents in the 'health care' field among national R&D projects conducted from 2011 to 2020. The results confirmed that the proposed methodology outperforms existing linear vector alignment in terms of cosine similarity.
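
The two-stage design summarized above (a skip-connected autoencoder over the specialized embedding space, followed by a regression model into the general space, with inference chaining the two) can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes PyTorch, 300-dimensional embeddings, and paired specialized/general vectors for a shared anchor vocabulary, and all layer sizes, training settings, and the use of mean-squared error are illustrative assumptions. A simple orthogonal Procrustes mapping is included only to stand in for the linear baseline the abstract compares against.

```python
# Minimal sketch of the two-stage nonlinear alignment idea (not the authors' code).
# Assumptions: PyTorch, 300-d embeddings, paired anchor vectors; all hyperparameters illustrative.
import torch
import torch.nn as nn

DIM = 300  # assumed embedding dimensionality


class SkipAutoencoder(nn.Module):
    """Autoencoder with a skip connection from the input to the decoder output."""
    def __init__(self, dim=DIM, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x)) + x  # skip connection


class Regressor(nn.Module):
    """Nonlinear regression from the reconstructed specialized space to the general space."""
    def __init__(self, dim=DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


def train_alignment(spec_vecs, gen_vecs, epochs=200, lr=1e-3):
    """Sequential training: (1) autoencoder on specialized vectors, (2) regressor onto anchor pairs."""
    ae, reg = SkipAutoencoder(), Regressor()
    mse = nn.MSELoss()

    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):  # stage 1: reconstruct the specialized space
        opt.zero_grad()
        mse(ae(spec_vecs), spec_vecs).backward()
        opt.step()

    with torch.no_grad():  # stage 2: freeze the autoencoder, fit the regressor
        recon = ae(spec_vecs)
    opt = torch.optim.Adam(reg.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        mse(reg(recon), gen_vecs).backward()
        opt.step()
    return ae, reg


def align(ae, reg, spec_vecs):
    """Inference: chain the two trained models to place specialized vectors in the general space."""
    with torch.no_grad():
        return reg(ae(spec_vecs))


def procrustes_baseline(spec_vecs, gen_vecs):
    """Linear baseline: orthogonal Procrustes mapping W minimizing ||spec @ W - gen||."""
    u, _, vt = torch.linalg.svd(spec_vecs.T @ gen_vecs)
    return spec_vecs @ (u @ vt)


if __name__ == "__main__":
    # Random stand-ins for paired specialized/general anchor embeddings.
    spec, gen = torch.randn(1000, DIM), torch.randn(1000, DIM)
    ae, reg = train_alignment(spec, gen)
    for name, mapped in [("nonlinear", align(ae, reg, spec)), ("procrustes", procrustes_baseline(spec, gen))]:
        cos = nn.functional.cosine_similarity(mapped, gen, dim=1).mean().item()
        print(f"{name}: mean cosine similarity to target = {cos:.3f}")
```

In this sketch the regressor is trained on autoencoder reconstructions of the anchor vectors; at inference, any specialized term is pushed through both trained models, and cosine similarity against known general-space vectors gives the kind of comparison with a linear baseline that the abstract reports.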
