Utilizing Local Bilingual Embeddings on Korean-English Law Data

  • Received: 2018.08.07
  • Reviewed: 2018.10.20
  • Published: 2018.10.28

Abstract

Bilingual word embedding has recently attracted considerable attention. However, bilingual word embedding on parallel-aligned corpora that pair Korean with another language is not actively pursued, because a sizable, high-quality corpus is difficult to obtain. Local bilingual word embeddings, which can be applied to a specific domain, are rarer still. In addition, translation pairs often lack one-to-one correspondence in the number of words, which makes multi-word vocabulary problematic. In this paper, we crawl 868,163 paragraphs of Korean law in Korean and English, train local word embeddings on them, and propose three mapping strategies. These strategies address the irregular correspondence problem, including multi-word translation, and improve translation pair quality on paragraph-aligned data. Compared with the global bilingual word embedding baseline, we demonstrate a twofold increase in translation pair quality.
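As a rough illustration of what a translation pair means in a shared bilingual embedding space, the sketch below ranks English words by cosine similarity to a Korean query word. It is not the method of the paper: the vectors, the `EMBEDDINGS` table, and the `translation_candidates` helper are hypothetical, and the three mapping strategies mentioned in the abstract are not reproduced here.

```python
import math

# Toy vectors in a hypothetical shared Korean-English embedding space.
# The values are made up for illustration; they are not taken from the paper.
EMBEDDINGS = {
    ("ko", "법률"): [0.9, 0.1, 0.0],
    ("ko", "계약"): [0.1, 0.9, 0.1],
    ("en", "law"): [0.8, 0.2, 0.1],
    ("en", "contract"): [0.2, 0.8, 0.0],
    ("en", "court"): [0.5, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_candidates(word, source_lang="ko", target_lang="en", top_k=2):
    """Return the top_k target-language words closest to `word`, best first."""
    query = EMBEDDINGS[(source_lang, word)]
    scored = [
        (target, cosine(query, vector))
        for (lang, target), vector in EMBEDDINGS.items()
        if lang == target_lang
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    for korean_word in ("법률", "계약"):
        print(korean_word, translation_candidates(korean_word))
```

In a real setting the vectors would come from embeddings trained on the crawled corpus, and a multi-word term would first need to be represented as a single unit before this kind of lookup applies.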
