http://dx.doi.org/10.15207/JKCS.2018.9.10.045

Utilizing Local Bilingual Embeddings on Korean-English Law Data  

Choi, Soon-Young (Dept. of Computer Science and Engineering, Korea University)
Matteson, Andrew Stuart (Dept. of Computer Science and Engineering, Korea University)
Lim, Heui-Seok (Dept. of Computer Science and Engineering, Korea University)
Publication Information
Journal of the Korea Convergence Society, v.9, no.10, 2018, pp. 45-53
Abstract
Recently, bilingual word embeddings have attracted considerable attention. However, bilingual word embeddings involving Korean remain underexplored because sizable, high-quality corpora are difficult to obtain. Local embeddings that can be applied to specific domains are also relatively rare. In addition, multi-word vocabulary is problematic because translation pairs often lack one-to-one word-level correspondence. In this paper, we crawl 868,163 paragraphs from a Korean-English law corpus and propose three mapping strategies for word embedding. These strategies address the aforementioned issues, including multi-word translation, and improve translation pair quality on paragraph-aligned data. We demonstrate a twofold increase in translation pair quality over the global bilingual word embedding baseline.
Keywords
Bilingual word embedding; natural language processing; domain-specific; law domain; dictionary seed; semi-supervised training; paragraph-aligned; word similarity; skip-gram; local embedding