[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5859/KAIS.2021.30.2.57

A Method for Learning the Specialized Meaning of Terminology through Mixed Word Embedding

Kim, Byung Tae (Graduate School of Business IT, Kookmin University)
Kim, Nam Gyu (Graduate School of Business IT, Kookmin University)

Publication Information

The Journal of Information Systems / v.30, no.2, 2021, pp. 57-78

Abstract

Purpose: This study aims to produce word embeddings that reflect the characteristics of both specialized and general documents. It also proposes a method for quantitatively measuring how strongly the characteristics of each individual domain are reflected when disparate documents are combined as training material for natural language processing.

Approach: Korean Supreme Court precedents and Korean Wikipedia were selected as the specialized and general documents, respectively. For the unique words observed only in the specialized documents, we extracted each word's most similar word and the corresponding similarity, and then observed how those values changed as general documents were added to the embedding process.

Findings: According to the measures proposed in this study, the specificity of the specialized documents was diluted when they were combined with general documents, and the degree of dilution may be positively correlated with the size of the general documents.
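The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact measure, so everything here is an assumption made for illustration: the function names (`cosine`, `nearest`, `specificity_shift`), the use of cosine similarity, and the toy English vectors standing in for embeddings trained on the specialized corpus alone and on the mixed corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, emb):
    """Return (neighbor, similarity) for the word most similar to `word`."""
    best = max((w for w in emb if w != word),
               key=lambda w: cosine(emb[word], emb[w]))
    return best, cosine(emb[word], emb[best])

def specificity_shift(unique_words, emb_special, emb_mixed):
    """Compare each term's nearest neighbor before and after mixing.

    `unique_words` are terms that occur only in the specialized corpus
    (hypothetical input; how they are selected is not shown here).
    """
    rows = []
    for w in unique_words:
        before, sim_before = nearest(w, emb_special)
        after, sim_after = nearest(w, emb_mixed)
        rows.append({
            "word": w,
            "before": before,
            "after": after,
            "kept": before == after,          # did the top neighbor survive mixing?
            "delta": sim_after - sim_before,  # change in top similarity
        })
    return rows

# Toy 2-dimensional vectors (made up, not results from the paper).
emb_special = {
    "plaintiff": [1.0, 0.0],
    "defendant": [0.9, 0.1],
    "appeal": [0.0, 1.0],
}
emb_mixed = {
    "plaintiff": [1.0, 0.0],
    "defendant": [0.6, 0.8],
    "appeal": [0.0, 1.0],
    "player": [0.8, 0.2],  # a word contributed by the general corpus
}

for row in specificity_shift(["plaintiff"], emb_special, emb_mixed):
    print(row)
```

Repeating this comparison for mixed embeddings trained with progressively larger general corpora, and summarizing `kept` and `delta` over all unique terms, would be one way to examine the relationship with general-corpus size mentioned in the findings.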

Keywords

Corpus; Legal Document; Word Embedding; Word2Vec; Word Similarity; Evaluation for Mixed Embedding

