Towards Korean-Centric Token-free Pretrained Language Model

Jong-Hun Shin; Jeong Heo; Ji-Hee Ryu; Ki-Young Lee; Young-Ae Seo; Jin Seong; Soo-Jong Lim

Annual Conference on Human and Language Technology (한국정보과학회 언어공학연구회:학술대회논문집(한글 및 한국어 정보처리))

2023.10a / Pages 711-715 / 2023 / 2005-3053 (pISSN)

Human and Language Technology (한국정보과학회 언어공학연구회)

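Korean title: 한국어 중심의 토큰-프리 언어 이해-생성 모델 사전학습 연구 (A Study on Pretraining a Korean-Centric Token-free Language Understanding-Generation Model)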

Jong-Hun Shin (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Jeong Heo (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Ji-Hee Ryu (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Ki-Young Lee (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Young-Ae Seo (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Jin Seong (Electronics and Telecommunications Research Institute, Language Intelligence Research Section) ;
Soo-Jong Lim (Electronics and Telecommunications Research Institute, Language Intelligence Research Section)

Published : 2023.10.12

Abstract

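This study concerns a token-free pretrained language model that operates directly on byte-level encodings, without the subword tokenization step used by most language models. Token-free language models have no explicit out-of-vocabulary token, require only simple preprocessing, and can accommodate a wide range of languages and writing systems. However, related research remains limited, and such models have been reported to be harder to train and to perform worse than subword models. In this study, we pretrain a Korean-centric token-free language understanding-generation model and compare it with subword-based models to examine its potential. In addition, we apply a gradient-based subword tokenizer, which can reduce the excessive computation often cited as a drawback of token-free language models, and improve processing speed by 2.7x in training and 1.46x in inference.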
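To make the byte-level input described in the abstract concrete, the following is a minimal sketch of token-free encoding in Python. It is not the authors' implementation: the function names `encode` and `decode` and the special-ID layout (three reserved IDs, with byte values shifted by 3) are assumptions chosen for illustration. It shows why no out-of-vocabulary token is needed, and why sequences grow long for Korean, since each Hangul syllable occupies three bytes in UTF-8.

```python
# Hypothetical ID layout (an assumption for illustration only):
# 0 = padding, 1 = end-of-sequence, 2 = reserved; byte values are shifted by 3.
NUM_SPECIAL = 3
EOS_ID = 1


def encode(text: str) -> list[int]:
    """Map text to byte-level IDs. Any string is representable, so no OOV token."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Invert encode(), skipping the reserved special IDs."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    text = "한국어 model"
    ids = encode(text)
    # 9 characters become 15 bytes plus one end-of-sequence ID.
    print(len(text), len(ids))  # 9 16
    print(decode(ids) == text)  # True
```

The roughly threefold growth in sequence length for Hangul text is the kind of computational overhead that a downsampling step such as the gradient-based subword tokenizer mentioned in the abstract is meant to reduce.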

Acknowledgement

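This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded in 2022 by the Korean government (Ministry of Science and ICT) (No. RS-2022-00187238, Development of Pretraining Technology for Korean Large Language Models Enabling Efficient Pretraining).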