• Title/Summary/Keyword: Language Processing

A Survey on Deep Learning-based Pre-Trained Language Models (딥러닝 기반 사전학습 언어모델에 대한 이해와 현황)

  • Sangun Park
    • The Journal of Bigdata
    • /
    • v.7 no.2
    • /
    • pp.11-29
    • /
    • 2022
  • Pre-trained language models are the most important and most widely used tools in natural language processing tasks. Because they have been pre-trained on a large corpus, high performance can be expected even after fine-tuning with a small amount of data. Since the elements necessary for implementation, such as a pre-trained tokenizer and a deep learning model with pre-trained weights, are distributed together, the cost and time of natural language processing have been greatly reduced. Transformer variants are the most representative pre-trained language models providing these advantages, and they are also being actively used in other fields such as computer vision and audio applications. To make it easier for researchers to understand pre-trained language models and apply them to natural language processing tasks, this paper defines the language model and the pre-trained language model, and discusses the development of pre-trained language models, especially the representative Transformer variants.
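The few-shot benefit of starting from pre-trained weights can be illustrated with a deliberately tiny NumPy sketch (a generic illustration, not the survey's code or a real Transformer): weights "pre-trained" on a large related dataset initialise training on only 20 labelled target examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, w, lr=0.5, steps=200):
    """Gradient-descent training of a logistic classifier, starting from weights w."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)   # cross-entropy gradient step
    return w

# "Pre-training": learn weights on a large related dataset.
X_big = rng.normal(size=(500, 8))
y_big = (X_big[:, 0] + X_big[:, 1] > 0).astype(float)
w_pre = train(X_big, y_big, np.zeros(8))

# "Fine-tuning": continue from w_pre on only 20 target examples.
X_small = rng.normal(size=(20, 8))
y_small = (X_small[:, 0] + X_small[:, 1] > 0).astype(float)
w_ft = train(X_small, y_small, w_pre.copy(), steps=50)

train_acc = float(np.mean(((X_small @ w_ft) > 0) == y_small))
```

Initialising from `w_pre` rather than zeros is the whole point: the small target set only has to nudge already-useful weights.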

Numerical data processing on expert system for power system fault restoration - in IBM PC Turbo prolog - (계통 사고 복구 전문가 시스템에서의 수치 데이타 처리 - IBM PC 용 Turbo prolog 에서 -)

  • Choi, Joon-Young;Park, In-Gyu;Park, Jong-Keun
    • Proceedings of the KIEE Conference
    • /
    • 1987.11a
    • /
    • pp.316-320
    • /
    • 1987
  • This paper deals with an expert system for power system fault restoration and the accompanying numerical data processing. Expert systems, a branch of artificial intelligence, are expanding into many application areas, which requires AI computer languages to be versatile. An expert system for power systems handles numerous numerical data, yet AI languages are deficient in numerical data processing. However, some recent versions of AI languages find ways of overcoming this dilemma by providing a way of linking conventional algorithmic languages to them. This study presents numerical data processing routines written in Turbo Prolog running on an IBM PC, and the linking of numerical data processing routines written in Turbo C to Turbo Prolog.

High Speed Substring Analysis Algorithm for Converting from the Korean Company Name to Roman Characters (한글 상호(商號)를 로마자로 변환하기 위한 고속 부분문자열 분석 알고리즘)

  • Myeong-jin Hwang;Sun-ho Jo;Hyuk-chul Kwon
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2008.11a
    • /
    • pp.168-170
    • /
    • 2008
  • The Korean company-name romanizer is a system that automatically converts company names written in Hangul into Roman characters. The converter generates a romanized company name by combining already-used romanized company and business-type names with Roman characters generated by the standard Hangul romanization rules. This combination requires an algorithm, and the stack algorithm previously used for similar purposes is inefficient when applied here. This paper proposes a new algorithm to replace it. The new algorithm improves performance by reducing the complexity from O(b^d) with the stack algorithm to O(b·d).

Morpheme Conversion for Korean Text-to-Sign Language Translation System (한국어-수화 번역시스템을 위한 형태소 변환)

  • Park, Su-Hyun;Kang, Seok-Hoon;Kwon, Hyuk-Chul
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.3
    • /
    • pp.688-702
    • /
    • 1998
  • In this paper, we propose sign language morpheme generation rules corresponding to the morpheme analysis of each part of speech. Korean natural sign language has an extremely limited vocabulary, and the number of grammatical components currently in use is also limited. We therefore define a natural sign language grammar corresponding to Korean grammar in order to translate natural Korean sentences into the corresponding sign language. For each phrase, a sign language morpheme generation grammar is defined, which differs from the Korean analysis grammar; this grammar is then applied through morpheme analysis/combination rules and sentence structure analysis rules. Defining this grammar allows us to generate highly natural sign language.

English Tutoring System Using Chatbot and Dialog System (챗봇과 대화시스템을 이용한 영어 교육 시스템)

  • Choi, Sung-Kwon;Kwon, Oh-Woog;Lee, Kiyoung;Roh, Yoon-Hyung;Huang, Jin-Xia;Kim, Young-Gil
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2017.04a
    • /
    • pp.958-959
    • /
    • 2017
  • This paper describes an English tutoring system that uses a chatbot and a dialog system. The system does not restrict the learner's dialog flow: it allows free conversation that strays from the topic and gives feedback on grammatical errors. The system was evaluated by dialog-turn success rate; the average success rate was 80.86%, and by topic it was 1) buying New York City tour tickets 71.86%, 2) ordering food 71.06%, 3) talking about health habits 85.41%, and 4) surveying opinions on future currency 95.09%. English grammar error correction was also measured: the precision of grammar-error correction was 66.7% and the recall was 31.9%.
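The precision and recall reported above can be combined into an F1 score as a quick derived figure (the F1 itself is not reported in the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported grammar-correction figures: precision 66.7%, recall 31.9%.
print(round(f1_score(0.667, 0.319), 3))  # → 0.432
```

The low recall relative to precision suggests the corrector is conservative: it flags few errors, but the ones it flags are usually real.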

Multilingual Automatic Translation Based on UNL: A Case Study for the Vietnamese Language

  • Thuyen, Phan Thi Le;Hung, Vo Trung
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.5 no.2
    • /
    • pp.77-84
    • /
    • 2016
  • In the field of natural language processing, Universal Networking Language (UNL) has been used by various researchers as an interlingual approach to automatic machine translation. The UNL system consists of two main components: EnConverter, for converting text from a source language to UNL, and DeConverter, for converting from UNL to a target language. Many projects are currently researching how to apply UNL to different languages. In this paper, we introduce the tools developed as UNL applications and discuss how to reuse them to encode Vietnamese sentences into UNL expressions and decode UNL expressions into Vietnamese sentences. Testing was done with about 1,000 Vietnamese sentences, using a dictionary of 4,573 entries and 3,161 rules. In addition, we compare the proportion of sentences translated by a direct method (Google Translate) and by the UNL-based one.

A Language Model based on VCCV of Sentence Speech Recognition (문장 음성 인식을 위한 VCCV기반의 언어 모델)

  • 박선희;홍광석
    • Proceedings of the IEEK Conference
    • /
    • 2003.07e
    • /
    • pp.2419-2422
    • /
    • 2003
  • To improve the performance of sentence speech recognition systems, we need to consider the perplexity of the language model and the size of the dictionary as the vocabulary grows. In this paper, we propose a language model based on VCCV units for sentence speech recognition. We choose VCCV units as the processing unit of the language model and compare them with clauses and morphemes. Clauses and morphemes have large vocabularies and high perplexity, whereas VCCV units have a small lexicon and a limited vocabulary, so their advantage is low perplexity. We built bigram language models over a given text and calculated the perplexity of each processing unit; the perplexity of VCCV units is lower than that of morphemes and clauses.
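The bigram-perplexity comparison described above can be sketched generically (an illustrative add-k-smoothed model, not the paper's implementation; the token lists would hold clause, morpheme, or VCCV units):

```python
import math
from collections import Counter

def bigram_perplexity(train, test, vocab_size, k=1.0):
    """Perplexity of an add-k smoothed bigram model trained on `train`,
    evaluated on `test` (both are lists of unit strings)."""
    bigrams = Counter(zip(train, train[1:]))
    unigrams = Counter(train)
    log_prob, n = 0.0, 0
    for prev, cur in zip(test, test[1:]):
        p = (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * vocab_size)
        log_prob += math.log2(p)
        n += 1
    return 2.0 ** (-log_prob / n)

# A small, regular unit inventory (like VCCV) tends toward low perplexity.
units = "a b a b a b a c a b".split()
ppl = bigram_perplexity(units, units, vocab_size=3)
```

Perplexity is bounded below by 1 and, roughly, above by the unit-inventory size, which is why shrinking the inventory can lower it.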

Comparative Analysis of Statistical Language Modeling for Korean using K-SLM Toolkits (K-SLM Toolkit을 이용한 한국어의 통계적 언어 모델링 비교)

  • Lee, Jin-Seok;Park, Jay-Duke;Lee, Geun-Bae
    • Annual Conference on Human and Language Technology
    • /
    • 1999.10e
    • /
    • pp.426-432
    • /
    • 1999
  • Statistical language models are an important knowledge source that can raise accuracy and reduce running time in many areas of natural language processing, so the performance of a language model directly affects the performance of natural language processing systems, especially speech recognition systems. This paper presents a variety of language-modeling experiments for building statistical language models of Korean and, by comparing the performance of the models, proposes a standard for statistical language modeling. It also provides basic data for deciding the vocabulary size when building a language model, based on the coverage achieved when only high-frequency morpheme- and eojeol-level vocabulary is applied to a general-purpose language model. This study should be of great help in judging the performance of statistical language models for speech recognition.

A Concept Language Model combining Word Sense Information and BERT (의미 정보와 BERT를 결합한 개념 언어 모델)

  • Lee, Ju-Sang;Ock, Cheol-Young
    • Annual Conference on Human and Language Technology
    • /
    • 2019.10a
    • /
    • pp.3-7
    • /
    • 2019
  • Natural language representation is the means of expressing the information in natural language so that a computer can use it. Current natural language representations are not fixed vectors learned once; the vectors change according to contextual information. BERT, in particular, represents natural language using the encoder of the Transformer model, but it takes a long time to train and requires large amounts of data. For faster natural-language-representation learning, this paper proposes a concept language model that combines word-sense information with BERT. As sense information, the part-of-speech information of words and the semantic-hierarchy information of nouns are represented abstractly. For the experiments, the Korean BERT model released by ETRI was used as the baseline, and the two models were compared on named-entity recognition. Their named-entity recognition results were similar, confirming that sense information can be an important contributor to natural language representation.

Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence

  • Mao, Makara;Peng, Sony;Yang, Yixuan;Park, Doo-Soon
    • Journal of Information Processing Systems
    • /
    • v.18 no.4
    • /
    • pp.549-561
    • /
    • 2022
  • In the Khmer writing system, the Khmer script is the official script of Cambodia, written from left to right without space separators, which makes it complicated and requires further analysis. Without clear standard guidelines, the space separator in Khmer is used inconsistently and informally to separate words in sentences. A segmentation method should therefore be developed alongside future Khmer natural language processing (NLP) to define appropriate rules for Khmer sentences. One of the essential components of Khmer language processing is splitting sentences into words and counting the words used; currently, Microsoft Word cannot count Khmer words correctly. This study presents a systematic library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. The BiMM algorithm combines forward maximal matching (FMM) and backward maximal matching (BMM) to improve word-segmentation accuracy. A prefix-tree data structure (trie) further improves the segmentation procedure by finding the children of each word's parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves the dictionary structure and reduces the number of errors. The method reduces errors by 8.57% compared with the FMM and BMM algorithms on 94,807 Khmer words.
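The FMM/BMM/BiMM procedure can be sketched as follows (an illustrative toy with a Latin-alphabet lexicon standing in for a Khmer dictionary; a plain set replaces the paper's trie, and the tie-break heuristic is a common choice, not necessarily the paper's):

```python
def fmm(text, lexicon, max_len=4):
    """Forward maximal matching: greedily take the longest lexicon word from the left."""
    i, out = 0, []
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:  # fall back to a single character
                out.append(text[i:j])
                i = j
                break
    return out

def bmm(text, lexicon, max_len=4):
    """Backward maximal matching: the same idea, scanning from the right end."""
    j, out = len(text), []
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:  # fall back to a single character
                out.insert(0, text[i:j])
                j = i
                break
    return out

def bimm(text, lexicon, max_len=4):
    """Bi-directional MM: run both passes and keep the segmentation with
    fewer single-character leftovers, breaking ties by fewer words."""
    candidates = [fmm(text, lexicon, max_len), bmm(text, lexicon, max_len)]
    return min(candidates, key=lambda seg: (sum(len(w) == 1 for w in seg), len(seg)))
```

With the lexicon `{"ab", "cd", "abc"}`, FMM segments `"abcd"` as `["abc", "d"]` while BMM yields `["ab", "cd"]`; BiMM prefers the latter because it leaves no single-character fragments, which is the kind of disagreement the bi-directional pass resolves.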