Search | Korea Science

Quality, not Quantity? : Effect of parallel corpus quantity and quality on Neural Machine Translation (양보다 질? : 병렬 말뭉치의 양과 질이 인공신경망 기계번역에 미치는 효과)

Park, Chanjun;Lee, Yeonsu;Lee, Chanhee;Lim, Heuiseok
- Annual Conference on Human and Language Technology / 2020.10a / pp.363-368 / 2020
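In the global era, machine translation research is being conducted worldwide to break down language barriers. With the advent of deep learning, performance has improved markedly over earlier rule-based and statistical methods, and much research is under way. The most important factors in building a neural machine translation model are the quantity and quality of the parallel corpus. In this paper, we collect a large-scale Korean-English corpus and apply parallel corpus filtering to satisfy both data quantity and quality. On IWSLT 16 and IWSLT 17, which are objective test sets for Korean-English machine translation, we achieve the best performance among existing Korean-English machine translation studies.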
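The abstract does not say which filtering rules were applied. As a rough illustration of what parallel corpus filtering can look like, the sketch below uses a few common heuristics (empty sides, untranslated copies, overlong sentences, implausible length ratios, exact duplicates). The function name `filter_parallel_corpus` and the thresholds `max_ratio` and `max_len` are assumptions made for this example, not details from the paper.

```python
def filter_parallel_corpus(pairs, max_ratio=2.0, max_len=100):
    """Keep (source, target) pairs that pass simple heuristic checks.

    The rules and thresholds here are illustrative assumptions only.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs with an empty side.
        if not src or not tgt:
            continue
        # Drop pairs where both sides are identical (likely untranslated).
        if src == tgt:
            continue
        src_len, tgt_len = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if src_len > max_len or tgt_len > max_len:
            continue
        # Drop pairs whose whitespace-token length ratio is implausible.
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Toy data, invented for this example.
toy_pairs = [
    ("안녕하세요", "Hello"),
    ("안녕하세요", "Hello"),                      # exact duplicate
    ("", "Empty source"),                         # empty side
    ("네", "Yes , I completely agree with you"),  # implausible length ratio
    ("OK", "OK"),                                 # untranslated copy
]
filtered = filter_parallel_corpus(toy_pairs)
```

On this toy input only the first pair survives. Whitespace tokenization is a crude proxy for length in Korean, so a real pipeline would probably count subword tokens instead.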

Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering (병렬 말뭉치 필터링을 적용한 Filter-mBART기반 기계번역 연구)

Moon, Hyeonseok;Park, Chanjun;Eo, Sugyeong;Park, JeongBae;Lim, Heuiseok
- Journal of the Korea Convergence Society / v.12 no.5 / pp.1-7 / 2021
In the latest trend of machine translation research, a model is pretrained on a large monolingual corpus and then fine-tuned on a parallel corpus. Although many studies tend to increase the amount of data used in the pretraining stage, it is hard to say that the amount of data must be increased to improve machine translation performance. In this study, through an experiment based on the mBART model using parallel corpus filtering, we suggest that high-quality data can yield better machine translation performance even with a smaller amount of data. We propose that it is more important to consider the quality of the data than its amount, and that this can serve as a guideline for building a training corpus.
https://doi.org/10.15207/JKCS.2021.12.5.001

A Study on Building Korean Dialogue Corpus for Punctuation and Quotation Mark Filling (문장 부호 자동 완성을 위한 한국어 말뭉치 구축 연구)

Han, Seunggyu;Yang, Kisu;Lim, HeuiSeok
- Annual Conference on Human and Language Technology / 2019.10a / pp.475-477 / 2019
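Punctuation marks are symbols used in writing to make sentence structure clear or to convey the writer's intent easily; they include quotation marks, commas, and periods. Correct insertion of punctuation is needed when humans must understand computer-generated sentences, as in dialogue systems, and to improve the quality of speech recognition (Speech-To-Text) output. This paper describes the construction of a Korean corpus needed to train a deep learning-based model that performs this task. The corpus consists of high-quality sentences selected by appropriate criteria from speeches delivered by officials of ministerial rank or higher in the government of the Republic of Korea. It contains 126,795 sentences and 1,633,817 words (a word with an attached particle is counted as a single word). It contains 121,256 periods and 67,097 commas.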

A study on performance improvement considering the balance between corpus in Neural Machine Translation (인공신경망 기계번역에서 말뭉치 간의 균형성을 고려한 성능 향상 연구)

Park, Chanjun;Park, Kinam;Moon, Hyeonseok;Eo, Sugyeong;Lim, Heuiseok
- Journal of the Korea Convergence Society / v.12 no.5 / pp.23-29 / 2021
Recent deep learning-based natural language processing studies try to improve performance by training on large amounts of data from various sources together. However, combining data from various sources into one training set may hinder performance improvement. In machine translation, data deviation arises from differences in translation (liberal, literal), style (colloquial, written, formal, etc.), domain, and so on. Combining these corpora into one for training can adversely affect performance. In this paper, we propose a new Corpus Weight Balance (CWB) method that considers the balance between parallel corpora in machine translation. In the experiments, the model trained with the balanced corpus performed better than the existing model. In addition, we propose an additional corpus construction process that enables coexistence with the human translation market and can build a high-quality parallel corpus even from a monolingual corpus.
https://doi.org/10.15207/JKCS.2021.12.5.023

An Improvement of the Learning Speed through Considered Distance on Jul-Gonu Game (거리를 고려한 줄고누게임의 학습속도 개선)

Shin, Yong-Woo;Chung, Tae-Choong
- Journal of Korea Game Society / v.10 no.1 / pp.105-113 / 2010
It takes a considerable amount of time to learn a board game because there are many game characters and many different stages. Also, the opponent is not a single character: the game is not one-on-one but group versus group. Strategy is therefore needed, and efficient learning is essential. When two candidates were both judged best during learning, a heuristic that uses knowledge of the Jul-Gonu problem domain was applied to improve the learning speed. To compare a normal character with an improved one, a Jul-Gonu game was created and the two played against each other. The improved character took distance into account when attacking the opponent. As a result, the improved character showed a faster learning speed.
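The abstract describes breaking ties between equally good moves with a distance-based heuristic but gives no details. The sketch below is one possible reading: when several moves share the best Q-value, prefer the move that ends closest to an opponent piece. The Manhattan metric, the function names, and the data layout (`q_values`, `moves`, `opponents`) are all assumptions made for this example.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two (row, col) board positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_move(q_values, moves, opponents, rng=random):
    """Pick the move with the highest Q-value, breaking ties by distance.

    q_values:  dict mapping move name -> Q-value
    moves:     dict mapping move name -> resulting (row, col) position
    opponents: list of (row, col) positions of opponent pieces
    """
    best = max(q_values[m] for m in moves)
    tied = [m for m in moves if q_values[m] == best]
    if len(tied) == 1:
        return tied[0]

    # Tie: prefer the move that ends closest to any opponent piece.
    def nearest(m):
        return min(manhattan(moves[m], o) for o in opponents)

    closest = min(nearest(m) for m in tied)
    candidates = [m for m in tied if nearest(m) == closest]
    # Any remaining tie is broken at random.
    return rng.choice(candidates)


# Toy position, invented for this example.
q = {"up": 0.0, "down": 0.0, "left": -1.0}
targets = {"up": (0, 1), "down": (2, 1), "left": (1, 0)}
chosen = choose_move(q, targets, opponents=[(3, 1)])
```

Here "up" and "down" tie on Q-value, and "down" is chosen because it lands one step from the opponent at (3, 1) rather than three. Early in training most Q-values are still equal, which is where a tie-breaking rule of this kind would have the most effect.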

Automatic Generation of Training Data for Korean Speech Recognition Post-Processor (한국어 음성인식 후처리기를 위한 학습 데이터 자동 생성 방안)

Seonmin Koo;Chanjun Park;Hyeonseok Moon;Jaehyung Seo;Sugyeong Eo;Yuna Hur;Heuiseok Lim
- Annual Conference on Human and Language Technology / 2022.10a / pp.465-469 / 2022
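As automatic speech recognition (ASR) technology has advanced, automatic post-processors have been studied as one way to improve the performance of ASR systems. Training a post-processor requires a parallel corpus that contains error types. One simple way to build it is to insert errors into correct sentences to generate erroneous sentences, producing a pseudo-parallel corpus. However, such errors may not be realistic. To mitigate this, there is a methodology that uses Back TranScription (BTS) to generate a parallel corpus for training a post-processor model. There is, however, a view that a corpus generated this way may contain too little noise. This study therefore compares the performance of the BTS methodology with that of methods in which noise intensity is added artificially. We confirm that BTS achieves the highest quantitative performance, and show through qualitative analysis that using BTS produces a parallel corpus containing more of the realistic errors that can occur in actual speech recognition.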
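The simple baseline mentioned above, inserting errors into correct sentences to obtain a pseudo-parallel corpus, can be sketched as follows. This is not BTS, which the abstract does not describe in detail. The noise operations (character deletion and duplication), the `rate` parameter, and the function names are assumptions made for this example.

```python
import random


def add_noise(sentence, rate, rng):
    """Randomly delete or duplicate non-space characters with probability `rate`."""
    out = []
    for ch in sentence:
        if ch != " " and rng.random() < rate:
            if rng.random() < 0.5:
                continue          # deletion: skip the character
            out.extend([ch, ch])  # duplication: emit the character twice
        else:
            out.append(ch)
    return "".join(out)


def build_pseudo_parallel(sentences, rate=0.1, seed=0):
    """Return (noisy, clean) pairs; `rate` plays the role of noise intensity."""
    rng = random.Random(seed)
    return [(add_noise(s, rate, rng), s) for s in sentences]


# Toy sentences, invented for this example.
clean = ["음성 인식 결과를 교정한다", "후처리기를 훈련한다"]
corpus = build_pseudo_parallel(clean, rate=0.2, seed=42)
```

Each pair holds the noised sentence as the source and the original as the target. Raising `rate` increases the noise intensity, which is the kind of knob the comparison in the abstract would vary. Random character edits like these are exactly the sort of error that may not resemble real recognition mistakes.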

Automated Generation of Word Balloons in Comics (만화 영상에서 말풍선의 자동 생성 방법)

Ryu, Dong-Sung;Chun, Bong-Kyung;Park, Kyu-Tae;Cho, Hwan-Gue
- Journal of the Korea Computer Graphics Society / v.13 no.1 / pp.33-36 / 2007
Generally, word balloons connect the script with the characters in comics. The location of word balloons depicts the progress of the story, because they are placed in reading order. Generating and placing word balloons is therefore important work, and it is usually done manually by comic writers. In this paper, we discuss the automated generation and placement of word balloons. For this, we modeled six kinds of word balloons. These word balloons are placed by a heuristic method based on EPFLP. We also generate the tail of each word balloon automatically by considering the direction and reference points of the word balloon.

Q-learning to improve learning speed using Minimax algorithm (미니맥스 알고리즘을 이용한 학습속도 개선을 위한 Q러닝)

Shin, YongWoo
- Journal of Korea Game Society / v.18 no.4 / pp.99-106 / 2018
Board games have many game characters and large state spaces, so learning takes a long time. This paper uses a reinforcement learning algorithm. Reinforcement learning, however, has a weakness: learning is slow at the beginning. We therefore tried to improve the learning speed with a heuristic that uses knowledge of the problem domain and considers the game tree when several candidates share the same best value during learning. To compare the existing character with the improved one, we produced a board game and had the improved character compete against a one-sided attacking character. The improved character attacked the opponent's character while considering the game tree. In the experiment, the improved character showed a faster learning speed.
https://doi.org/10.7583/JKGS.2018.18.4.99

A Study on the enforcement for Driving Under the Influence (주취운전 단속에 관한 논의)

Kang, Maeng-jin
- Proceedings of the Korea Contents Association Conference / 2016.05a / pp.119-120 / 2016
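Korea's Road Traffic Act provides that a person must not drive while intoxicated. In practice, however, the term "drunk driving" (eumju-unjeon) is more widely used than "driving while intoxicated" (juchwi-unjeon). Drunk driving likewise means, literally, driving after having consumed alcohol. All countries, including Korea, set enforcement standards in view of the danger of driving while intoxicated. Korea checks blood alcohol concentration and uses 0.05 as the enforcement threshold, and this threshold is currently under discussion. The police are gathering opinions on tightening the current drunk-driving enforcement threshold to 0.03%.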

A Korean POS Tagging System with Handling Corpus Errors (말뭉치 오류를 고려한 HMM 한국어 품사 태깅 시스템)

Seol, Yong-Soo;Kim, Dong-Joo;Kim, Kyu-Sang;Kim, Han-Woo
- KSCI Review / v.15 no.1 / pp.117-124 / 2007
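In part-of-speech tagging with statistical approaches, tagging accuracy depends on the amount of training data. Moreover, even when the corpus is large enough, a manually built corpus always carries the possibility of errors, and the nature of language makes it difficult to collect statistically reliable data. Because corpora used as training data are built by hand by many people, some annotators may lack linguistic knowledge or make tagging mistakes based on subjective judgment. The corpus can therefore contain errors other than noise related simply to low frequency, and such errors cannot be resolved by re-estimation or smoothing. For Korean part-of-speech tagging with an HMM (Hidden Markov Model), this paper proposes a method for resolving tagging errors caused by corpus noise that remains after re-estimation: at the stage where the Viterbi algorithm is applied, it identifies the parts that are problematic because of data sparsity and corpus errors and corrects them with rules to improve the tagging result. The experimental results show that accuracy improves after the error-correction process, compared with the tagging accuracy obtained with an HMM and the Viterbi algorithm built on the error-containing corpus.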
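For readers unfamiliar with the decoding step mentioned above, the following is a minimal sketch of standard Viterbi decoding for an HMM tagger, computed in log space. It does not include the paper's rule-based correction, which the abstract does not specify. The two-tag model and all probabilities are toy values invented for this example.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for `obs` under an HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[t][s]: best log-probability of any path ending in state s at time t
    scores = [{s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        scores.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t.
            prev, score = max(
                ((p, scores[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[t][s] = score + log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model: N = noun, V = verb. All probabilities are made up.
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"학교": 0.9, "간다": 0.1}, "V": {"학교": 0.1, "간다": 0.9}}
tags = viterbi(["학교", "간다"], states, start_p, trans_p, emit_p)
```

With these toy numbers the decoder tags "학교" as N and "간다" as V. An unseen word receives emission probability 0 for every tag in this sketch, which is one place where data sparsity shows up.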
