Search | Korea Science

Parallel Corpus Filtering and Korean-Optimized Subword Tokenization for Machine Translation (병렬 코퍼스 필터링과 한국어에 최적화된 서브 워드 분절 기법을 이용한 기계번역)

Park, Chanjun;Kim, Gyeongmin;Lim, Heuiseok
- Annual Conference on Human and Language Technology / 2019.10a / pp.221-224 / 2019
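With the advent of deep-learning-based Neural Machine Translation (NMT), machine translation now clearly outperforms the earlier rule-based and statistical approaches. This paper takes the view that, although the translation model matters, building and preprocessing high-quality training data matters even more, and it reports a range of experiments on both. When building a parallel corpus to train a neural machine translation system, securing high-quality data is critical. Such data is hard to obtain, however, because of copyright issues, the difficulty of constructing parallel corpora, and noise. To build high-quality training data, this paper presents a parallel corpus filtering technique. Unlike cleaning, parallel corpus filtering deletes the source and target of a pair together when the pair is judged unsuitable as training data. Another crucial step in machine translation is subword tokenization. Through a series of experiments, this paper identifies the subword tokenization method that gives the highest performance for Korean-English machine translation. In experiments on publicly available Korean-English parallel corpora, the model trained on filtered data achieved a higher BLEU score, and the best performance was obtained when the text was first split into morphological-analysis units, as proposed in this paper, and then subword-tokenized with a SentencePiece model using Unigram.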
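The pair-level deletion described in this abstract can be illustrated with a minimal sketch. The abstract does not list the paper's actual filtering rules, so the checks below (empty side, identical source and target, exact duplicates, token-length ratio) are common heuristics assumed purely for illustration, and the function name and thresholds are hypothetical.

```python
def filter_parallel_corpus(pairs, max_len_ratio=3.0, max_tokens=100):
    """Keep only (source, target) pairs that pass every heuristic check.

    A failing pair is dropped as a whole, so source and target stay aligned.
    All rules and thresholds here are illustrative assumptions.
    """
    kept = []
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the "translation" is just a copy of the source.
        if src == tgt:
            continue
        # Drop exact duplicate pairs.
        if (src, tgt) in seen:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overly long sentences.
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # Drop pairs whose whitespace-token counts differ too much.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


corpus = [
    ("안녕하세요", "Hello"),
    ("", "Empty source"),
    ("같은 문장", "같은 문장"),
    ("짧다", "This target sentence is far too long for its source"),
    ("안녕하세요", "Hello"),
]
filtered = filter_parallel_corpus(corpus)
```

Whitespace token counts are a crude length measure for Korean; a real pipeline would more likely compare lengths after tokenization.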

A Bidirectional Korean-Japanese Statistical Machine Translation System by Using MOSES (MOSES를 이용한 한/일 양방향 통계기반 자동 번역 시스템)

Lee, Kong-Joo;Lee, Song-Wook;Kim, Jee-Eun
- Journal of Advanced Marine Engineering and Technology / v.36 no.5 / pp.683-693 / 2012
Recently, statistical machine translation (SMT) has received much attention because it is easy to implement and maintain. The goal of our work is to build a bidirectional Korean-Japanese SMT system using the MOSES [1] system. We use a sentence-aligned Korean-Japanese bilingual corpus to train the translation model and a large raw corpus in each language to train the respective language model. The proposed system shows results comparable to those of a rule-based machine translation system. Most errors are caused by noise introduced at each processing stage.
https://doi.org/10.5916/jkosme.2012.36.5.683

Modelling Grammatical Pattern Acquisition using Video Scripts (비디오 스크립트를 이용한 문법적 패턴 습득 모델링)

Seok, Ho-Sik;Zhang, Byoung-Tak
- Annual Conference on Human and Language Technology / 2010.10a / pp.127-129 / 2010
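This paper introduces a methodology for acquiring grammatical patterns through unsupervised learning by modelling the process of learning a language from diverse corpora. The proposed method first represents only parts of latent patterns with a small number of feature combinations, then combines the represented rules to search for meaningful grammatical patterns. It refines feature combinations into meaningful grammatical patterns based on Bayesian inference and MCMC (Markov Chain Monte Carlo) sampling: a random hypergraph model generates a large number of hyperedges, and the weights of the generated hyperedges are then adjusted to search for meaningful grammatical patterns. We apply the methodology to the scripts of a variety of children's videos.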

A Transformation based Sentence Splitting method for Statistical Machine Translation (통계적 기계번역을 위한 변환 기반 문장 분할 방법)

Lee, Jongoon;Lee, Donghyeon;Lee, Gary Geunbae
- Annual Conference on Human and Language Technology / 2007.10a / pp.276-281 / 2007
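In statistical machine translation systems, an active area of recent research, translation quality degrades as the input sentence gets longer. To mitigate this, a long sentence can be split into shorter sentences with the same meaning, each translated separately, which improves translation performance. This paper proposes a transformation-based sentence splitting method for statistical machine translation. The method learns transformation rules from example sentences split by hand and applies them to the input sentences of the translation system, thereby maximizing the performance of phrase-based statistical machine translation.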

Service Selection Engine for Human-care Service Robot Based on a Hierarchical Multimodal Knowledge (휴먼케어 서비스 로봇을 위한 계층적 복합 지식 기반 서비스 선택 엔진)

Jang, Choulsoo;Jang, Minsu;Lee, Jaeyeon
- Proceedings of the Korea Information Processing Society Conference / 2018.10a / pp.896-899 / 2018
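Human-care service robots, intended to address the aging society, require a service selection engine in order to provide the best service to users in varied, dynamic environments. The service selection engine hierarchically processes the raw data collected by the robot into higher-level information and, at the final stage, selects the service to offer the user according to rules designed by human-care experts. This paper describes a service selection engine for human-care service robots that generates hierarchical knowledge in a hybrid manner, combining machine-learning-based and rule-based knowledge generation, and provides a mechanism for selecting services based on the generated knowledge.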
https://doi.org/10.3745/PKIPS.y2018m10a.896

P2P Traffic Classification using Advanced Heuristic Rules and Analysis of Decision Tree Algorithms (개선된 휴리스틱 규칙 및 의사 결정 트리 분석을 이용한 P2P 트래픽 분류 기법)

Ye, Wujian;Cho, Kyungsan
- Journal of the Korea Society of Computer and Information / v.19 no.3 / pp.45-54 / 2014
In this paper, an improved two-step P2P traffic classification scheme is proposed to overcome the limitations of existing methods. The first step is a signature-based classifier at the packet level. The second step consists of pattern heuristic rules and a statistics-based classifier at the flow level. With pattern heuristic rules, the accuracy can be improved and the amount of traffic to be classified by the statistics-based classifier can be reduced. Based on an analysis of different decision tree algorithms, the statistics-based classifier is implemented with REPTree. In addition, an ensemble algorithm is used to improve the performance of the statistics-based classifier. Verification with real datasets shows that our hybrid scheme provides higher accuracy and lower overhead than other existing schemes.
https://doi.org/10.9708/jksci.2014.19.3.045

Korean and English Text Chunking Using IG Back-off Smoothing and Probabilistic Model (IG back-off 평탄화와 확률 기반 모델을 이용한 한국어 및 영어 단위화)

Yi, Eun-Ji;Lee, Geun-Bae
- Annual Conference on Human and Language Technology / 2002.10e / pp.118-123 / 2002
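In many areas of natural language processing, text chunking holds an important place as a basic processing step. Most previous work on Korean chunking used rule-based methods or machine learning techniques. This paper presents a chunking method that uses a purely probabilistic model as a statistical approach. Because a probabilistic model can be applied without deep knowledge of the target language, it can serve as a baseline model for chunking in many languages. To address the data sparseness problem, the IG back-off smoothing method used in memory-based learning was also applied to the system. In experiments on Korean and English with a chunking system based on the proposed model, accuracies of 90.0% and 90.0%, respectively, were obtained despite training on a relatively small corpus.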

Web Page Classification System based upon Ontology (온톨로지 기반의 웹 페이지 분류 시스템)

Choi Jaehyuk;Seo Haesung;Noh Sanguk;Choi Kyunghee;Jung Gihyun
- The KIPS Transactions:PartB / v.11B no.6 / pp.723-734 / 2004
In this paper, we present an automated Web page classification system based on an ontology. As a first step, to identify the representative terms for a given set of classes, we compute the product of term frequency and document frequency. Second, each term is prioritized by its information gain, that is, by how useful it is for classification. We compile pairs of selected terms and Web page classes into rules using machine learning algorithms. The compiled rules classify any Web page into the categories defined in a domain ontology. In the experiments, 78 of 240 terms were identified as representative features for a given set of Web pages. The resulting classification accuracy was, on average, 83.52%.
https://doi.org/10.3745/KIPSTB.2004.11B.6.723
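The information-gain ranking mentioned in this abstract can be sketched as follows. This is the standard textbook definition of information gain for a binary "term present / term absent" split, not necessarily the exact formulation used in the paper; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs, labels, term):
    """Entropy reduction from splitting documents on presence of `term`.

    `docs` is a list of token sets; `labels` holds one class label per doc.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Weighted entropy of the two partitions (empty partitions contribute 0).
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional


# Toy data, invented for illustration only.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"stock", "bank"},
]
labels = ["sports", "sports", "finance", "finance"]

# Rank candidate terms from most to least informative.
terms = sorted({t for d in docs for t in d})
ranked = sorted(terms, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy set, "goal" and "stock" each separate the two classes perfectly, so they rank ahead of terms that appear in only one document.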

Rule Discovery for Cancer Classification using Genetic Programming based on Arithmetic Operators (산술 연산자 기반 유전자 프로그래밍을 이용한 암 분류 규칙 발견)

홍진혁;조성배
- Journal of KIISE:Software and Applications / v.31 no.8 / pp.999-1009 / 2004
As a new approach to the diagnosis of cancers, bioinformatics attracts great interest these days. Machine learning techniques have produced valuable results, but the field of medicine requires not only highly accurate classifiers but also effective analysis and interpretation of them. Since gene expression data in bioinformatics consist of tens of thousands of features, it is nearly impossible to represent their relations directly. In this paper, we propose a method composed of a feature selection method and genetic programming. Rank-based feature selection is adopted to select useful features, and genetic programming based on arithmetic operators is used to generate classification rules from the selected features. Experimental results on a lymphoma cancer dataset, on which the proposed method obtained 96.6% test accuracy as well as useful classification rules, demonstrate the validity of the proposed method.

Extracting Rules from Neural Networks with Continuous Attributes (연속형 속성을 갖는 인공 신경망의 규칙 추출)

Jagvaral, Batselem;Lee, Wan-Gon;Jeon, Myung-joong;Park, Hyun-Kyu;Park, Young-Tack
- Journal of KIISE / v.45 no.1 / pp.22-29 / 2018
Over the decades, neural networks have been successfully used in numerous applications from speech recognition to image classification. However, these neural networks cannot explain their results and one needs to know how and why a specific conclusion was drawn. Most studies focus on extracting binary rules from neural networks, which is often impractical to do, since data sets used for machine learning applications contain continuous values. To fill the gap, this paper presents an algorithm to extract logic rules from a trained neural network for data with continuous attributes. It uses hyperplane-based linear classifiers to extract rules with numeric values from trained weights between input and hidden layers and then combines these classifiers with binary rules learned from hidden and output layers to form non-linear classification rules. Experiments with different datasets show that the proposed approach can accurately extract logical rules for data with nonlinear continuous attributes.
https://doi.org/10.5626/JOK.2018.45.1.22

Search Result 92, Processing Time 0.031 seconds
