• Title/Summary/Keyword: Hangul normalization (한글 정규화)

A Study on Optimization of Support Vector Machine Classifier for Word Sense Disambiguation (단어 중의성 해소를 위한 SVM 분류기 최적화에 관한 연구)

  • Lee, Yong-Gu
    • Journal of Information Management / v.42 no.2 / pp.193-210 / 2011
  • This study varied the context window size and the term weighting method to find the best word sense disambiguation performance with a support vector machine. Four context windows around the target word were compared: a 3-word window, a sentence window, a 50-byte window, and a document window. The weighting methods were Binary, Term Frequency (TF), TF × Inverse Document Frequency (IDF), and Log TF × IDF. The 50-byte context window performed best, and among the weighting methods Binary performed best.
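
The four weighting methods named in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `weight_context`, the IDF form log(N / df), and the log-TF form 1 + log(tf) are assumptions, since the abstract does not give the formulas.

```python
import math
from collections import Counter


def weight_context(tokens, doc_freq, n_docs, scheme="binary"):
    """Return {term: weight} for the tokens of one context window.

    doc_freq maps a term to the number of documents containing it.
    The IDF and log-TF formulas below are assumed textbook forms.
    """
    counts = Counter(tokens)
    weights = {}
    for term, count in counts.items():
        # Assumed IDF: log(N / df); unseen terms fall back to df = 1.
        idf = math.log(n_docs / doc_freq.get(term, 1))
        if scheme == "binary":
            weights[term] = 1.0              # presence only
        elif scheme == "tf":
            weights[term] = float(count)     # raw term frequency
        elif scheme == "tfidf":
            weights[term] = count * idf
        elif scheme == "logtfidf":
            weights[term] = (1.0 + math.log(count)) * idf
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights
```

The resulting dictionaries would then be turned into feature vectors for the SVM; that step is omitted here.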

A Study on Generation Method of Intonation using Peak Parameter and Pitch Lookup-Table (Peak 파라미터와 피치 검색테이블을 이용한 억양 생성방식 연구)

  • Jang, Seok-Bok;Kim, Hyung-Soon
    • Annual Conference on Human and Language Technology / 1999.10e / pp.184-190 / 1999
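  • This paper takes a data-driven approach to the intonation model of a Text-to-Speech system: model parameters and a pitch lookup table are extracted from a speech database and built in advance, and at synthesis time they are estimated to generate the final F0 values. Break indices at word-phrase (eojeol) boundaries are subdivided into fixed and variable break indices according to their characteristics, and Intonation Phrases and Accentual Phrases are set on the basis of the predicted break indices. The Accentual Phrase model in particular focuses on accurately modeling the peak, which is perceptually and acoustically important. The peak's time-axis value, its frequency-axis value, and the slopes before and after it are estimated as four parameters, and prediction rules for these parameters are built with CART (Classification and Regression Tree). For particles and endings, where boundary tones appear, a lookup table of normalized pitch values and key-indices is built to predict pitch values more precisely. In a listening test with speech synthesized by a synthesizer built in the authors' laboratory, the proposed intonation model produced more natural synthetic speech than existing commercial Text-to-Speech systems.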

Distinction of the Korean and English Character Using the Stroke Density (획 밀도를 이용한 한영 구분)

  • Won, Nam-Sik;Jeon, Il-Soo;Lee, Doo-Han
    • The Transactions of the Korea Information Processing Society / v.4 no.7 / pp.1873-1880 / 1997
  • In a document recognition system that handles multiple fonts and multiple scripts, identifying the script of each character before recognition is important for raising the recognition rate. The letters of each script have distinctive characteristics in their composition. In this paper, we use stroke density to distinguish scripts and apply it only to Korean and English characters. Input data are normalized so that the method can handle multi-font documents. Experiments show that the proposed method distinguishes Korean from English characters with a probability of more than 90%.

A Study of the Automatic Extraction of Hypernyms and Hyponyms from the Corpus (코퍼스를 이용한 상하위어 추출 연구)

  • Pang, Chan-Seong
    • Annual Conference on Human and Language Technology / 2007.10a / pp.46-53 / 2007
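  • This study proposes a method for extracting patterns that express hypernym-hyponym relations between words in a corpus. Because of the constraints imposed by the free word order of Korean, earlier pattern extraction relied mainly on dictionary definitions; this study instead uses a corpus to present a wider variety of patterns. A list of hypernym-hyponym pairs is first selected from the Sejong Electronic Dictionary and then extended with pairs from CoreNet. Sentences containing word pairs from these two lists are extracted from the corpus, those that can be systematically patterned are selected, and they are generalized into 21 patterns. The 21 patterns were written as regular expressions and used to extract matching sentences from the corpus again, which gave a measured precision of 57%.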
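
The pattern-as-regular-expression step can be sketched as follows. The single pattern here ("<hyponym>와/과 같은 <hypernym>", roughly "<hypernym> such as <hyponym>") is a hypothetical example chosen for illustration and is not claimed to be one of the study's 21 patterns; `extract_pairs` and `precision` are likewise illustrative names.

```python
import re

# Hypothetical pattern: "<hyponym>와/과 같은 <hypernym>".
# \w matches Hangul syllables in Python 3 string patterns.
PATTERN = re.compile(r"(\w+)[와과] 같은 (\w+)")


def extract_pairs(sentences):
    """Return (hyponym, hypernym) candidates found by the pattern."""
    pairs = []
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            pairs.append((match.group(1), match.group(2)))
    return pairs


def precision(extracted, gold):
    """Share of extracted pairs that appear in a gold-standard set."""
    if not extracted:
        return 0.0
    return sum(1 for pair in extracted if pair in gold) / len(extracted)
```

A real pattern set would also need to strip case particles attached to the captured words, which this sketch does not attempt.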

A Study on the Construction of Specialized NER Dataset for Personal Information Detection (개인정보 탐지를 위한 특화 개체명 주석 데이터셋 구축 및 분류 실험)

  • Hyerin Kang;Li Fei;Yejee Kang;Seoyoon Park;Yeseul Cho;Hyeonmin Seong;Sungsoon Jang;Hansaem Kim
    • Annual Conference on Human and Language Technology / 2022.10a / pp.185-191 / 2022
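  • As awareness of personal information and its importance grows, the task of detecting personal information in text is attracting attention. In this study, we devised seven named entity tags specialized for personal information, intended for detection and de-identification, and used them to build a specialized named entity dataset by substituting synthetic data into de-identified source data and annotating the entities. KR-ELECTRA was used for the personal information classification experiments. The results confirmed that deep-learning-based detection using the specialized named entities outperforms rule-based detection built on general named entities and regular expressions.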
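
For context, a regular-expression baseline of the kind the abstract compares against might look like the sketch below. The two patterns (a Korean mobile number format and a simplified e-mail address) and the labels `PHONE` and `EMAIL` are assumptions for illustration; they are not the study's rules or its seven tags.

```python
import re

# Hypothetical patterns for illustration only; not the rules used in the study.
PII_PATTERNS = {
    "PHONE": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_pii(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    hits.sort()  # order by position in the text
    return [(label, value) for _, label, value in hits]
```

Rules like these only catch surface formats they were written for, which is one plausible reason a trained tagger can do better on free text.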

A Malicious Comments Detection Technique on the Internet using Sentiment Analysis and SVM (감성분석과 SVM을 이용한 인터넷 악성댓글 탐지 기법)

  • Hong, Jinju;Kim, Sehan;Park, Jeawon;Choi, Jaehyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.20 no.2 / pp.260-267 / 2016
  • The Internet has greatly changed how we share information with one another. However, like every social phenomenon it has two sides, and it has brought serious social problems. Malicious users take advantage of anonymity on the Internet to post aggressive comments involving defamation, personal attacks, privacy violations, and more. Malicious comments are creating the biggest problem among the unlawful acts and insults that occur on the Internet. Several studies have sought to manage such comments efficiently, but previous research is limited in recognizing modified malicious vocabulary. In this paper, we propose a malicious comment detection technique that addresses this limitation. In experiments the technique reached an accuracy of 87.8%, higher than that of previous studies.

The FE-MCBP for Recognition of the Tilted New-Type Vehicle License Plate (기울어진 신규차량번호판 인식을 위한 FE-MCBP)

  • Koo, Gun-Seo
    • Journal of the Korea Society of Computer and Information / v.12 no.5 / pp.73-81 / 2007
  • This paper presents a method for recognizing new-type vehicle license plates with a multi-link recognizer after extracting features from the characters. To this end, it proposes FE-MCBP to recognize each character obtained through image preprocessing, extraction of the license plate region, and extraction of the individual characters. FE-MCBP is a recognizer based on character features; it is used to identify new-type vehicle license plates, which contain both Hangul and Arabic numeral characters. Its recognition rate is 9.7 percent higher than that of the earlier back-propagation recognizer. The system also uses linear component extraction and region coordinate generation to normalize the image of a tilted license plate, so it can recognize tilted or imperfect vehicle license plates.

Methods for Video Caption Extraction and Extracted Caption Image Enhancement (영화 비디오 자막 추출 및 추출된 자막 이미지 향상 방법)

  • Kim, So-Myung;Kwak, Sang-Shin;Choi, Yeong-Woo;Chung, Kyu-Sik
    • Journal of KIISE:Software and Applications / v.29 no.4 / pp.235-247 / 2002
  • Efficient indexing and retrieval of digital video data requires research on video caption extraction and recognition. This paper proposes methods for extracting artificial captions from video data and enhancing their image quality for accurate Hangul and English character recognition. In the proposed methods, we first find the beginning and ending frames that share the same caption content and combine the frames in each group by a logical operation to remove background noise. During this process, an evaluation is performed to detect integrated results that mix different caption images. After the frames are integrated, four image enhancement techniques are applied to the image: resolution enhancement, contrast enhancement, stroke-based binarization, and morphological smoothing. Applying these operations to the video frames improves the image quality even of phonemes with complex strokes. Finding the beginning and ending frames with the same caption content can also be used effectively for digital video indexing and browsing. We tested the proposed methods on video caption images from films containing both Hangul and English characters and obtained improved character recognition results.
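
The multi-frame integration step can be sketched as follows. The abstract says only that frames are combined "by logical operation"; this sketch assumes binarized frames and a pixel-wise AND, which keeps pixels that are on in every frame (the static caption) and drops pixels that change (moving background). The function name `integrate_frames` is illustrative.

```python
def integrate_frames(frames):
    """Pixel-wise AND of equally sized binary frames.

    Each frame is a list of rows, each row a list of 0/1 values.
    AND is an assumed choice; the abstract does not name the operation.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [int(all(frame[r][c] for frame in frames)) for c in range(width)]
        for r in range(height)
    ]
```

For example, three 2×3 frames that agree only on two pixels reduce to a frame with just those two pixels set.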