• Title/Summary/Keyword: text segmentation

Search Result 140, Processing Time 0.03 seconds

A Rule-Based Analysis from Raw Korean Text to Morphologically Annotated Corpora

  • Lee, Ki-Yong;Markus Schulze
    • Language and Information
    • /
    • v.6 no.2
    • /
    • pp.105-128
    • /
    • 2002
  • Morphologically annotated corpora are the basis for many tasks of computational linguistics. Most current approaches use statistically driven methods of morphological analysis, that provide just POS-tags. While this is sufficient for some applications, a rule-based full morphological analysis also yielding lemmatization and segmentation is needed for many others. This work thus aims at 〔1〕 introducing a rule-based Korean morphological analyzer called Kormoran based on the principle of linearity that prohibits any combination of left-to-right or right-to-left analysis or backtracking and then at 〔2〕 showing how it on be used as a POS-tagger by adopting an ordinary technique of preprocessing and also by filtering out irrelevant morpho-syntactic information in analyzed feature structures. It is shown that, besides providing a basis for subsequent syntactic or semantic processing, full morphological analyzers like Kormoran have the greater power of resolving ambiguities than simple POS-taggers. The focus of our present analysis is on Korean text.

  • PDF

An Efficient Text Detection Model using Bidirectional Feature Fusion (양방향 특징 결합을 이용한 효율적 문자 탐지 모델)

  • Lim, Seong-Taek;Choi, Hoeryeon;Lee, Hong-Chul
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2021.07a
    • /
    • pp.67-68
    • /
    • 2021
  • 기존 객체탐지는 경계 상자 회귀방식을 적용하였지만, 문자는 왜곡과 변형이 심한 특성을 가진 객체로 U-net 구조의 이미지 분할 방식을 사용하는 경우가 많다. 따라서 최근 문자 탐지는 통계적 모델에 비해 높은 정확도를 보이는 심층 신경망 기반의 모델 연구가 많이 진행되고 있다. 본 연구에서는 이미지 분할을 통한 양방향 특징 결합 기법을 사용한 문자 탐지 모델을 제안한다. 이미지 분할 방식은 메모리의 효율이 떨어지기 때문에 이를 극복하고자 특징 추출 단계에서 경량화된 네트워크를 적용하였다. 또한, 객체 탐지에서 큰 성과를 보인 양방향 특징 결합 모듈을 U-net 구조에 추가하여 추출된 특징이 효과적으로 결합 되는 결과를 얻었다. 제안하는 모델의 문자 탐지 성능은 합성 문자 데이터셋을 이용한 실험을 통해 기존의 U-net 구조의 이미지 분할 방식보다 향상되었음을 확인하였다.

  • PDF

Automatic Word-Segmentation at Line-Breaks for Korean Text Processing (한국어 텍스트 처리를 위한 줄 경계 띄어쓰기 복원)

  • 정영미;이재윤
    • Proceedings of the Korean Society for Information Management Conference
    • /
    • 1999.08a
    • /
    • pp.21-24
    • /
    • 1999
  • 한국어 텍스트의 줄 경계에서의 띄어쓰기 복원을 위해 음절쌍 통계를 이용한 복원 기법을 설계하고 신문기사를 대상으로 통계 정보원과 음절쌍 위치에 따른 가중치를 달리하는 실험을 수행하였다. 실험 결과 처리 대상 기사를 포함하는 1개월 분 기사를 통계 정보원으로 하고 가중치는 균등하게 할 때 가장 높은 성공률을 얻었다. 이 결과는 디지털 원문을 텍스트 방식으로 소급하여 구축하는 경우에 적용될 수 있을 것이다.

  • PDF

Temporal Segmentation of Mobile Text Message (시간정보에 기반한 핸드폰 문자의 대화 구분)

  • Jung, Hun-Young;Lee, Jong-Hyeok
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2012.06b
    • /
    • pp.306-308
    • /
    • 2012
  • 핸드폰 사용이 보편화되고 핸드폰의 문자 사용량이 늘어감에 따라 대량의 핸드폰 문자 메시지를 구축하는 건이 가능해졌다. 이러한 문자 데이터를 처리에 기반이 되는 대화 구분 방법을 제안하였다. 이 방법론은 기존 문서분류 방식을 적용하는데 발생하는 문제를 피하기 위해 시간정보를 사용하는 비지도학습 방법론이다. 해당 방법을 실제 핸드폰 메시지 데이터에 적용한 결과 정확율과 재현율에서 0.9를 넘는 높은 성능을 보였다.

Text segmentation using concept hierarchy tree (계층적 개념 트리를 이용한 문서 분할 기법)

  • 이병희;최익규;박승규;김인구
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2003.10a
    • /
    • pp.166-168
    • /
    • 2003
  • 문서 분할 기법은 문서 내에 존재하는 다양한 주제들을 자동적으로 추출하는 기법이다. 이 분야의 연구는 크게 사전적 관계에 근거한 기법과 통계적 데이터에 근거한 기법으로 나누어져 연구되어 왔다. 사전적 관계에 의한 기법은 단어들의 사전적 의미와 관계에 근거한 기법이고 통계적 데이터에 의한 기법은 주로 단어들의 분포를 이용한 기법이다. 여기에는 몇가지 문제점이 있는데 사전적 관계에 근거한 경우에는 분산된 주제들을 통합하여 추출하기 어렵고. 통계적 데이터에 근거한 기법은 정확한 주제의 개수를 찾기 어렵다는 점이다. 본 논문에서는 계층적 개념 트리를 이용하여 보다 정확한 개수의 주제들을 찾아낼 수 있는 문서 분할 기법에 대해 소개 하고자 한다.

  • PDF

Korean Base-Noun Extraction and its Application (한국어 기준명사 추출 및 그 응용)

  • Kim, Jae-Hoon
    • The KIPS Transactions:PartB
    • /
    • v.15B no.6
    • /
    • pp.613-620
    • /
    • 2008
  • Noun extraction plays an important part in the fields of information retrieval, text summarization, and so on. In this paper, we present a Korean base-noun extraction system and apply it to text summarization to deal with a huge amount of text effectively. The base-noun is an atomic noun but not a compound noun and we use tow techniques, filtering and segmenting. The filtering technique is used for removing non-nominal words from text before extracting base-nouns and the segmenting technique is employed for separating a particle from a nominal and for dividing a compound noun into base-nouns. We have shown that both of the recall and the precision of the proposed system are about 89% on the average under experimental conditions of ETRI corpus. The proposed system has applied to Korean text summarization system and is shown satisfactory results.

Detection of Number and Character Area of License Plate Using Deep Learning and Semantic Image Segmentation (딥러닝과 의미론적 영상분할을 이용한 자동차 번호판의 숫자 및 문자영역 검출)

  • Lee, Jeong-Hwan
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.1
    • /
    • pp.29-35
    • /
    • 2021
  • License plate recognition plays a key role in intelligent transportation systems. Therefore, it is a very important process to efficiently detect the number and character areas. In this paper, we propose a method to effectively detect license plate number area by applying deep learning and semantic image segmentation algorithm. The proposed method is an algorithm that detects number and text areas directly from the license plate without preprocessing such as pixel projection. The license plate image was acquired from a fixed camera installed on the road, and was used in various real situations taking into account both weather and lighting changes. The input images was normalized to reduce the color change, and the deep learning neural networks used in the experiment were Vgg16, Vgg19, ResNet18, and ResNet50. To examine the performance of the proposed method, we experimented with 500 license plate images. 300 sheets were used for learning and 200 sheets were used for testing. As a result of computer simulation, it was the best when using ResNet50, and 95.77% accuracy was obtained.

The Error Pattern Analysis of the HMM-Based Automatic Phoneme Segmentation (HMM기반 자동음소분할기의 음소분할 오류 유형 분석)

  • Kim Min-Je;Lee Jung-Chul;Kim Jong-Jin
    • The Journal of the Acoustical Society of Korea
    • /
    • v.25 no.5
    • /
    • pp.213-221
    • /
    • 2006
  • Phone segmentation of speech waveform is especially important for concatenative text to speech synthesis which uses segmented corpora for the construction of synthetic units. because the quality of synthesized speech depends critically on the accuracy of the segmentation. In the beginning. the phone segmentation was manually performed. but it brings the huge effort and the large time delay. HMM-based approaches adopted from automatic speech recognition are most widely used for automatic segmentation in speech synthesis, providing a consistent and accurate phone labeling scheme. Even the HMM-based approach has been successful, it may locate a phone boundary at a different position than expected. In this paper. we categorized adjacent phoneme pairs and analyzed the mismatches between hand-labeled transcriptions and HMM-based labels. Then we described the dominant error patterns that must be improved for the speech synthesis. For the experiment. hand labeled standard Korean speech DB from ETRI was used as a reference DB. Time difference larger than 20ms between hand-labeled phoneme boundary and auto-aligned boundary is treated as an automatic segmentation error. Our experimental results from female speaker revealed that plosive-vowel, affricate-vowel and vowel-liquid pairs showed high accuracies, 99%, 99.5% and 99% respectively. But stop-nasal, stop-liquid and nasal-liquid pairs showed very low accuracies, 45%, 50% and 55%. And these from male speaker revealed similar tendency.

Block Classification of Document Images by Block Attributes and Texture Features (블록의 속성과 질감특징을 이용한 문서영상의 블록분류)

  • Jang, Young-Nae;Kim, Joong-Soo;Lee, Cheol-Hee
    • Journal of Korea Multimedia Society
    • /
    • v.10 no.7
    • /
    • pp.856-868
    • /
    • 2007
  • We propose an effective method for block classification in a document image. The gray level document image is converted to the binary image for a block segmentation. This binary image would be smoothed to find the locations and sizes of each block. And especially during this smoothing, the inner block heights of each block are obtained. The gray level image is divided to several blocks by these location informations. The SGLDM(spatial gray level dependence matrices) are made using the each gray-level document block and the seven second-order statistical texture features are extracted from the (0,1) direction's SGLDM which include the document attributes. Document image blocks are classified to two groups, text and non-text group, by the inner block height of the block at the nearest neighbor rule. The seven texture features(that were extracted from the SGLDM) are used for the five detail categories of small font, large font, table, graphic and photo blocks. These document blocks are available not only for structure analysis of document recognition but also the various applied area.

  • PDF

Development an Android based OCR Application for Hangul Food Menu (한글 음식 메뉴 인식을 위한 OCR 기반 어플리케이션 개발)

  • Lee, Gyu-Cheol;Yoo, Jisang
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.5
    • /
    • pp.951-959
    • /
    • 2017
  • In this paper, we design and implement an Android-based Hangul food menu recognition application that recognizes characters from images captured by a smart phone. Optical Character Recognition (OCR) technology is divided into preprocessing, recognition and post-processing. In the preprocessing process, the characters are extracted using Maximally Stable Extremal Regions (MSER). In recognition process, Tesseract-OCR, a free OCR engine, is used to recognize characters. In the post-processing process, the wrong result is corrected by using the dictionary DB for the food menu. In order to evaluate the performance of the proposed method, experiments were conducted to compare the recognition performance using the actual menu plate as the DB. The recognition rate measurement experiment with OCR Instantly Free, Text Scanner and Text Fairy, which is a character recognizing application in Google Play Store, was conducted. The experimental results show that the proposed method shows an average recognition rate of 14.1% higher than other techniques.