• 제목/요약/키워드: Word segmentation

검색결과 135건 처리시간 0.017초

A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

  • Khan, Yunus;Nagar, Chetan;Kaushal, Devendra S.
    • International Journal of Ocean System Engineering
    • /
    • 제2권1호
    • /
    • pp.24-31
    • /
    • 2012
  • This paper introduce a technique for Word segmentation for the handwritten recognition of Urdu script. Word segmentation or word tokenization is a primary technique for understanding the sentences written in Urdu language. Several techniques are available for word segmentation in other languages but not much work has been done for word segmentation of Urdu Optical Character Recognition (OCR) System. A method is proposed for word segmentation in this paper. It finds the boundaries of words in a sequence of ligatures using probabilistic formulas, by utilizing the knowledge of collocation of ligatures and words in the corpus. The word identification rate using this technique is 97.10% with 66.63% unknown words identification rate.

분할기반 은닉 마르코프 모델과 다층 퍼셉트론 결합 영문수표필기단어 인식시스템 (A Segmentation-Based HMM and MLP Hybrid Classifier for English Legal Word Recognition)

  • 김계경;김진호;박희주
    • 한국지능시스템학회논문지
    • /
    • 제11권3호
    • /
    • pp.200-207
    • /
    • 2001
  • 본 논문에서는 분할기반 은닉 마르코프 모델(segmentation based hidden Markov model)과 다층 퍼셉트론 (multi-layer perceptron)을 결합한 영문수표 필기단어 (legal word) 인식시스템을 제안하였다. 가변길이의 필기체 영문 단어 분할결과를 인식할 수 있도록 은닉 마르코프 모델을 이용하여 명확한 분할기반 (explicit segmentation-based) 단어단위 (word level) 인식기를 구현하고 다층 퍼셉트론을 이용하여 내재적 분할기반 (implicit segmentation-based) 단어단위 인식기를 구현하였다. 그리고 이종(heterogeneous)의 두 인식기를 새로운 결합 확률추정방식에 따라 결합함으로서 상호 보완 능력을 극대화시킬 수 있는 영문수표 필기단어 인식시스템을 구현하였다. 제안한 시스템을 캐나다 콘코디아 대학의 CENPARMI 영문 수표 데이터베이스에 적용하여 실험해 본 결과 기존의 연구결과에 비해 비교적 우수한 인식성능을 얻을 수 있었다.

  • PDF

영어의 강음절(강세 음절)과 한국어 화자의 단어 분절 (Strong (stressed) syllables in English and lexical segmentation by Koreans)

  • 김선미;남기춘
    • 말소리와 음성과학
    • /
    • 제3권1호
    • /
    • pp.3-14
    • /
    • 2011
  • It has been posited that in English, native listeners use the Metrical Segmentation Strategy (MSS) for the segmentation of continuous speech. Strong syllables tend to be perceived as potential word onsets for English native speakers, which is due to the high proportion of strong syllables word-initially in the English vocabulary. This study investigates whether Koreans employ the same strategy when segmenting speech input in English. Word-spotting experiments were conducted using vowel-initial and consonant-initial bisyllabic targets embedded in nonsense trisyllables in Experiment 1 and 2, respectively. The effect of strong syllable was significant in the RT (reaction times) analysis but not in the error analysis. In both experiments, Korean listeners detected words more slowly when the word-initial syllable is strong (stressed) than when it is weak (unstressed). However, the error analysis showed that there was no effect of initial stress in Experiment 1 and in the item (F2) analysis in Experiment 2. Only the subject (F1) analysis in Experiment 2 showed that the participants made more errors when the word starts with a strong syllable. These findings suggest that Koran listeners do not use the Metrical Segmentation Strategy for segmenting English speech. They do not treat strong syllables as word beginnings, but rather have difficulties recognizing words when the word starts with a strong syllable. These results are discussed in terms of intonational properties of Korean prosodic phrases which are found to serve as lexical segmentation cues in the Korean language.

  • PDF

The Role of Prosodic Boundary Cues in Word Segmentation in Korean

  • Kim, Sa-Hyang
    • 음성과학
    • /
    • 제13권1호
    • /
    • pp.29-41
    • /
    • 2006
  • This study investigates the degree to which various prosodic cues at the boundaries of prosodic phrases in Korean contribute to word segmentation. Since most phonological words in Korean are produced as one Accentual Phrase (AP), it was hypothesized that the detection of acoustic cues at AP boundaries would facilitate word segmentation. The prosodic characteristics of Korean APs include initial strengthening at the beginning of the phrase and pitch rise and final lengthening at the end. A perception experiment utilizing an artificial language learning paradigm revealed that cues conforming to the aforementioned prosodic characteristics of Korean facilitated listeners' word segmentation. Results also indicated that duration and amplitude cues were more helpful in segmentation than pitch. Nevertheless, results did show that a pitch cue that did not conform to the Korean AP interfered with segmentation.

  • PDF

마코프 체인 밀 음절 N-그램을 이용한 한국어 띄어쓰기 및 복합명사 분리 (Korean Word Segmentation and Compound-noun Decomposition Using Markov Chain and Syllable N-gram)

  • 권오욱
    • 한국음향학회지
    • /
    • 제21권3호
    • /
    • pp.274-284
    • /
    • 2002
  • 한국어 대어휘 연속음성인식을 위한 텍스트 전처리에서 띄어쓰기 오류는 잘못된 단어를 인식 어휘에 포함시켜 언어모델의 성능을 저하시킨다. 본 논문에서는 텍스트 코퍼스의 띄어쓰기 교정을 위하여 한국어 음절 N-그램을 이용한 자동 띄어쓰기 알고리듬을 제시한다. 제시된 알고리듬에서는 주어진 입력음절열은 좌에서 우로의 천이만을 갖는 마코프 체인으로 표시되고 어떤 상태에서 같은 상태로의 천이에서 공백음절이 발생하며 다른 상태로의 천이에서는 주어진 음절이 발생한다고 가정한다. 마코프 체인에서 음절 단위 N-그램 언어모델에 의한 문장 확률이 가장 높은 경로를 찾음으로써 띄어쓰기 결과를 얻는다. 모든 공백을 삭제한 254문장으로 이루어진 신문 칼럼 말뭉치에 대하여 띄어쓰기 알고리듬을 적용한 결과 91.58%의 어절단위 정확도 및 96.69%의 음절 정확도를 나타내었다. 띄어쓰기 알고리듬을 응용한 줄바꿈에서의 공백 오류 처리에서 이 알고리듬은 91.00%에서 96.27%로 어절 정확도를 향상시켰으며, 복합명사 분리에서는 96.22%의 분리 정확도를 보였다.

Ternary Decomposition and Dictionary Extension for Khmer Word Segmentation

  • Sung, Thaileang;Hwang, Insoo
    • Journal of Information Technology Applications and Management
    • /
    • 제23권2호
    • /
    • pp.11-28
    • /
    • 2016
  • In this paper, we proposed a dictionary extension and a ternary decomposition technique to improve the effectiveness of Khmer word segmentation. Most word segmentation approaches depend on a dictionary. However, the dictionary being used is not fully reliable and cannot cover all the words of the Khmer language. This causes an issue of unknown words or out-of-vocabulary words. Our approach is to extend the original dictionary to be more reliable with new words. In addition, we use ternary decomposition for the segmentation process. In this research, we also introduced the invisible space of the Khmer Unicode (char\u200B) in order to segment our training corpus. With our segmentation algorithm, based on ternary decomposition and invisible space, we can extract new words from our training text and then input the new words into the dictionary. We used an extended wordlist and a segmentation algorithm regardless of the invisible space to test an unannotated text. Our results remarkably outperformed other approaches. We have achieved 88.8%, 91.8% and 90.6% rates of precision, recall and F-measurement.

음운 구조가 한국어 단어 분절에 미치는 영향 (The role of prosodic phrasing in Korean word segmentation)

  • 김사향
    • 대한음성학회:학술대회논문집
    • /
    • 대한음성학회 2007년도 한국음성과학회 공동학술대회 발표논문집
    • /
    • pp.114-118
    • /
    • 2007
  • The current study investigates the degree to which various prosodic cues at the boundaries of a prosodic phrase in Korean (Accentual Phrase) contributed to word segmentation. Since most phonological words in Korean are produced as one AP, it was hypothesized that the detection of acoustic cues at AP boundaries would facilitate word segmentation. The prosodic characteristics of Korean APs include initial strengthening at the beginning of the phrase and pitch rise and final lengthening at the end. A perception experiment revealed that the cues that conform to the above-mentioned prosodic characteristics of Korean facilitated listeners' word segmentation. Results also showed that duration and amplitude cues were more helpful in segmentation than pitch. Further, the results showed that a pitch cue that did not conform to the Korean AP interfered with segmentation.

  • PDF

Sub-word Based Offline Handwritten Farsi Word Recognition Using Recurrent Neural Network

  • Ghadikolaie, Mohammad Fazel Younessy;Kabir, Ehsanolah;Razzazi, Farbod
    • ETRI Journal
    • /
    • 제38권4호
    • /
    • pp.703-713
    • /
    • 2016
  • In this paper, we present a segmentation-based method for offline Farsi handwritten word recognition. Although most segmentation-based systems suffer from segmentation errors within the first stages of recognition, using the inherent features of the Farsi writing script, we have segmented the words into sub-words. Instead of using a single complex classifier with many (N) output classes, we have created N simple recurrent neural network classifiers, each having only true/false outputs with the ability to recognize sub-words. Through the extraction of the number of sub-words in each word, and labeling the position of each sub-word (beginning/middle/end), many of the sub-word classifiers can be pruned, and a few remaining sub-word classifiers can be evaluated during the sub-word recognition stage. The candidate sub-words are then joined together and the closest word from the lexicon is chosen. The proposed method was evaluated using the Iranshahr database, which consists of 17,000 samples of Iranian handwritten city names. The results show the high recognition accuracy of the proposed method.

확률 기반 미등록 단어 분리 및 태깅 (Probabilistic Segmentation and Tagging of Unknown Words)

  • 김보겸;이재성
    • 정보과학회 논문지
    • /
    • 제43권4호
    • /
    • pp.430-436
    • /
    • 2016
  • 형태소 분석시 나타나는 고유명사나 신조어 등의 미등록어에 대한 처리는 다양한 도메인의 문서 처리에 필수적이다. 이 논문에서는 3단계 확률 기반 형태소 분석에서 미등록어를 분리하고 태깅하기 위한 방법을 제시한다. 이 방법은 고유명사나 일반명사와 같은 개방어 뒤에 붙는 다양한 접미사를 분석하여 미등록 개방어를 추정할 수 있도록 했다. 이를 위해 형태소 품사 부착 말뭉치에서 자동으로 접미사 패턴을 학습하고, 확률 기반 형태소 분석에 맞도록 미등록 개방어의 분리 및 태깅 확률을 계산하는 방법을 제시하였다. 실험 결과, 제안한 방법은 새로운 미등록 용어가 많이 나오는 문서에서 미등록어 처리 성능을 크게 향상시켰다.

The Role of Post-lexical Intonational Patterns in Korean Word Segmentation

  • Kim, Sa-Hyang
    • 음성과학
    • /
    • 제14권1호
    • /
    • pp.37-62
    • /
    • 2007
  • The current study examines the role of post-lexical tonal patterns of a prosodic phrase in word segmentation. In a word spotting experiment, native Korean listeners were asked to spot a disyllabic or trisyllabic word from twelve syllable speech stream that was composed of three Accentual Phrases (AP). Words occurred with various post-lexical intonation patterns. The results showed that listeners spotted more words in phrase-initial than in phrase-medial position, suggesting that the AP-final H tone from the preceding AP helped listeners to segment the phrase-initial word in the target AP. Results also showed that listeners' error rates were significantly lower when words occurred with initial rising tonal pattern, which is the most frequent intonational pattern imposed upon multisyllabic words in Korean, than with non-rising patterns. This result was observed both in AP-initial and in AP-medial positions, regardless of the frequency and legality of overall AP tonal patterns. Tonal cues other than initial rising tone did not positively influence the error rate. These results not only indicate that rising tone in AP-initial and AP_final position is a reliable cue for word boundary detection for Korean listeners, but further suggest that phrasal intonation contours serve as a possible word boundary cue in languages without lexical prominence.

  • PDF