• Title/Summary/Keyword: Word segmentation

A Study on Consonant/Vowel/Unvoiced Consonant Phonetic Value Segmentation and Recognition of Korean Isolated Word Speech (한국어 고립 단어 음성의 자음/모음/유성자음 음가 분할 및 인식에 관한 연구)

  • Lee, Jun-Hwan; Lee, Sang-Beom
    • The Transactions of the Korea Information Processing Society / v.7 no.6 / pp.1964-1972 / 2000
  • Because of its own acoustic properties, Korean speech is realized as phonetic values that differ from the underlying phonemes. An extended recognition system for understanding Korean should therefore be built on a study of a Korean rule-based system before that system can be used for post-processing in Korean recognition. In this paper, a text-based Korean rule-based system that implements the sound-change rules peculiar to Korean is constructed. Based on the text-based phonetic value output of this system, preliminary phonetic value segmentation border points with non-uniform blocks are extracted from Korean isolated word speech. By merging and recognizing the non-uniform blocks between the extracted border points, the feasibility of recognizing Korean speech in the form of phonetic values is investigated.

Performance of Pseudomorpheme-Based Speech Recognition Units Obtained by Unsupervised Segmentation and Merging (비교사 분할 및 병합으로 구한 의사형태소 음성인식 단위의 성능)

  • Bang, Jeong-Uk; Kwon, Oh-Wook
    • Phonetics and Speech Sciences / v.6 no.3 / pp.155-164 / 2014
  • This paper proposes a new method to determine the recognition units for large vocabulary continuous speech recognition (LVCSR) in Korean by applying unsupervised segmentation and merging. In the proposed method, a text sentence is segmented into morphemes and position information is added to the morphemes. Then, submorpheme units are obtained by splitting the morpheme units through the maximization of posterior probability terms. The posterior probability terms are computed from the morpheme frequency distribution, the morpheme length distribution, and the morpheme frequency-of-frequency distribution. Finally, the recognition units are obtained by sequentially merging the submorpheme pair with the highest frequency. Computer experiments are conducted using a Korean LVCSR system with a 100k-word vocabulary and a trigram language model obtained from a 300-million-eojeol (word phrase) corpus. The proposed method is shown to reduce the out-of-vocabulary rate to 1.8% and to reduce the syllable error rate by a relative 14.0%.
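
The final merging step described in this abstract can be illustrated with a minimal sketch. It assumes each sentence is already a list of submorpheme strings and performs a single merge of the most frequent adjacent pair. The function name `merge_most_frequent_pair` and the toy units are hypothetical, and the posterior-probability splitting step and position information from the paper are not modelled here.

```python
from collections import Counter


def merge_most_frequent_pair(sequences):
    """Merge the most frequent adjacent unit pair once, across all sequences.

    `sequences` is a list of lists of unit strings (assumed input format).
    Returns the rewritten sequences and the pair that was merged (or None).
    """
    pair_counts = Counter()
    for units in sequences:
        pair_counts.update(zip(units, units[1:]))
    if not pair_counts:
        return sequences, None

    best_pair, _ = pair_counts.most_common(1)[0]

    merged = []
    for units in sequences:
        out, i = [], 0
        while i < len(units):
            # Replace every occurrence of the chosen pair with one joined unit.
            if i + 1 < len(units) and (units[i], units[i + 1]) == best_pair:
                out.append(units[i] + units[i + 1])
                i += 2
            else:
                out.append(units[i])
                i += 1
        merged.append(out)
    return merged, best_pair


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b", "d"]]
    print(merge_most_frequent_pair(toy))
```

Calling the function repeatedly until a target vocabulary size is reached would give the sequential behaviour the abstract mentions; the stopping criterion is left open because the abstract does not state it.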

Automatic Word Spacing for Korean Using CRFs with Korean Features (한국어 특성과 CRFs를 이용한 자동 띄어쓰기 시스템)

  • Lee, Hyun-Woo; Cha, Jeong-Won
    • MALSORI / no.65 / pp.125-141 / 2008
  • In this work, we propose an automatic word spacing system for Korean using conditional random fields (CRFs) with Korean features. We cast the word spacing problem as a classification problem. We first build a basic system that uses CRFs and Eumjeol (syllable) bigrams, and then analyze the results of an inner test. We then extend the basic system with Korean features: Josa, Eomi, and the first two Eumjeols of words extracted from a lexicon. The experimental results show that the proposed method outperforms previous methods. In addition, because the model is very small, the proposed method can be used in mobile and speech applications.
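
Treating word spacing as classification can be sketched as follows: each syllable receives a binary label saying whether a space follows it. This is only an illustration of the problem framing under that assumption. The helper names `spacing_labels` and `apply_labels` are hypothetical, and the CRF model and the Korean features from the paper are not shown.

```python
def spacing_labels(sentence):
    """Turn a correctly spaced sentence into (syllable, label) pairs.

    Label 1 means a space follows the syllable, label 0 means it does not.
    """
    chars = list(sentence.strip())
    pairs = []
    for i, ch in enumerate(chars):
        if ch == " ":
            continue
        followed_by_space = i + 1 < len(chars) and chars[i + 1] == " "
        pairs.append((ch, 1 if followed_by_space else 0))
    return pairs


def apply_labels(syllables, labels):
    """Rebuild a spaced sentence from syllables and predicted labels."""
    out = []
    for syllable, label in zip(syllables, labels):
        out.append(syllable)
        if label == 1:
            out.append(" ")
    return "".join(out).strip()


if __name__ == "__main__":
    pairs = spacing_labels("나는 학교에 간다")
    print(pairs)
    print(apply_labels([s for s, _ in pairs], [l for _, l in pairs]))
```

A sequence classifier would be trained on the label sequence from `spacing_labels` and its predictions passed to `apply_labels`.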

Language-Independent Word Acquisition Method Using a State-Transition Model

  • Xu, Bin; Yamagishi, Naohide; Suzuki, Makoto; Goto, Masayuki
    • Industrial Engineering and Management Systems / v.15 no.3 / pp.224-230 / 2016
  • New words, colloquial language, and abbreviations are used extensively on the Internet. As such, automatically acquiring words for the purpose of analyzing Internet content is very difficult. In a previous study, we proposed a method for Japanese word segmentation using character N-grams. The previously proposed method is based on a simple state-transition model that is established under the assumption that the input document is described based on four states (denoted as A, B, C, and D) specified beforehand: state A represents words (nouns, verbs, etc.); state B represents statement separators (punctuation marks, conjunctions, etc.); state C represents postpositions (namely, words that follow nouns); and state D represents prepositions (namely, words that precede nouns). According to this state-transition model, based on the states applied to each pseudo-word, we search the document from beginning to end for an accessible pattern. In other words, the process of this transition detects some words during the search. In the present paper, we perform experiments based on the proposed word acquisition algorithm using Japanese and Chinese newspaper articles. These articles were obtained from Japan's Kyoto University and the Chinese People's Daily. The proposed method does not depend on the language structure. If text documents are expressed in Unicode, the proposed method can, using the same algorithm, obtain words in Japanese and Chinese, which do not contain spaces between words. Hence, we demonstrate that the proposed method is language independent.
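
The character N-grams mentioned in this abstract can be counted with a few lines of code. The sketch below is a generic illustration, not the authors' algorithm: the function name `char_ngrams` is hypothetical, and the state-transition model itself is not reproduced because the abstract does not specify its transitions. Because a Python `str` is a sequence of Unicode code points, the same function applies unchanged to Japanese and Chinese text.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all overlapping character n-grams in `text`."""
    # range() is empty when the text is shorter than n, so the result is empty.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    print(char_ngrams("東京都東京", 2))
```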

Segmenting and Classifying Korean Words based on Syllables Using Instance-Based Learning (사례기반 학습을 이용한 음절기반 한국어 단어 분리 및 범주 결정)

  • Kim, Jae-Hoon; Lee, Kong-Joo
    • The KIPS Transactions:PartB / v.10B no.1 / pp.47-56 / 2003
  • Korean delimits words by white space, as English does, but words in Korean differ somewhat in structure from those in English. A white-space-delimited unit in English generally consists of one word, whereas in Korean it consists of one or more words and/or morphemes. Because of this difference, the unit between white spaces is called an Eojeol in Korean. We propose a method for segmenting and classifying Korean words and/or morphemes based on syllables using instance-based learning. The feature set for the instance-based learning consists of the previous syllable, the current syllable, the next two syllables, the final consonant of the current syllable, and the two previous categories. Our method achieves an F-measure of more than 97% for word segmentation on the ETRI corpus and the KAIST corpus.
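
The feature set listed in this abstract can be sketched as a small extraction function. This is an illustration under assumptions: the names `final_consonant` and `syllable_features`, the `<PAD>` marker, and the dictionary keys are hypothetical, and the instance-based learner itself is not shown. The final consonant is obtained by Unicode canonical decomposition (NFD), where a trailing conjoining jamo in the range U+11A8 to U+11C2 is a final consonant.

```python
import unicodedata

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def final_consonant(syllable):
    """Return the final-consonant jamo of a Hangul syllable, or "" if none."""
    decomposed = unicodedata.normalize("NFD", syllable)
    last = decomposed[-1] if decomposed else ""
    return last if "\u11a8" <= last <= "\u11c2" else ""


def syllable_features(syllables, i, prev_categories):
    """Build the feature dictionary for position i.

    `prev_categories` holds the categories already assigned to earlier
    syllables, most recent last.
    """
    def at(j):
        return syllables[j] if 0 <= j < len(syllables) else PAD

    cats = [PAD, PAD] + list(prev_categories)
    return {
        "prev": at(i - 1),
        "cur": at(i),
        "next1": at(i + 1),
        "next2": at(i + 2),
        "final": final_consonant(syllables[i]),
        "prev_cat1": cats[-1],
        "prev_cat2": cats[-2],
    }


if __name__ == "__main__":
    print(syllable_features(list("한국어"), 0, []))
```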

Phonological Awareness in Hearing Impaired Children (청각장애아동의 음운인식능력에 대한 연구)

  • Park, Sang-Hee; Seok, Dong-Il; Jeong, Ok-Ran
    • Speech Sciences / v.9 no.2 / pp.193-202 / 2002
  • The purpose of this study is to examine the phonological awareness of hearing impaired children. A number of studies indicate that hearing impaired children have articulation disorders due to their impaired auditory feedback. However, children who are able to distinguish certain phonemes sometimes still misarticulate those phonemes. Phonological awareness refers to recognizing the speech-sound units and their forms in spoken language (Hong, 2001). The subjects who participated in the experiment were four hearing impaired children (3 children with cochlear implants and 1 child with a hearing aid). Phonological awareness was evaluated with the test battery developed by Paik et al. (2001). The subtests consisted of rhyme matching, onset matching I and II, and word initial segmentation and matching I and II. If a child asked for an item to be repeated, it was repeated up to a maximum of 4 times. Each item was scored as 1 point. The results were compared to those of Paik et al. (2001). In rhyme matching, subject 1 showed superior ability, subjects 2 and 3 fair ability, and subject 4 inferior ability. In onset matching I, all subjects showed inferior ability except for subject 3. Interestingly, subject 1 showed the lowest onset matching I score. In word initial segmentation and matching I, subjects 1 and 4 showed inferior ability and subjects 2 and 3 showed fair ability. In onset matching II, subject 2 achieved the perfect score of 10 even though she had shown a very low score. In word initial segmentation and matching II, only subjects 2 and 3 showed appropriate levels of the skill. The results show that the phonological awareness of hearing impaired children is different from that of normal children.

An Approach to Segmentation of Address Strings of unconstrained handwritten Hangul using Run-Length Code (Run-Length code를 이용한 제약없이 쓰여진 한글 필기체 주소열 분할)

  • Kim, Gyeonghwan; Yoon, Jason-J
    • Journal of KIISE:Software and Applications / v.28 no.11 / pp.813-821 / 2001
  • While recognition of isolated units of writing, such as a character or a word, has been extensively studied, emphasis on the segmentation itself has been lacking. In this paper, we propose an active segmentation method for handwritten Hangul address strings based on the run-length code. A slant correction algorithm, considered an important preprocessing step for the segmentation, is presented. Three fundamental candidate estimation functions are introduced to detect clues to touching points, and touching types are classified according to the structural peculiarities of Hangul. Our experiments show a segmentation performance of 88.2% on touching characters with minimal over-segmentation.
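
As a reminder of what a run-length code represents, the sketch below encodes one row of a binarized image as runs of foreground pixels. It is a generic illustration assuming a row given as a list of 0/1 values. The function name `run_length_encode` and the `(start, length)` layout are assumptions, and the paper's candidate estimation functions and slant correction are not reproduced.

```python
def run_length_encode(row):
    """Encode a binary pixel row as (start, length) runs of foreground pixels."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x  # a foreground run begins here
        elif not pixel and start is not None:
            runs.append((start, x - start))  # the run ended at x - 1
            start = None
    if start is not None:
        runs.append((start, len(row) - start))  # run reaches the row end
    return runs


if __name__ == "__main__":
    print(run_length_encode([0, 1, 1, 0, 1]))
```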

Effects of Dining-out Motives and Attribute Evaluation of Restaurants on the Intention of Word of Mouth and Reusing (외식 동기와 레스토랑 속성 평가가 구전 및 재이용 의도에 미치는 영향)

  • Kim, Seog-Jun; Cho, Yong-Bum
    • Culinary science and hospitality research / v.12 no.3 s.30 / pp.61-74 / 2006
  • The objectives of this study are to examine how dining-out motives, restaurant attribute evaluation, and the intention of word of mouth and reusing influence one another, by determining appropriate measurement standards for them, and to derive an effective restaurant marketing strategy from the analytical results by patron and market segment. The study surveyed 321 subjects and processed the results using SPSS for Windows V. 12.1. For statistical analysis, frequency analysis, factor analysis, and regression were performed. The results showed that dining-out motives and restaurant attribute evaluation had positive effects on the intention of word of mouth and reusing. Furthermore, restaurant owners or managers will need to understand the existing market and target markets more objectively and, through this market segmentation, formulate a marketing strategy that is appropriate for the diverse desires of customers and the different characteristics of restaurants.

A study on character segmentation and determination of linguistic type for recognition of on-line cursive characters (온라인 연속 필기 문자의 인식을 위한 문자간 구분 및 종류의 결정에 관한 연구)

  • 박강령; 전병환; 김창수; 김우성; 김재희
    • Journal of the Korean Institute of Telematics and Electronics C / v.34C no.7 / pp.61-69 / 1997
  • With vigorous research in character recognition, the need to recognize run-on multilingual handwritten characters is increasing in order to provide users with more comfortable PUI (pen user interface) environments. In general, many intermediate word candidates are generated in run-on multilingual recognition because there is no information about the ending position and linguistic type of each character. To remove the unnecessary word candidates, we classify them into two groups and, using 5 attributes, select the best candidate among the word candidates in the group where the final character is completed. The method we propose for selecting the best candidate is called WRM (weighted ranking method). The weights are adaptively trained by the LMS (least mean square) learning rule. Results show that decisions made using weights are much better than those made without weights.
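
The LMS learning rule named in this abstract updates each weight in proportion to the prediction error and the corresponding input. The sketch below shows the standard LMS update together with a weighted score for ranking candidates. It is a generic illustration: the function names, the learning rate, and the attribute values are hypothetical, and the paper's 5 attributes and training targets are not specified in the abstract.

```python
def weighted_score(weights, attributes):
    """Weighted sum of a candidate's attribute values."""
    return sum(w * x for w, x in zip(weights, attributes))


def lms_update(weights, attributes, target, learning_rate=0.1):
    """One standard LMS step: w <- w + rate * (target - w.x) * x."""
    error = target - weighted_score(weights, attributes)
    return [w + learning_rate * error * x for w, x in zip(weights, attributes)]


def best_candidate(weights, candidates):
    """Return the index of the candidate with the highest weighted score."""
    scores = [weighted_score(weights, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    w = lms_update([0.0, 0.0], [1.0, 2.0], target=1.0)
    print(w)
    print(best_candidate(w, [[1.0, 0.0], [0.0, 1.0]]))
```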