• Title/Summary/Keyword: data dictionary


A Method to Classify and Recognize Spelling Changes between Morphemes of a Korean Word (한국어 어절의 철자변화 현상 분류와 인식 방법)

  • 김덕봉
    • Journal of KIISE:Software and Applications / v.30 no.5_6 / pp.476-486 / 2003
  • Part-of-speech tagged corpora of Korean carry no explicit spelling change information. This makes it difficult to acquire data for studying Korean morphology, namely to automatically construct a dictionary for morphological analysis and to systematically collect spelling change phenomena from the corpora. To solve this problem, this paper presents a method that recognizes spelling changes between the morphemes of a Korean word in tagged corpora using only string matching, without a dictionary or phonological rules. Because it uses no phonological rules, the method recognizes spelling changes robustly and can be implemented at low cost. In an experiment on a large tagged corpus of Korean, the method accurately recognized 100% of the spelling changes in the corpus.
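
The string-matching idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes that a word counts as changed when the concatenation of its tagged morphemes differs from the surface form, and it compares the two strings at the jamo level via Unicode NFD decomposition. The function name `find_spelling_change` is hypothetical.

```python
import unicodedata

def find_spelling_change(surface, morphemes):
    """Return (expected, observed) jamo strings where the surface word
    differs from the concatenated morphemes, or None if they match.
    Illustrative assumption: plain string comparison, no dictionary, no rules."""
    # Decompose Hangul syllables into conjoining jamo so that changes
    # inside a syllable become visible to character-level matching.
    s = unicodedata.normalize("NFD", surface)
    m = unicodedata.normalize("NFD", "".join(morphemes))
    if s == m:
        return None
    # Length of the longest common prefix.
    p = 0
    while p < min(len(s), len(m)) and s[p] == m[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    q = 0
    while q < min(len(s), len(m)) - p and s[len(s) - 1 - q] == m[len(m) - 1 - q]:
        q += 1
    # What the morphemes predict versus what the surface form contains.
    return m[p:len(m) - q], s[p:len(s) - q]

# No change: the morphemes concatenate to the surface form.
print(find_spelling_change("먹었다", ["먹", "었", "다"]))
# Irregular conjugation: the surface form differs from the concatenation.
print(find_spelling_change("도와", ["돕", "아"]))
```

Comparing at the jamo level rather than the syllable level is a design choice of this sketch; it keeps the reported span as small as the string comparison allows.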

An Improved Homonym Disambiguation Model based on Bayes Theory (Bayes 정리에 기반한 개선된 동형이의어 분별 모델)

  • 김창환;이왕우
    • Journal of the Korea Computer Industry Society / v.2 no.12 / pp.1581-1590 / 2001
  • This paper proposes a word sense disambiguation (WSD) model that extends the WSD model of J. Hur (2000): an improved statistical homonym disambiguation model based on Bayes' theorem. The model uses semantic information (co-occurrence data) obtained from the definitions in the part-of-speech (POS) tagged UMRD-S (Ulsan University Machine Readable Dictionary, Semantic Tagged). We extracted nouns, predicates, and adverbs from the definitions in the Korean dictionary as contextual semantic features. We measured the accuracy of the WSD system on nine major homonym nouns and, additionally, seven new homonym predicates. The internal experiment showed an average accuracy of 98.32% for the nine homonym nouns and 99.53% for the seven homonym predicates. In addition, we tested the system on the Korean Information Base and ETRI's POS-tagged corpus. This external experiment showed an average accuracy of 84.42% for the nine nouns on sentences not used in training from the Korean Information Base and the ETRI corpus, and 70.81% for the seven predicates on the Sejong Project phrase-level POS-tagged corpus (3.5 million phrases).
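
A Bayes-based disambiguator of the kind this abstract describes can be sketched as follows. This is a generic naive Bayes classifier with add-one smoothing, not the paper's model; the toy senses, the feature words, and the names `train` and `disambiguate` are assumptions made for illustration. The per-sense word lists stand in for the co-occurrence data extracted from dictionary definitions.

```python
import math
from collections import Counter

def train(examples):
    """examples: iterable of (sense, context_words) pairs."""
    sense_counts = Counter()
    word_counts = {}
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts.setdefault(sense, Counter()).update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def disambiguate(context, model):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a sense.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hypothetical toy data: two senses of the English homonym "bank".
EXAMPLES = [
    ("bank/finance", ["money", "loan", "deposit"]),
    ("bank/river", ["river", "water", "shore"]),
]
model = train(EXAMPLES)
print(disambiguate(["loan", "money"], model))
```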


Implementation of Iconic Language for the Language Support System of the Language Disorders (언어 장애인의 언어보조 시스템을 위한 아이콘 언어의 구현)

  • Choo Kyo-Nam;Woo Yo-Seob;Min Hong-Ki
    • The KIPS Transactions:PartB / v.13B no.4 s.107 / pp.479-488 / 2006
  • The iconic language interface is designed to provide a more convenient communication environment for the target system than a keyboard-based interface. To this end, vocabulary tendencies and features are analyzed in conversation corpora built from the corresponding high-use domains, and the meaning and vocabulary system of the iconic language is constructed by applying natural language processing methods such as morphological, syntactic, and semantic analysis. The parts of speech and grammatical rules of the iconic language are defined so that each icon corresponds to the vocabulary and meaning of the Korean language and so that users can communicate through icon sequences. To resolve linguistic ambiguity that may occur in the iconic language and to process semantics effectively, situation-focused semantic data for the iconic language are constructed from a general-purpose Korean semantic dictionary and a subcategorization dictionary. Based on these, Korean language generation from the iconic interface in the semantic domain is proposed.

Retrieving Minority Product Reviews Using Positive/Negative Skewness (긍정/부정 비대칭도를 이용한 소수상품평의 검색)

  • Cho, Heeryon;Lee, Jong-Seok
    • KIPS Transactions on Software and Data Engineering / v.4 no.3 / pp.121-128 / 2015
  • A given product's online product reviews build up to form largely positive or negative reviews or mixed reviews that include both the positive and negative reviews. While the homogeneously positive or negative reviews help readers identify the generally praised or criticized product, the mixed reviews with minority opinions potentially contain valuable information about the product. We present a method of retrieving minority opinions from the online product reviews using the skewness of positive/negative reviews. The proposed method first classifies the positive/negative product reviews using a sentiment dictionary and then calculates the skewness of the classified results to identify minority reviews. Minority review retrieval experiments were conducted on smartphone and movie reviews, and the F1-measures were 24.6% (smartphone) and 15.9% (movie) and the accuracies were 56.8% and 46.8% when the individual reviews' sentiment classification accuracies were 85.3% and 78.8%. The theoretical performance of minority review retrieval is also discussed.
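
The two-step procedure in this abstract (dictionary-based sentiment classification, then skewness of the classified results) can be sketched roughly as below. The abstract does not give the paper's exact skewness formula or decision rule, so this sketch assumes the standard moment-based sample skewness over +1/-1 labels and treats the label on the tail side as the minority. The word lists and function names are hypothetical.

```python
def classify(review, positive, negative):
    """Label a review +1 or -1 by counting lexicon hits (ties count as +1)."""
    tokens = review.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return 1 if score >= 0 else -1

def skewness(values):
    """Moment-based sample skewness m3 / m2**1.5 (0.0 if there is no variance)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def minority_reviews(reviews, positive, negative):
    """Return the reviews whose label lies on the tail side of the distribution."""
    labels = [classify(r, positive, negative) for r in reviews]
    g = skewness(labels)
    if g == 0:
        return []  # balanced or unanimous: no minority under this rule
    minority_label = -1 if g < 0 else 1
    return [r for r, label in zip(reviews, labels) if label == minority_label]

# Hypothetical toy lexicon and reviews.
POSITIVE = {"great", "good"}
NEGATIVE = {"bad"}
REVIEWS = ["great phone", "good screen", "great battery", "good camera", "bad speaker"]
print(minority_reviews(REVIEWS, POSITIVE, NEGATIVE))
```

With four positive labels and one negative label the skewness is negative, so the single negative review is returned as the minority opinion.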

KONG-DB: Korean Novel Geo-name DB & Search and Visualization System Using Dictionary from the Web (KONG-DB: 웹 상의 어휘 사전을 활용한 한국 소설 지명 DB, 검색 및 시각화 시스템)

  • Park, Sung Hee
    • Journal of the Korean Society for Information Management / v.33 no.3 / pp.321-343 / 2016
  • This study aimed to design a semi-automatic web-based pilot system 1) to build a Korean novel geo-name database, 2) to update the database using automatic geo-name extraction so that it scales, and 3) to retrieve and visualize the usage of old geo-names on a map. In particular, extracting geo-names from novels is difficult when the names are now obsolete, because obtaining a corpus to use as a training dataset is burdensome. To build a training corpus, an admin tool (an HTML crawler and parser written in Python) crawled geo-names and their usages from a vocabulary dictionary for the Korean New Novel, collecting enough to train a named entity tagger that extracts even novel geo-names that do not appear in the training corpus. With the training corpus and the automatic extraction tool, the geo-name database was made scalable. In addition, the system can visualize geo-names on a map. The study also designed and implemented the prototype and empirically verified the validity of the pilot system. Lastly, items to be improved are discussed.

Statistical Ranking Recommendation System of Hangul-to-Roman Conversion for Korean Names (한글-로마자 인명 변환의 통계적 순위 추천 시스템)

  • Lee, Jung-Hun;Kim, Minho;Kwon, Hyuk-Chul
    • Journal of KIISE / v.44 no.12 / pp.1269-1274 / 2017
  • This paper focuses on the Hangul-to-Roman conversion of Korean names. The proposed method recognizes existing notations and provides results ranked by frequency of use. There are two main reasons for the diversity in Hangul-to-Roman name conversion. The first is the indiscriminate use of the various notations devised domestically and overseas. The second is the customary use of notations already in circulation. For these reasons, a single Korean name can be written with various Roman spellings. The system builds a statistical dictionary from the data of 4 million people. In the first step, the person's name is identified by matching the last name. In the second step, the first name is looked up and converted using the statistical dictionary. In the last step, the syllables in the name are compared and converted, and the results are ranked by frequency of use. The paper compared the system's performance with that of existing service systems on the web, and the results showed somewhat higher performance than the other systems.
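
The frequency-ranked lookup at the core of this abstract can be sketched as a small statistical dictionary. This covers only the ranking step, not the paper's three-step pipeline; the sample pairs and the names `build_dictionary` and `rank_romanizations` are hypothetical.

```python
from collections import Counter, defaultdict

def build_dictionary(pairs):
    """Count how often each Hangul form was written with each Roman spelling."""
    table = defaultdict(Counter)
    for hangul, roman in pairs:
        table[hangul][roman] += 1
    return table

def rank_romanizations(table, hangul):
    """Return the known Roman spellings of `hangul`, most frequent first."""
    counts = table.get(hangul, Counter())
    return [roman for roman, _ in counts.most_common()]

# Hypothetical observations of (Hangul surname, Roman spelling).
PAIRS = [
    ("이", "Lee"), ("이", "Lee"), ("이", "Lee"),
    ("이", "Yi"), ("이", "Yi"),
    ("이", "Rhee"),
]
table = build_dictionary(PAIRS)
print(rank_romanizations(table, "이"))
```

The same table structure could be keyed by full given names or by single syllables; which granularity the paper uses at each step is described only at the level of the abstract.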

Design and Implementation of OCR Correction Model for Numeric Digits based on a Context Sensitive and Multiple Streams (제한적 문맥 인식과 다중 스트림을 기반으로 한 숫자 정정 OCR 모델의 설계 및 구현)

  • Shin, Hyun-Kyung
    • The KIPS Transactions:PartD / v.18D no.1 / pp.67-80 / 2011
  • In an automated business document processing system that maintains financial data, errors in query-based retrieval of numbers are critical to the overall performance and usability of the system. Automatic spelling correction methods have emerged and have played an important role in the development of information retrieval systems. However, the scope of these methods has been limited to symbols, for example alphabetic letter strings, that can be stored in the form of trainable templates or a custom dictionary. Numbers, as sequences of digits, cannot be stored in a dictionary; they are pure Markov sequences. In this paper we propose a new OCR model for correcting numbers that uses multiple streams and context-based correction on top of a probabilistic information retrieval framework. We implemented the proposed error correction model as a sub-module and integrated it into an existing automated invoice document processing system. We also present comparative test results indicating that our model significantly enhances the overall precision of the system.

Hyperspectral Image Classification via Joint Sparse representation of Multi-layer Superpixels

  • Sima, Haifeng;Mi, Aizhong;Han, Xue;Du, Shouheng;Wang, Zhiheng;Wang, Jianfang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.10 / pp.5015-5038 / 2018
  • This paper proposes a novel spectral-spatial joint sparse representation algorithm for hyperspectral image (HSI) classification based on multi-layer superpixels at various scales. Superpixels at various scales provide complete yet redundant correlated information about the class attribute of a test pixel. We therefore design a joint sparse model for a test pixel by sampling similar pixels from its corresponding superpixel combinations. First, multi-layer superpixels are extracted from the false color image of the HSI data obtained with a principal component analysis model. Second, a group of discriminative sampled pixels is used as the reconstruction matrix of the test pixel, which can be jointly represented by the structured dictionary and the recovered sparse coefficients. Third, the orthogonal matching pursuit strategy is employed to estimate the sparse vector for the test pixel; in each iteration, the approximation is computed from the dictionary and the corresponding sparse vector. Finally, the class label of the test pixel is determined directly by the minimum reconstruction error between the reconstruction matrix and its approximation. The advantages of the algorithm are that it develops complete neighborhoods and homogeneous pixels that share a common sparsity pattern and that it achieves more flexible joint sparse coding of spectral-spatial information. Experimental results on three real hyperspectral datasets show that the proposed joint sparse model achieves better performance than a series of strong sparse classification methods and superpixel-based classification methods.
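
The last two steps of this abstract (orthogonal matching pursuit, then labeling by minimum reconstruction error) can be illustrated with a simplified single-pixel sketch. It is not the paper's joint model over superpixel samples: it assumes one sub-dictionary per class with unit-norm columns, runs textbook OMP on a single vector, and picks the class with the smallest residual. It depends on NumPy, and the names `omp` and `classify` are hypothetical.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k columns of D."""
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:
            break  # no new atom improves the fit
        support.append(idx)
        # Re-fit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def classify(y, class_dicts, k):
    """Return the class whose sub-dictionary reconstructs y with the least error."""
    errors = {
        label: float(np.linalg.norm(y - D @ omp(D, y, k)))
        for label, D in class_dicts.items()
    }
    return min(errors, key=errors.get)

# Hypothetical toy spectra: class "a" spans the first two bands, class "b" the third.
CLASS_DICTS = {
    "a": np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
    "b": np.array([[0.0], [0.0], [1.0]]),
}
pixel = np.array([2.0, 1.0, 0.1])
print(classify(pixel, CLASS_DICTS, k=2))
```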

Speech Synthesis for the Korean large Vocabulary Through the Waveform Analysis in Time Domains and Evaluation of Synthesized Speech Quality (시간영역에서의 파형분석에 의한 무제한 어휘 합성 및 음절 유형별 규칙합성음 음질평가)

  • Kang, Chan-Hee;Chin, Yong-Ohk
    • The Journal of the Acoustical Society of Korea / v.13 no.1 / pp.71-83 / 1994
  • This paper deals with improving the quality and naturalness of synthesized speech in a Korean TTS (Text-to-Speech) system. We extracted parameters (Table 2) such as amplitude, duration, and pitch period within a syllable by analyzing speech waveforms (Table 1) in the time domain, and synthesized syllables using them. Based on frequencies in a large-vocabulary Korean pronunciation dictionary, we selected and synthesized 229 syllables: 19 of type V, 80 of type CV, 30 of type VC, and 100 of type CVC. For each of the four Korean syllable types in the data format dictionary (Table 3), we tested 15 syllables with the objective MOS (Mean Opinion Score) evaluation method on four items, namely intelligibility, clearness, loudness, and naturalness, using a randomly selected group with no prior knowledge of the syllables. The experiments show that the synthesized syllables are very clear and that prosodic elements such as duration, accent, and pitch period can be controlled (Figs. 9, 10, 11, 12).


Design and Implementation of Tool Server and License Server REL/RDD processing based on MPEG-21 Framework (MPEG-21 프레임워크 기반의 REL/RDD 처리를 위한 라이센스 서버와 툴 서버의 설계 및 구현)

  • Hong, Hyun-Woo;Ryu, Kwang-Hee;Kim, Kwang-Yong;Kim, Jae-Gon;Jung, Hoe-Kyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / v.9 no.2 / pp.623-626 / 2005
  • Techniques for developing digital content have not yet been standardized, which causes problems in the creation, circulation, and consumption of digital content. To solve these problems, MPEG proposed the MPEG-21 framework. In the standard, IPMP is in charge of protecting and managing digital content, together with the rights expression language REL and the Rights Data Dictionary (RDD), which defines the terms of REL. However, the study of IPMP started later than the study of REL and RDD and of other parts of the MPEG-21 standard. As a result, there are few systems based on REL and RDD. In this paper, in order to manage and protect content rights in line with the latest standard, we designed and implemented a Tool Server and a License Server based on REL/RDD.
