• Title/Summary/Keyword: Candidate Words

Search results: 80

A Spelling Correction System Based on Statistical Data of Spelling Errors (철자오류의 통계자료에 근거한 철자오류 교정시스템)

  • Lim, Han-Kyu;Kim, Ung-Mo
    • The Transactions of the Korea Information Processing Society / v.2 no.6 / pp.839-846 / 1995
  • In this paper, spelling errors made by people in real word-processor use are collected and analyzed. Based on these data, we build a prototype spell-aid function that presents candidate words. The number of candidate characters is minimized using the frequencies of Jaso and characters, so the number of candidate words can also be minimized. The average number of candidate words presented is 3.2 to 8, and 62.1% to 84.1% of the correct words appear among the candidates.
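
The pruning strategy described in this abstract can be illustrated with a small sketch (English letters and made-up frequencies stand in for the paper's Korean Jaso statistics; the function names are mine): generate all strings one edit away from the typed word, keep those found in the lexicon, and rank them by corpus frequency so that only a short candidate list is presented.

```python
# Illustrative sketch only, not the paper's Jaso-frequency method.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit(word):
    # All strings reachable from `word` by one deletion, swap,
    # substitution, or insertion.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    subs = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + subs + inserts)

def candidates(word, freq, k=5):
    # Keep only known words, ranked by corpus frequency.
    found = one_edit(word) & freq.keys()
    return sorted(found, key=freq.get, reverse=True)[:k]

freq = {"their": 900, "there": 800, "the": 1000, "thief": 120}
print(candidates("thier", freq, k=3))  # ['their', 'thief']
```

Ranking by frequency is what keeps the presented list short: most one-edit strings never occur in the lexicon, and the few that do are ordered so the likeliest correction comes first.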


Creating Amnesia Synonyms in Traditional Korean Medicine : Database Utilization (한의학 고전 문헌 데이터베이스에서 활용할 건망 유의어 연구)

  • Kim, Wu-Young;Kwon, Oh-Min
    • The Journal of Korean Medical History / v.25 no.1 / pp.5-10 / 2012
  • The aim of this study is to create a catalogue of amnesia synonyms in Traditional Korean Medicine for database utilization. A two-staged literature search was carried out. First, two databases (China National Knowledge Infrastructure: CNKI; Oriental Medicine Advanced Searching Integrated System: OASIS) were searched for eligible articles, and a set of candidate words was identified from the articles. Second, the candidate words were searched in 30 medical classics to check their frequency of use. As a result, 9 candidate words, namely 喜忘(huimang), 善忘(seonmang), 多忘(damang), 好忘(homang), 健忘(geonmang), 遂忘(sumang), 遺忘(yumang), 忘事(mangsa), and 易忘(imang), were identified from the 10 eligible articles. Among them, 健忘(geonmang) was a descriptor, and 7 others, 喜忘(huimang), 善忘(seonmang), 多忘(damang), 好忘(homang), 遂忘(sumang), 遺忘(yumang), and 忘事(mangsa), were non-descriptors. 易忘(imang) was not an adequate synonym for amnesia.

Presidential Candidate's Speech based on Network Analysis : Mainly on the Visibility of the Words and the Connectivity between the Words (18대 대통령 선거 후보자의 연설문 네트워크 분석: 단어의 가시성(visibility)과 단어 간 연결성(connectivity)을 중심으로)

  • Hong, Ju-Hyun;Yun, Hae-Jin
    • The Journal of the Korea Contents Association / v.14 no.9 / pp.24-44 / 2014
  • This study explores, from a communication viewpoint, the political meaning of the speeches and statements of the candidates who ran in the 18th presidential election. The visibility of words and the connectivity between words are analyzed with respect to structure, vision, and policy. Visibility is measured by the frequency with which a word appears in a speech or statement; connectivity is derived from network analysis and expressed as a graph. For candidate Park, the key words are the happiness of the people and appointment; for candidate Moon, regime change and the Korean Peninsula; and for candidate Ahn, the people and change. Methodologically, this study contributes to research on candidates' discourse by using network analysis to explore the connectivity of words scientifically. Theoretically, it uses the network analysis results to reveal the leadership components in the speeches and statements. In conclusion, the study highlights an extension of communication studies.
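
A minimal sketch of the two measures as this abstract describes them (my simplified reading, not the authors' exact procedure): visibility as raw word frequency, connectivity as sentence-level co-occurrence counts that could feed a network graph.

```python
from collections import Counter
from itertools import combinations

def visibility(sentences):
    # Visibility: how often each word appears across the speech.
    return Counter(w for s in sentences for w in s.split())

def connectivity(sentences):
    # Connectivity: count each unordered word pair that co-occurs
    # within the same sentence (an edge list for a word network).
    edges = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(s.split())), 2):
            edges[(a, b)] += 1
    return edges

speech = ["happiness of the people", "the people and change"]
print(visibility(speech).most_common(2))  # [('the', 2), ('people', 2)]
```

The resulting edge counts are exactly what graph tools consume to draw the kind of word network the study reports.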

A Generation-based Text Steganography by Maintaining Consistency of Probability Distribution

  • Yang, Boya;Peng, Wanli;Xue, Yiming;Zhong, Ping
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.11 / pp.4184-4202 / 2021
  • Text steganography combined with natural language generation has become increasingly popular. Existing methods usually embed secret information in the generated words by controlling sampling during text generation: a candidate pool is constructed by a greedy strategy, and only high-probability words are encoded, which damages the statistical properties of the texts and seriously affects the security of the steganography. To reduce the influence of the candidate pool on statistical imperceptibility, we propose a steganography method based on a new sampling strategy. Instead of consisting only of high-probability words, the candidate pool is built from words whose probability differs little from that of an actual sample drawn from the language model, thus keeping consistency with the model's probability distribution. Moreover, we encode the candidate words according to their probability similarity to the target word, which further maintains the distribution. Experimental results show that the proposed method outperforms state-of-the-art steganographic methods in terms of security performance.
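
The sampling idea can be sketched roughly as follows (a toy distribution and my own function names; the paper works with a neural language model): draw an actual sample from the model's next-word distribution, build the candidate pool from the words whose probabilities are closest to the sampled word's, and let the secret bits pick the emitted word.

```python
import random

# Hedged sketch of the idea; details are illustrative, not the paper's.
def build_pool(dist, pool_size=4, rng=None):
    rng = rng or random.Random(0)
    words, probs = zip(*dist.items())
    # Sample as the language model actually would.
    sampled = rng.choices(words, weights=probs, k=1)[0]
    # Pool = words whose probability is closest to the sampled word's,
    # so the pool stays consistent with the model's distribution.
    ranked = sorted(words, key=lambda w: abs(dist[w] - dist[sampled]))
    return ranked[:pool_size]

def embed_bits(dist, bits, rng=None):
    pool = build_pool(dist, pool_size=2 ** len(bits), rng=rng)
    index = int(bits, 2)  # the secret bits select the emitted word
    return pool[index]

dist = {"the": 0.4, "a": 0.3, "this": 0.2, "that": 0.1}
print(embed_bits(dist, "01"))
```

A receiver sharing the model and the random seed can rebuild the same pool and recover the bits from the emitted word's index, which is the usual decoding side of such schemes.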

Detection of Porno Sites on the Web using Fuzzy Inference (퍼지추론을 적용한 웹 음란문서 검출)

  • 김병만;최상필;노순억;김종완
    • Journal of the Korean Institute of Intelligent Systems / v.11 no.5 / pp.419-425 / 2001
  • A method to detect pornographic documents on the internet is presented in this paper. The proposed method applies a fuzzy inference mechanism to conventional information retrieval techniques. First, several example porn sites are provided by users, and candidate words representing pornographic documents are extracted from these documents; in this process, lexical analysis and stemming are performed. Then, several values, namely the term frequency (TF), the document frequency (DF), and the heuristic information (HI), are computed for each candidate word. Finally, fuzzy inference over these three values is used to weight the candidate words, and the weights determine whether a given site is sexual or not. Experiments on a small test collection show that the proposed method is useful for detecting sexual sites automatically.
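
A hedged sketch of the weighting step (the paper's membership functions and rule base are not given in the abstract, so the rule below is illustrative): combine TF, DF, and HI values in [0, 1] with Zadeh's min/max operators and average the word weights into a site score.

```python
# Illustrative rule: "high TF AND high DF, OR high HI -> high weight",
# approximated with the classic Zadeh operators (AND -> min, OR -> max).
def fuzzy_weight(tf, df, hi):
    return max(min(tf, df), hi)

def site_score(word_scores):
    # Average weight of the candidate words found on a page; a site
    # would be flagged when this exceeds some threshold.
    weights = [fuzzy_weight(*s) for s in word_scores]
    return sum(weights) / len(weights)

scores = [(0.9, 0.8, 0.3), (0.4, 0.7, 0.9), (0.2, 0.1, 0.0)]
print(round(site_score(scores), 3))  # 0.6
```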


A Method of Chinese and Thai Cross-Lingual Query Expansion Based on Comparable Corpus

  • Tang, Peili;Zhao, Jing;Yu, Zhengtao;Wang, Zhuo;Xian, Yantuan
    • Journal of Information Processing Systems / v.13 no.4 / pp.805-817 / 2017
  • Cross-lingual query expansion is usually based on relationships among monolingual words, while a bilingual comparable corpus contains relationships among bilingual words. This paper therefore proposes a query expansion method based on these bilingual relationships. First, word vectors characterizing the bilingual words are trained on a Chinese-Thai comparable corpus. Then, the correlation between Chinese query words and Thai words is computed from these word vectors, followed by selecting the Thai candidate expansion terms according to the correlation values. Next, multiple groups of Thai query expansion sentences are built from the Thai candidate expansion words based on the Chinese query sentence. Finally, the optimal sentence is selected and the Thai query expansion is performed. Experimental results show that the proposed cross-lingual query expansion method can effectively improve the accuracy of Chinese-Thai cross-language information retrieval.
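
The expansion-term selection can be sketched with toy vectors (real ones would be trained on the comparable corpus; the Thai word names below are placeholders): rank target-language words by cosine similarity to the query word's vector and keep the top k.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expansion_terms(query_vec, target_vocab, k=2):
    # Rank target-language words by correlation with the query vector.
    ranked = sorted(target_vocab,
                    key=lambda w: cosine(query_vec, target_vocab[w]),
                    reverse=True)
    return ranked[:k]

# Toy bilingual vectors; placeholders for corpus-trained embeddings.
zh_query = [0.9, 0.1, 0.2]
th_vocab = {"th_word_a": [0.8, 0.2, 0.1],
            "th_word_b": [0.1, 0.9, 0.3],
            "th_word_c": [0.7, 0.0, 0.4]}
print(expansion_terms(zh_query, th_vocab))  # ['th_word_a', 'th_word_c']
```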

Exclusion of Non-similar Candidates using Positional Accuracy based on Levenstein Distance from N-best Recognition Results of Isolated Word Recognition (레벤스타인 거리에 기초한 위치 정확도를 이용한 고립 단어 인식 결과의 비유사 후보 단어 제외)

  • Yun, Young-Sun;Kang, Jeom-Ja
    • Phonetics and Speech Sciences / v.1 no.3 / pp.109-115 / 2009
  • Many isolated word recognition systems may generate non-similar words as recognition candidates because they use only acoustic information. In this paper, we investigate several techniques that can exclude non-similar words from the N-best candidate words by applying the Levenstein distance measure. First, word distance methods based on phone and syllable distances are considered; these use plain Levenstein distance on the phones, or a double Levenstein distance algorithm on the syllables, of the candidates. Next, word similarity approaches are presented that use the positional information of the characters in the candidate words. Each character position is labeled as inserted, deleted, or correct after alignment between the source and target strings, and word similarities are obtained from the characters' positional probabilities, i.e., the frequency ratio of observing the same character at that position. Experimental results show that the proposed methods effectively remove non-similar words from the systems' N-best recognition candidates without loss of system performance.
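
The basic exclusion idea can be sketched as follows (the threshold and word list are illustrative; the paper also develops phone/syllable variants and positional probabilities): drop N-best entries whose Levenshtein distance to the top hypothesis exceeds a bound.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance with a rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def filter_nbest(nbest, max_dist=2):
    # Keep only candidates close to the top hypothesis.
    top = nbest[0]
    return [w for w in nbest if levenshtein(top, w) <= max_dist]

print(filter_nbest(["seoul", "seol", "busan", "soul"]))
# ['seoul', 'seol', 'soul']
```

Here "busan" is rejected as acoustically unrelated to the top hypothesis, which is exactly the kind of non-similar candidate the abstract targets.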


An Internal Segmentation Method for the On-line Recognition of Run-on Characters (온라인 연속 필기 한글의 인식을 위한 내부 문자 분할에 관한 연구)

  • 정진영;전병환;김우성;김재희
    • Journal of the Korean Institute of Telematics and Electronics B / v.32B no.9 / pp.1231-1238 / 1995
  • In on-line character recognition, segmenting the input characters is important. This paper proposes an internal character segmentation algorithm that produces candidate words by considering possible combinations of Korean alphabets. In this process, projections of the strokes onto the horizontal axis are used to remove ambiguities among the candidate words. Experiments show that the internal segmentation algorithm outperforms an external segmentation algorithm as the gap between written characters becomes smaller.
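
A rough sketch of the projection step (my own simplification of the abstract; real input would be pen trajectories): histogram the x-coordinates of the stroke points and treat empty columns as potential internal segmentation points between characters.

```python
def horizontal_projection(strokes, width):
    # strokes: lists of (x, y) points; build a histogram of x positions.
    hist = [0] * width
    for stroke in strokes:
        for x, _ in stroke:
            hist[x] += 1
    return hist

def gap_columns(hist):
    # Columns no stroke passes through are candidate split points.
    return [x for x, count in enumerate(hist) if count == 0]

strokes = [[(0, 1), (1, 2), (2, 1)], [(5, 0), (6, 3)]]
print(gap_columns(horizontal_projection(strokes, 8)))  # [3, 4, 7]
```

With run-on handwriting the gaps shrink or vanish, which is why the paper combines the projection with alphabet-combination candidates rather than relying on gaps alone.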


Various Approaches to Improve Exclusion Performance of Non-similar Candidates from N-best Recognition Results on Isolated Word Recognition (고립 단어 인식 결과의 비유사 후보 단어 제외 성능을 개선하기 위한 다양한 접근 방법 연구)

  • Yun, Young-Sun
    • Phonetics and Speech Sciences / v.2 no.4 / pp.153-161 / 2010
  • Many isolated word recognition systems may generate non-similar words as recognition candidates because they use only acoustic information. Previous studies [1,2] investigated several techniques that can exclude non-similar words from the N-best candidate words by applying the Levenstein distance measure. This paper discusses various techniques for improving the removal of non-similar recognition results, including comparison penalties or weights, phone accuracy based on confusion information, weighting candidates by rank order, and partial comparisons. Experimental results show that some of the proposed methods retain more accurate recognition results than the previous methods.
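
One of the refinements mentioned, confusion-based costs, can be sketched like this (the confusion table below is made up for illustration): a weighted edit distance in which substituting acoustically confusable phones costs less than a full edit.

```python
# Hypothetical confusion costs; real values would come from a
# phone-confusion matrix estimated on recognition data.
CONFUSION = {("b", "p"): 0.3, ("p", "b"): 0.3}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return CONFUSION.get((x, y), 1.0)

def weighted_edit_distance(a, b):
    # Full DP table; insert/delete cost 1, substitution cost from table.
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return dp[m][n]

print(weighted_edit_distance("bat", "pat"))  # 0.3 vs plain distance 1
```

Lowering the cost of confusable substitutions keeps acoustically similar candidates inside the exclusion threshold while still rejecting genuinely dissimilar words.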


Representative Labels Selection Technique for Document Cluster using WordNet (문서 클러스터를 위한 워드넷기반의 대표 레이블 선정 방법)

  • Kim, Tae-Hoon;Sohn, Mye
    • Journal of Internet Computing and Services / v.18 no.2 / pp.61-73 / 2017
  • In this paper, we propose a document cluster labeling method that uses the information content of the words in a cluster to convey what the cluster implies. To do so, we calculate the weight and frequency of the words; these two measures determine the relative weight among the words in the cluster. As a next step, we identify the candidate labels using WordNet, where each candidate label is the least common hypernym of the words in the cluster. Finally, the representative labels are determined from the information content and the weights of the words. To demonstrate the superiority of our method, we perform a heuristic experiment using two measures, the suitability of the candidate label ($Suitability_{cl}$) and the appropriacy of the representative label ($Appropriacy_{rl}$). With the proposed method, the suitability of the candidate label decreases slightly compared with existing methods, but the computational cost is about 20% of that of the conventional methods, and the appropriacy of the representative label improves on the existing methods. As a result, the method is expected to help data analysts interpret document clusters more easily.
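
The least-common-hypernym step can be sketched with a tiny hand-made hypernym table standing in for WordNet (the words and chains below are illustrative): walk each word's hypernym chain and return the nearest shared ancestor as a candidate label.

```python
# Toy child -> parent table; in practice this comes from WordNet.
HYPERNYMS = {
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "cat": "feline", "feline": "carnivore", "mammal": "animal",
}

def chain(word):
    # Walk the hypernym chain from the word up to the root.
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

def least_common_hypernym(w1, w2):
    # Nearest ancestor shared by both chains.
    ancestors = chain(w2)
    for node in chain(w1):
        if node in ancestors:
            return node
    return None

print(least_common_hypernym("dog", "cat"))  # carnivore
```

A cluster containing "dog" and "cat" would thus get "carnivore" as a candidate label, which is then scored against the words' information content and weights as the abstract describes.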