• Title/Summary/Keyword: Similar Words


On Improving Discriminability among Acoustically Similar Words by Modified Distance Metric (변형된 거리척도에 의한 음향학적으로 유사한 단어들 사이의 변별력 개선)

  • 김형순
    • Proceedings of the Acoustical Society of Korea Conference / 1987.11a / pp.89-92 / 1987
  • In a template-matching-based speech recognition system, excessive weight given to perceptually unimportant spectral variations is undesirable for discriminating among acoustically similar words. By introducing a simple threshold-type nonlinearity applied to the distance metric, the word recognition performance can be improved for a vocabulary with similar-sounding words, without modifying the system structure.
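
The thresholding idea can be illustrated with a minimal sketch. The abstract does not give the exact form of the nonlinearity, so everything here is an assumption: the function names, the squared-difference base metric, and the threshold `tau` are hypothetical. Differences smaller than `tau` are treated as perceptually unimportant and contribute nothing to the distance.

```python
def plain_distance(x, y):
    """Ordinary sum of squared differences between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def thresholded_distance(x, y, tau=0.5):
    """Squared distance that ignores component differences smaller than tau.

    tau is a hypothetical threshold; the abstract does not specify its value
    or how the nonlinearity is defined.
    """
    total = 0.0
    for a, b in zip(x, y):
        d = abs(a - b)
        if d >= tau:  # only large deviations count toward the distance
            total += d * d
    return total


# Toy frames: three small deviations and one large one.
reference = [1.0, 2.0, 3.0, 4.0]
test_frame = [1.1, 2.2, 3.1, 6.0]

print(plain_distance(reference, test_frame))
print(thresholded_distance(reference, test_frame))
```

With this kind of metric, many small mismatches can no longer outweigh the one large mismatch that actually separates two similar-sounding words.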


Speech Verification using Similar Word Information in Isolated Word Recognition (고립단어 인식에 유사단어 정보를 이용한 단어의 검증)

  • 백창흠;이기정;홍재근
    • Proceedings of the IEEK Conference / 1998.10a / pp.1255-1258 / 1998
  • The Hidden Markov Model (HMM) is the most widely used method in speech recognition. In general, HMM parameters are trained to maximize the likelihood (ML) of the training data. This method does not take discrimination against other words into account. To address this problem, this paper proposes a word verification method that re-recognizes the recognized word and its similar word using a discriminative function between the two words. The similar word is selected by calculating the probability of the other words for each HMM. A recognizer that discriminates between words is realized by weighting each state, and the weights are calculated by a genetic algorithm.


Exclusion of Non-similar Candidates using Positional Accuracy based on Levenstein Distance from N-best Recognition Results of Isolated Word Recognition (레벤스타인 거리에 기초한 위치 정확도를 이용한 고립 단어 인식 결과의 비유사 후보 단어 제외)

  • Yun, Young-Sun;Kang, Jeom-Ja
    • Phonetics and Speech Sciences / v.1 no.3 / pp.109-115 / 2009
  • Many isolated word recognition systems may generate non-similar words as recognition candidates because they use only acoustic information. In this paper, we investigate several techniques that can exclude non-similar words from the N-best candidate words by applying the Levenshtein distance measure. First, word distance methods based on phone and syllable distances are considered. These methods use either the plain Levenshtein distance on the phones or a double Levenshtein distance algorithm on the syllables of the candidates. Next, word similarity approaches are presented that use the position information of the characters in the word candidates. Each character's position is labeled as inserted, deleted, or correct after alignment between the source and target strings. Word similarities are obtained from the characters' positional probabilities, that is, the frequency ratio with which the same character is observed at a given position. The experimental results show that the proposed methods are effective for removing non-similar words from the N-best recognition candidates without loss of system performance.
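
A minimal sketch of the simplest variant mentioned above, the plain phone-level distance, is given here. It is not the authors' positional-accuracy method: the normalization by the longer sequence, the cutoff `max_ratio`, the function names, and the toy phone sequences are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete, substitute cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exclude_non_similar(candidates, max_ratio=0.5):
    """Keep N-best candidates close to the top-ranked one.

    candidates: list of phone sequences, best first.
    max_ratio: hypothetical cutoff on distance / longer sequence length.
    """
    top = candidates[0]
    kept = [top]
    for cand in candidates[1:]:
        ratio = levenshtein(top, cand) / max(len(top), len(cand))
        if ratio <= max_ratio:
            kept.append(cand)
    return kept


# Toy N-best list written as phone sequences (illustrative only).
n_best = [
    ["s", "eo", "u", "l"],
    ["s", "eo", "u", "n"],
    ["b", "u", "s", "a", "n"],
]
print(exclude_non_similar(n_best))
```

Operating on phone lists rather than raw strings keeps multi-character phone symbols such as `eo` as single units during alignment.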


Various Approaches to Improve Exclusion Performance of Non-similar Candidates from N-best Recognition Results on Isolated Word Recognition (고립 단어 인식 결과의 비유사 후보 단어 제외 성능을 개선하기 위한 다양한 접근 방법 연구)

  • Yun, Young-Sun
    • Phonetics and Speech Sciences / v.2 no.4 / pp.153-161 / 2010
  • Many isolated word recognition systems may generate non-similar words as recognition candidates because they use only acoustic information. The previous studies [1,2] investigated several techniques that can exclude non-similar words from the N-best candidate words by applying the Levenshtein distance measure. This paper discusses various techniques for improving the removal of non-similar recognition results. The methods include comparison penalties or weights, phone accuracy based on confusion information, weighting candidates by rank order, and partial comparisons. The experimental results show that some of the proposed methods retain more accurate recognition results than the previous method.


Korean Language Clustering using Word2Vec (Word2Vec를 이용한 한국어 단어 군집화 기법)

  • Heu, Jee-Uk
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.18 no.5 / pp.25-30 / 2018
  • With the recent development of Internet technology, research areas such as data retrieval and extraction have become increasingly important for providing information efficiently and quickly. In particular, techniques for analyzing and finding words that are semantically similar to a given Korean word, such as a compound word or a newly coined word, are necessary because the meaning of such words is not easy to grasp. To handle this problem, word clustering is one technique that groups the words similar to a given word. In this paper, we propose a Korean language clustering technique that clusters similar words by embedding the words from the given documents using Word2Vec.

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods for handling big data in text mining. For dimensionality reduction, we should consider the density of data, which has a significant influence on the performance of sentence classification. Higher-dimensional data requires many computations, which can eventually lead to high computational cost and overfitting in the model. Thus, a dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed, from merely reducing noise in the data, such as misspellings or informal text, to including semantic and syntactic information. In addition, the representation and selection of text features affect the performance of the classifier in sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data from the observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words that can capture semantic and syntactic information from data, are also utilized. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm selects the words that are not important, we assume that the words similar to the selected words also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification, which conduct selective word elimination under specific rules and construct word embeddings based on Word2Vec.
To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words that have comparatively low information gain values from the raw text and form the word embedding. Second, we additionally select words that are similar to the words with low information gain values and build the word embedding. Finally, the filtered text and word embeddings are applied to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle on Amazon.com, IMDB, and Yelp as datasets, and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes was over 70% were classified as helpful reviews. Yelp, however, shows only the number of helpful votes. We extracted 100,000 reviews that received more than five helpful votes by random sampling from among 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words. We showed that one of the proposed methods performs better than the embeddings with all the words. Removing unimportant words improves performance; however, removing too many words lowers it. Future research should consider diverse preprocessing approaches and an in-depth analysis of word co-occurrence for measuring similarity values among words. Also, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed methods, and the possible combinations of word embedding methods and elimination methods can then be identified.
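
The two elimination steps described above can be sketched on toy data. This is an illustration under stated assumptions, not the authors' pipeline: the documents, labels, two-dimensional vectors, thresholds, and function names are all hypothetical, and a real run would take the vectors from a trained Word2Vec model.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, word):
    """Drop in label entropy from knowing whether `word` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    conditional = 0.0
    for part in (present, absent):
        if part:
            conditional += len(part) / len(labels) * entropy(part)
    return entropy(labels) - conditional


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def words_to_remove(docs, labels, vectors, ig_threshold, sim_threshold):
    """Return (low-information-gain words, extra words similar to them)."""
    vocab = set().union(*docs)
    low_ig = {w for w in vocab if information_gain(docs, labels, w) < ig_threshold}
    similar = set()
    for w in vocab - low_ig:
        if any(cosine(vectors[w], vectors[r]) >= sim_threshold for r in low_ig):
            similar.add(w)
    return low_ig, similar


# Toy corpus: each document is a set of tokens; labels are 1 (helpful) or 0.
docs = [{"great", "book"}, {"bad", "book"}, {"great", "novel", "story"}, {"bad", "story"}]
labels = [1, 0, 1, 0]
# Hypothetical 2-d embeddings standing in for Word2Vec vectors.
vectors = {
    "book": [1.0, 0.0], "novel": [0.9, 0.1], "story": [0.8, 0.3],
    "great": [0.0, 1.0], "bad": [0.0, -1.0],
}

low_ig, similar = words_to_remove(docs, labels, vectors, ig_threshold=0.1, sim_threshold=0.9)
print(sorted(low_ig), sorted(similar))
```

In this toy setup the first step removes only the low-information-gain words, while the second step also removes words whose vectors lie close to them.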

Comparison Survey Examining Korean and Japanese University Students' Understanding of Foreign Words

  • Lee, Jae Hoon;Arimitsu, Yutaka;Wu, Zhiqiang;Yagi, Hidetsugu
    • Journal of Engineering Education Research / v.17 no.4 / pp.54-57 / 2014
  • This paper investigated the influence of foreign words, otherwise known as loan words, on the global communication abilities of university students from two non-English-speaking countries: Korea and Japan. To survey the understanding and usage of foreign words that come from English and are used frequently in daily conversation, questionnaires were administered to Korean and Japanese university students majoring in engineering who shared similar linguistic backgrounds. The results were analyzed from a global communication viewpoint. Based on the results, methods for improving global communication skills in engineering education were proposed.

Microblog User Geolocation by Extracting Local Words Based on Word Clustering and Wrapper Feature Selection

  • Tian, Hechan;Liu, Fenlin;Luo, Xiangyang;Zhang, Fan;Qiao, Yaqiong
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.10 / pp.3972-3988 / 2020
  • Existing methods rely on statistical features to extract local words for microblog user geolocation. The extracted words contain many non-local words, which lowers geolocation accuracy. Considering both the statistical and semantic features of local words, this paper proposes a microblog user geolocation method that extracts local words based on word clustering and wrapper feature selection. First, ordinary words without positional indications are initially filtered out based on statistical features. Second, a word clustering algorithm based on word vectors is proposed. The remaining semantically similar words are clustered together based on the distance between word vectors that carry semantic meaning. Next, a wrapper feature selection algorithm based on sequential backward subset search is proposed. The cluster subset with the best geolocation effect is selected. Words in the selected cluster subset are extracted as local words. Finally, a Naive Bayes classifier is trained on the local words to geolocate the microblog user. The proposed method is validated on two different types of microblog data, Twitter and Weibo. The results show that the proposed method outperforms two typical existing methods based on statistical features in terms of accuracy, precision, recall, and F1-score.
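
The sequential backward subset search step can be sketched as a greedy loop. This is a generic illustration, not the paper's implementation: in the paper the score would be the geolocation performance of a classifier trained on the chosen clusters, whereas the `toy_score` function, cluster names, and gain values below are hypothetical stand-ins.

```python
def sequential_backward_selection(clusters, score):
    """Greedily drop one cluster per round while the score strictly improves.

    clusters: list of cluster identifiers.
    score: callable mapping a list of clusters to a number (higher is better).
    """
    selected = list(clusters)
    best = score(selected)
    while len(selected) > 1:
        best_trial = None
        for c in selected:
            trial = [s for s in selected if s != c]
            trial_score = score(trial)
            if trial_score > best:  # remember the best single removal
                best, best_trial = trial_score, trial
        if best_trial is None:  # no removal helps, so stop
            break
        selected = best_trial
    return selected, best


# Hypothetical contribution of each word cluster to geolocation accuracy.
gain = {"place_names": 0.3, "dialect": 0.2, "weekday_names": -0.1, "function_words": -0.05}


def toy_score(subset):
    """Stand-in for training and evaluating a classifier on the subset."""
    return 0.5 + sum(gain[c] for c in subset)


selected, best = sequential_backward_selection(list(gain), toy_score)
print(selected, best)
```

Because it is a wrapper method, every candidate subset costs one full evaluation of the scoring model, which is why the search removes only one cluster per round.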

A Study on the Comparison of Korea Good Manufacturing Practice (KGMP) Evaluation Criteria with Certification Criteria of Extramural Herbal Dispensaries (원외탕전실 평가인증기준과 KGMP 평가인증 기준과의 비교연구)

  • Hyeong-Gi Kim;Eui-Hyoung Hwang;Eun-Gyeong Lee;Byung-Mook Lim;Young-Jae Shin;Sun-Young Park;Byung-Cheul Shin
    • The Korea Journal of Herbology / v.38 no.6 / pp.61-71 / 2023
  • Objectives: This study aimed to determine the future direction of the accreditation system for extramural herbal dispensaries (EHD) by comparing the current EHD criteria with the existing Korea good manufacturing practice (KGMP) regulations. Methods: Among the EHD accreditation criteria, the criteria for general herbal medicine were compared with the pharmaceutical GMP of Korea. The EHD accreditation regulations and the KGMP regulations were compared, and similar items were organized based on the KGMP index. All criteria from both were extracted for each element, classified into keywords, and evaluated by dividing them into identical, similar, and non-matching. Results: Among the 189 KGMP criteria, 77 were consistent with the EHD accreditation criteria, and 15 were similar. Based on the EHD accreditation, 70.4% of the criteria were consistent with or similar to KGMP. A total of 27 keywords appeared only in the GMP criteria and not in the EHD criteria. Conversely, a total of 25 keywords appeared only in the EHD criteria and not in the GMP criteria. There were 12 similar keywords, and for 4 of them the EHD accreditation was more specific than the KGMP. Conclusions: The criteria for general herbal medicine in EHD showed a level of accreditation criteria similar or equivalent to that of pharmaceutical GMP in Korea, and it is believed that they should be considered at the current level to reflect the characteristics of herbal medicine.

A Study on the Development of Vocabulary of Korean Children: Based on the Analysis of the Type of Words (유형별로 본 아동 어휘 발달 특성: 원어정보를 중심으로)

  • Choi Eunah;Kim Soo-Jin;Shin Jiyoung
    • MALSORI / no.52 / pp.85-99 / 2004
  • The aim of this study is to show the developmental characteristics of the vocabulary of Korean children. In this study, words were classified according to their origin: pure Korean, Sino-Korean, and foreign words. The results of the present study are as follows. In common nouns, the rate of Sino-Korean words was 33.6% in 3-year-old children but 50.7% in 8-year-old children. Adverbs and prenouns showed a similar rate. The rate of words of foreign origin was 10-11% in all age groups.
