
Comparison Thai Word Sense Disambiguation Method

  • Modhiran, Teerapong;Kruatrachue, Boontee;Supnithi, Thepchai
    • 제어로봇시스템학회:학술대회논문집 / 2004.08a / pp.1307-1312 / 2004
  • Word sense disambiguation is one of the most important problems in natural language processing, underlying research topics such as information retrieval and machine translation. Many approaches can be employed to resolve word ambiguity with a reasonable degree of accuracy; the main strategies are knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy. Its purpose is to compare three well-known machine learning techniques, SNoW, SVM, and Naive Bayes, for word sense disambiguation in Thai. Ten ambiguous words are selected and tested with word and POS features. The results show that the SVM algorithm gives the best results for Thai WSD, with an accuracy rate of approximately 83-96%.
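
The corpus-based strategy compared above can be illustrated with a tiny Naive Bayes disambiguator over bag-of-words context features. This is a minimal sketch: the English training contexts and senses are invented for illustration and stand in for the paper's Thai corpus and features.

```python
import math
from collections import Counter, defaultdict

# Toy labeled contexts for the ambiguous word "bank" (illustrative only).
TRAIN = [
    ("deposit money at the bank account", "finance"),
    ("the bank charged an interest fee", "finance"),
    ("we sat on the river bank fishing", "river"),
    ("the muddy bank of the stream", "river"),
]

def train_nb(data):
    """Estimate per-sense priors and word likelihoods with add-one smoothing."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in data:
        sense_counts[sense] += 1
        for w in context.split():
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    """Pick the sense maximizing log P(sense) + sum of log P(word | sense)."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context.split():
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train_nb(TRAIN)
print(disambiguate("interest on the bank deposit", *model))  # finance
```

An SVM or SNoW classifier would consume the same context features; only the decision rule changes.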

Sub-word Based Offline Handwritten Farsi Word Recognition Using Recurrent Neural Network

  • Ghadikolaie, Mohammad Fazel Younessy;Kabir, Ehsanolah;Razzazi, Farbod
    • ETRI Journal / v.38 no.4 / pp.703-713 / 2016
  • In this paper, we present a segmentation-based method for offline Farsi handwritten word recognition. Although most segmentation-based systems suffer from segmentation errors within the first stages of recognition, using the inherent features of the Farsi writing script, we have segmented the words into sub-words. Instead of using a single complex classifier with many (N) output classes, we have created N simple recurrent neural network classifiers, each having only true/false outputs with the ability to recognize sub-words. Through the extraction of the number of sub-words in each word, and labeling the position of each sub-word (beginning/middle/end), many of the sub-word classifiers can be pruned, and a few remaining sub-word classifiers can be evaluated during the sub-word recognition stage. The candidate sub-words are then joined together and the closest word from the lexicon is chosen. The proposed method was evaluated using the Iranshahr database, which consists of 17,000 samples of Iranian handwritten city names. The results show the high recognition accuracy of the proposed method.
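
The pruning step described above — discarding lexicon words whose sub-word count or sub-word positions cannot match the input — can be sketched as follows. The lexicon entries and sub-word splits are hypothetical stand-ins for the Farsi-specific segmentation, and the classifier outputs are simulated.

```python
# Hypothetical lexicon: each word stored as its ordered list of sub-words.
LEXICON = {
    "tehran": ["teh", "ran"],
    "tabriz": ["tab", "riz"],
    "shiraz": ["shi", "raz"],
    "mashhad": ["mash", "had"],
}

def position_label(i, n):
    """Label a sub-word as beginning/middle/end within its word."""
    if i == 0:
        return "beginning"
    return "end" if i == n - 1 else "middle"

def prune_candidates(num_subwords, recognized):
    """Keep only lexicon words consistent with the observed sub-word count
    and with every (position, sub-word) pair the classifiers accepted."""
    candidates = []
    for word, subs in LEXICON.items():
        if len(subs) != num_subwords:
            continue  # wrong sub-word count: its classifiers are never run
        ok = all(
            subs[i] == sw and position_label(i, len(subs)) == pos
            for i, (pos, sw) in recognized.items()
        )
        if ok:
            candidates.append(word)
    return candidates

# Suppose a sub-word classifier fired on "teh" at the beginning position.
print(prune_candidates(2, {0: ("beginning", "teh")}))  # ['tehran']
```

Only the classifiers for the surviving candidates need to be evaluated, which is what keeps the N true/false networks tractable.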

Retrieval Model Based on Word Translation Probabilities and the Degree of Association of Query Concept (어휘 번역확률과 질의개념연관도를 반영한 검색 모델)

  • Kim, Jun-Gil;Lee, Kyung-Soon
    • The KIPS Transactions:PartB / v.19B no.3 / pp.183-188 / 2012
  • One of the major challenges for retrieval performance is the word mismatch between users' queries and documents in information retrieval. To solve the word mismatch problem, we propose a retrieval model that reflects the degree of association of the query concept and word translation probabilities in a translation-based model. The word translation probabilities are calculated from pairs consisting of a sentence and its succeeding sentence. To validate the proposed method, we experimented on the TREC AP test collection. The experimental results show that the proposed model achieved significant improvement over the language model and outperformed the translation-based language model.
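
A minimal sketch of estimating word translation probabilities from (sentence, succeeding sentence) pairs in the spirit described above: co-occurrences of a word in one sentence with words in the next are counted and normalized into P(target | source). The three-sentence corpus is invented, and real estimators would add smoothing.

```python
from collections import Counter, defaultdict

# A toy "document" contributing (sentence, next sentence) pairs.
SENTENCES = [
    "the court ruled on the case",
    "the judge announced the verdict",
    "the verdict ended the trial",
]

def translation_probs(sentences):
    """P(t | s) from co-occurrence of s in a sentence and t in the next one."""
    pair_counts = defaultdict(Counter)
    for cur, nxt in zip(sentences, sentences[1:]):
        for s in set(cur.split()):
            for t in set(nxt.split()):
                pair_counts[s][t] += 1
    probs = {}
    for s, counts in pair_counts.items():
        total = sum(counts.values())
        probs[s] = {t: c / total for t, c in counts.items()}
    return probs

probs = translation_probs(SENTENCES)
print(round(probs["court"]["judge"], 3))  # 0.25
```

In the retrieval model, these probabilities let a query word match related document words ("court" retrieving documents about a "judge") rather than only exact matches.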

Microblog User Geolocation by Extracting Local Words Based on Word Clustering and Wrapper Feature Selection

  • Tian, Hechan;Liu, Fenlin;Luo, Xiangyang;Zhang, Fan;Qiao, Yaqiong
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.10 / pp.3972-3988 / 2020
  • Existing methods rely on statistical features to extract local words for microblog user geolocation, but the extracted words contain many non-local words, which lowers geolocation accuracy. Considering both the statistical and the semantic features of local words, this paper proposes a microblog user geolocation method that extracts local words based on word clustering and wrapper feature selection. First, ordinary words without positional indications are filtered out based on statistical features. Second, a word clustering algorithm based on word vectors is proposed: the remaining semantically similar words are clustered together based on the distance between their word vectors. Next, a wrapper feature selection algorithm based on sequential backward subset search is proposed: the cluster subset with the best geolocation effect is selected, and the words in it are extracted as local words. Finally, a Naive Bayes classifier is trained on the local words to geolocate the microblog user. The proposed method is validated on two different types of microblog data, Twitter and Weibo. The results show that it outperforms two typical existing methods based on statistical features in terms of accuracy, precision, recall, and F1-score.
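
The wrapper step — a sequential backward search over cluster subsets that keeps the subset scoring best — can be sketched generically. The scoring function below is a hypothetical stand-in for retraining and evaluating the geolocation classifier on each candidate subset.

```python
def backward_select(clusters, score):
    """Greedy sequential backward selection: repeatedly drop the cluster
    whose removal yields the best score, remembering the best subset seen."""
    current = list(clusters)
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:
        # Evaluate removing each remaining cluster in turn.
        trials = [(score([c for c in current if c != r]), r) for r in current]
        trial_score, removed = max(trials)
        current = [c for c in current if c != removed]
        if trial_score > best_score:
            best_subset, best_score = list(current), trial_score
    return best_subset, best_score

# Hypothetical effect of each word cluster on geolocation quality:
# 'a' and 'c' help, 'b' hurts (e.g., it holds non-local words).
weights = {"a": 0.4, "b": -0.2, "c": 0.3}
subset, s = backward_select(["a", "b", "c"], lambda cs: sum(weights[c] for c in cs))
print(subset, round(s, 2))  # ['a', 'c'] 0.7
```

Because each trial re-scores the downstream classifier, the search is a wrapper method rather than a filter on per-word statistics.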

The Sentence Similarity Measure Using Deep-Learning and Char2Vec (딥러닝과 Char2Vec을 이용한 문장 유사도 판별)

  • Lim, Geun-Young;Cho, Young-Bok
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.10 / pp.1300-1306 / 2018
  • The purpose of this study is to examine whether Char2Vec can serve as an alternative to Word2Vec, the best-known word embedding model, in deep-learning-based sentence similarity measurement. In the experiments, we used the Siamese Ma-LSTM recurrent neural network architecture to measure the similarity of two random sentences. The Siamese Ma-LSTM model was implemented with TensorFlow. Each model was trained for 200 epochs in a GPU environment, which took about 20 hours. We then compared the training results of the Word2Vec-based model with those of the Char2Vec-based model. The Char2Vec-based model, initialized with random weights, recorded 75.1% accuracy on the validation dataset, while the Word2Vec-based model, pretrained on 3 million words and phrases, recorded 71.6%. Char2Vec is therefore a suitable alternative to Word2Vec for mitigating its high system memory requirements.
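
The Ma-LSTM (Manhattan LSTM) architecture used above scores two sentence encodings with the exponential of their negative L1 distance, so the similarity lies in (0, 1]. A minimal sketch of that scoring function, with invented stand-in vectors in place of real LSTM hidden states:

```python
import math

def malstm_similarity(h1, h2):
    """Ma-LSTM similarity: exp(-||h1 - h2||_1), in (0, 1]."""
    l1 = sum(abs(a - b) for a, b in zip(h1, h2))
    return math.exp(-l1)

# Stand-in final hidden states for three encoded sentences.
h_a = [0.2, -0.1, 0.4]
h_b = [0.2, -0.1, 0.4]
h_c = [0.9, 0.3, -0.5]

print(malstm_similarity(h_a, h_b))             # identical encodings -> 1.0
print(round(malstm_similarity(h_a, h_c), 3))   # distant encodings -> near 0
```

Whether the encoder reads Word2Vec word vectors or Char2Vec character vectors, this output layer is unchanged; only the input embedding differs.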

A Study on the Enhancement of GUI Based High-end Word processor (GUI 환경의 고성능 워드프로세서의 발전 방향에 관한 연구)

  • 홍원기;이상렬
    • KSCI Review / v.2 no.2 / pp.19-26 / 1996
  • Word processors have improved rapidly with the widespread acceptance of GUI environments such as Windows. In this paper, based on an analysis of the functions of Korean word processors on Windows, we propose methods for enhancing GUI-based word processors and discuss the direction of multimedia word processors.

An Algorithm for Text Image Watermarking based on Word Classification (단어 분류에 기반한 텍스트 영상 워터마킹 알고리즘)

  • Kim Young-Won;Oh Il-Seok
    • Journal of KIISE:Software and Applications / v.32 no.8 / pp.742-751 / 2005
  • This paper proposes a novel text image watermarking algorithm based on word classification. The words are classified into K classes using simple features. Several adjacent words are grouped into a segment, and the segments are also classified using the word class information. The same amount of information is inserted into each of the segment classes, and the signal is encoded by modifying some inter-word-space statistics of the segment classes. Subjective comparisons with conventional word-shift algorithms are presented under several criteria.
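
The inter-word-space encoding idea can be sketched on plain text rather than a text image: one bit per gap, narrow space for 0 and wide space for 1. This is a deliberately simplified, hypothetical stand-in for the segment-class statistics the paper actually modifies in the image domain.

```python
def embed_bits(words, bits):
    """Encode one bit per inter-word gap: single space for 0, double for 1."""
    assert len(bits) == len(words) - 1
    out = words[0]
    for word, bit in zip(words[1:], bits):
        out += ("  " if bit else " ") + word
    return out

def extract_bits(text):
    """Recover the embedded bits from the gap widths."""
    bits, gap = [], 0
    for ch in text:
        if ch == " ":
            gap += 1
        elif gap:
            bits.append(1 if gap >= 2 else 0)
            gap = 0
    return bits

marked = embed_bits(["the", "quick", "brown", "fox"], [1, 0, 1])
print(extract_bits(marked))  # [1, 0, 1]
```

Working on segment-level statistics instead of individual gaps, as the paper does, makes the mark less visible and more robust to noise in the scanned image.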

Exclusion of Non-similar Candidates using Positional Accuracy based on Levenstein Distance from N-best Recognition Results of Isolated Word Recognition (레벤스타인 거리에 기초한 위치 정확도를 이용한 고립 단어 인식 결과의 비유사 후보 단어 제외)

  • Yun, Young-Sun;Kang, Jeom-Ja
    • Phonetics and Speech Sciences / v.1 no.3 / pp.109-115 / 2009
  • Many isolated word recognition systems may generate non-similar words as recognition candidates because they use only acoustic information. In this paper, we investigate several techniques that exclude non-similar words from the N-best candidate words by applying the Levenshtein distance measure. First, word distance methods based on phone and syllable distances are considered; these use the Levenshtein distance on phones or a double Levenshtein distance algorithm on the syllables of candidates. Next, word similarity approaches are presented that use the character position information of the word candidates. Each character position is labeled as inserted, deleted, or correct after alignment between the source and target strings, and the word similarities are obtained from character positional probabilities, i.e., the frequency ratio of observing the same character at that position. Experimental results show that the proposed methods effectively remove non-similar words from the N-best recognition candidates without loss of system performance.
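
The Levenshtein distance underlying both filtering methods is the standard dynamic program; as a sketch, with an invented N-best list and an illustrative distance threshold:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# Drop N-best candidates too far from the top hypothesis (toy list, toy threshold).
nbest = ["seoul", "seol", "busan"]
top = nbest[0]
kept = [w for w in nbest if levenshtein(top, w) <= 2]
print(kept)  # ['seoul', 'seol']
```

The paper's double Levenshtein variant applies the same recurrence at the syllable level, where each "symbol" is itself compared by an inner edit distance.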

Sentence model based subword embeddings for a dialog system

  • Chung, Euisok;Kim, Hyun Woo;Song, Hwa Jeon
    • ETRI Journal / v.44 no.4 / pp.599-612 / 2022
  • This study focuses on improving a word embedding model to enhance the performance of downstream tasks, such as those of dialog systems. To improve traditional word embedding models, such as skip-gram, it is critical to refine the word features and expand the context model. In this paper, we approach the word model from the perspective of subword embedding and attempt to extend the context model by integrating various sentence models. Our proposed sentence model is a subword-based skip-thought model that integrates self-attention and relative position encoding techniques. We also propose a clustering-based dialog model for downstream task verification and evaluate its relationship with the sentence-model-based subword embedding technique. The proposed subword embedding method produces better results than previous methods in evaluating word and sentence similarity. In addition, the downstream task verification, a clustering-based dialog system, demonstrates an improvement of up to 4.86% over the results of FastText in previous research.
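
Subword embedding in the skip-gram/FastText tradition referenced above represents a word through its character n-grams; a minimal sketch of that decomposition (the n-gram range and boundary markers follow the common FastText convention, not necessarily this paper's exact setup):

```python
def subword_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in boundary markers,
    plus the whole marked word itself (FastText-style decomposition)."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)
    return grams

# A word's vector is then the sum of the embeddings of these units,
# so unseen or rare words still receive meaningful representations.
print(sorted(subword_ngrams("dialog")))
```

The sentence-model extension in the paper then trains these subword units against skip-thought-style sentence contexts instead of only word windows.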

SSF: Sentence Similar Function Based on word2vector Similar Elements

  • Yuan, Xinpan;Wang, Songlin;Wan, Lanjun;Zhang, Chengyuan
    • Journal of Information Processing Systems / v.15 no.6 / pp.1503-1516 / 2019
  • In this paper, to improve the accuracy of long-sentence similarity calculation, we propose a sentence similarity calculation method based on a system similarity function, which uses word2vector elements to calculate the sentence similarity. The higher accuracy of our algorithm derives from two characteristics: the negative effect of the penalty term, and the fact that the sentence similar function (SSF) based on word2vector similar elements does not satisfy the exchange rule (it is not symmetric in its arguments). In later studies, we found that the time complexity of our algorithm depends on the process of calculating similar elements, so we build an index of potentially similar elements while training the word vectors. Finally, the experimental results show that our algorithm has higher accuracy than the word mover's distance (WMD) and the lowest query time of the three SSF calculation methods.
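
A schematic of an asymmetric (non-commutative) sentence similarity with a penalty term, in the spirit of SSF: each word of the first sentence is matched to its most similar element of the second, and words with no sufficiently similar element incur a penalty. The toy vectors, threshold, and penalty weight are invented stand-ins for trained word2vec embeddings and the paper's actual function.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Toy word vectors standing in for word2vec embeddings.
VEC = {
    "cat": [1.0, 0.1], "feline": [0.9, 0.2],
    "sat": [0.1, 1.0], "mat": [0.3, 0.8], "dog": [0.7, 0.6],
}

def ssf_like(s1, s2, threshold=0.9, penalty=0.2):
    """Average best-match similarity of s1's words against s2, minus a
    penalty per unmatched word. Asymmetric: ssf_like(a, b) != ssf_like(b, a)
    in general, mirroring SSF's violation of the exchange rule."""
    total, misses = 0.0, 0
    for w in s1:
        best = max(cosine(VEC[w], VEC[v]) for v in s2)
        if best >= threshold:
            total += best
        else:
            misses += 1
    return total / len(s1) - penalty * misses / len(s1)

a = ["cat", "sat"]
b = ["feline", "sat", "mat", "dog"]
print(round(ssf_like(a, b), 3))  # high: every word of a is covered by b
print(round(ssf_like(b, a), 3))  # lower: "dog" has no close match in a
```

Indexing which vocabulary entries can exceed the threshold for each word, as the paper does at training time, removes the inner `max` scan at query time.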