• Title/Summary/Keyword: text vector

Hierarchical Text Categorization using Support Vector Machine (지지 벡터 기계를 이용한 계층적 문서 분류)

  • Yoon, Yong-Wook;Lee, Chang-Ki;Lee, Gary Geun-Bae
    • Annual Conference on Human and Language Technology / 2003.10d / pp.7-13 / 2003
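  • As the volume of documents created and transmitted over the Internet grows rapidly, automatic document classification is urgently needed to make information easier to access. The Support Vector Machine (SVM) has recently been widely used for document classification and performs better than other classifiers. Until now, however, SVMs have mainly been applied to non-hierarchical (flat) classification tasks. In contrast, this paper shows that hierarchical classification, which groups classes with similar properties into a hierarchical structure, is easier for people to understand, more convenient to use, and more effective than flat classification, which outputs the final class in a single step. We develop an effective SVM classifier for hierarchical classification and confirm through experiments that it can achieve better classification performance than flat classification.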

Real-time Unknown Word Identification Using Support Vector Machine For Chinese Text-to-Speech (중국어 음성합성을 위한 지지 벡터 기반 실시간 미등록어 처리)

  • Ha, Ju-Hong;Zheng, Yu;Lee, Gary G.
    • Annual Conference on Human and Language Technology / 2003.10d / pp.267-272 / 2003
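  • In building a speech synthesis system, converting the input text into an accurate pronunciation transcription is very important. Chinese has polyphonic characters (polyphony): single characters that are pronounced differently depending on meaning or usage. Because handling polyphonic characters is a rather complex problem, this paper addresses unknown-word processing for person names and place names, the factors that affect pronunciation most. Above all, a real-time speech synthesis system requires fast processing. This study therefore proceeds in two stages: it first selects candidate spans for unknown words and then estimates over the selected candidates. Candidate span selection uses the probabilities of monosyllable words (single-character words) and simple patterns. The final unknown-word estimation for the selected candidates is based on a Support Vector Machine (SVM).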

Text Classification for Patents: Experiments with Unigrams, Bigrams and Different Weighting Methods

  • Im, ChanJong;Kim, DoWan;Mandl, Thomas
    • International Journal of Contents / v.13 no.2 / pp.66-74 / 2017
  • Patent classification is becoming more critical as patent filings increase year over year. Despite comprehensive studies in the area, several issues remain in classifying patents at the hierarchical levels of the IPC. Both structural complexity and the shortage of patents at the lower levels of the hierarchy degrade classification performance. We therefore propose a new classification method based on different criteria: categories defined by domain experts in trend analysis reports, i.e., Patent Landscape Reports (PLRs). Several experiments were conducted to identify the feature types and weighting methods that lead to the best classification performance with a Support Vector Machine (SVM). Two types of features (nouns and noun phrases) and five weighting schemes (TF-idf, TF-rf, TF-icf, TF-icf-based, and TF-idcef-based) were tested.
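
As a rough illustration of the first weighting scheme named in this abstract, the sketch below computes a basic TF-idf weight per term. The exact variant used in the paper is not stated here, so the formula (raw term frequency times log(N / df)), the function name `tf_idf`, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical toy corpus of already-extracted noun features.
docs = [
    ["battery", "electrode", "battery"],
    ["battery", "antenna"],
]
weights = tf_idf(docs)
```

With this variant, a term that occurs in every document (here "battery") gets weight zero, which is why terms concentrated in few documents dominate the resulting feature vectors.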

Test Vector Generator of timing simulation for 224-bit ECDSA hardware (224비트 ECDSA 하드웨어 시간 시뮬레이션을 위한 테스트벡터 생성기)

  • Kim, Tae Hun;Jung, Seok Won
    • Journal of Internet of Things and Convergence / v.1 no.1 / pp.33-38 / 2015
  • Hardware is developed in a variety of architectures. For timing simulation, the values of the variables produced in each module at every clock cycle need to be verified. This paper presents a software test vector generator that produces test vectors for the timing simulation of 224-bit ECDSA hardware modules during development. It provides the test vectors both through a GUI and as text files.

A Dataset of Online Handwritten Assamese Characters

  • Baruah, Udayan;Hazarika, Shyamanta M.
    • Journal of Information Processing Systems / v.11 no.3 / pp.325-341 / 2015
  • This paper describes the Tezpur University dataset of online handwritten Assamese characters. The online data acquisition process involves the capturing of data as the text is written on a digitizer with an electronic pen. A sensor picks up the pen-tip movements, as well as pen-up/pen-down switching. The dataset contains 8,235 isolated online handwritten Assamese characters. Preliminary results on the classification of online handwritten Assamese characters using the above dataset are presented in this paper. The use of the support vector machine classifier and the classification accuracy for three different feature vectors are explored in our research.

Matching Algorithm for Hangul Recognition Based on PDA

  • Kim Hyeong-Gyun;Choi Gwang-Mi
    • Journal of information and communication convergence engineering / v.2 no.3 / pp.161-166 / 2004
  • Electronic ink is data stored as handwritten text or script, without conversion to ASCII through handwriting recognition, on pen-based computers and Personal Digital Assistants (PDAs) to support natural and convenient data input. One of the most important issues is searching electronic ink so that it can be used. We propose and implement a script matching algorithm for electronic ink. The algorithm separates each input stroke into a set of primitive strokes using the curvature of the stroke curve. After determining the type of each separated stroke, it produces a stroke feature vector. It then uses dynamic programming to calculate the distance between the feature vector of the input strokes and that of strokes in the database.
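
The dynamic-programming distance described in this abstract can be sketched as follows. The abstract does not give the cost function or the feature encoding, so this sketch assumes a unit-cost edit distance over sequences of primitive-stroke type labels; `stroke_distance`, `best_match`, the labels, and the database entries are all hypothetical.

```python
def stroke_distance(a, b):
    """Unit-cost edit distance between two sequences of primitive-stroke labels."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            cost = 0 if sa == sb else 1
            curr.append(min(
                prev[j] + 1,         # delete sa
                curr[j - 1] + 1,     # insert sb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

def best_match(query, database):
    """Return the key of the database entry closest to the query."""
    return min(database, key=lambda name: stroke_distance(query, database[name]))

# Hypothetical stored scripts, each a sequence of primitive-stroke types.
database = {
    "entry_a": ["line", "arc", "line"],
    "entry_b": ["arc", "arc", "loop", "line"],
}
query = ["line", "arc", "arc"]
closest = best_match(query, database)
```

An alignment-based distance like this tolerates a few extra or missing primitive strokes, which matters because the same script is rarely segmented identically twice.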

New Feature Selection Method for Text Categorization

  • Wang, Xingfeng;Kim, Hee-Cheol
    • Journal of information and communication convergence engineering / v.15 no.1 / pp.53-61 / 2017
  • The preferred feature selection methods for text classification are filter-based. In a common filter-based feature selection scheme, unique scores are assigned to features; then, these features are sorted according to their scores. The last step is to add the top-N features to the feature set. In this paper, we propose an improved global feature selection scheme wherein its last step is modified to obtain a more representative feature set. The proposed method aims to improve the classification performance of global feature selection methods by creating a feature set representing all classes almost equally. For this purpose, a local feature selection method is used in the proposed method to label features according to their discriminative power on classes; these labels are used while producing the feature sets. Experimental results obtained using the well-known 20 Newsgroups and Reuters-21578 datasets with the k-nearest neighbor algorithm and a support vector machine indicate that the proposed method improves the classification performance in terms of a widely known metric ($F_1$).
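
The common filter-based scheme that this abstract takes as its starting point (score each feature, sort, keep the top N) can be sketched as below. This is the baseline, not the paper's modified last step; document frequency is used as a stand-in scoring function, and the function names and toy documents are assumptions.

```python
from collections import Counter

def document_frequency_scores(docs):
    """Score each term by the number of documents it appears in (stand-in filter score)."""
    scores = Counter()
    for tokens in docs:
        scores.update(set(tokens))
    return scores

def select_top_n(scores, n):
    """Sort features by score (descending, ties broken alphabetically) and keep the top n."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical tokenized documents.
docs = [
    ["svm", "text", "vector"],
    ["text", "feature", "selection"],
    ["text", "vector", "knn"],
]
scores = document_frequency_scores(docs)
top = select_top_n(scores, 2)
```

Because a global top-N cut looks only at the overall score, it can leave some classes with few or no representative features, which is the imbalance the abstract says the proposed method targets.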

Hybrid Approach to Sentiment Analysis based on Syntactic Analysis and Machine Learning (구문분석과 기계학습 기반 하이브리드 텍스트 논조 자동분석)

  • Hong, Mun-Pyo;Shin, Mi-Young;Park, Shin-Hye;Lee, Hyung-Min
    • Language and Information / v.14 no.2 / pp.159-181 / 2010
  • This paper presents a hybrid approach to the sentiment analysis of online texts. The sentiment of a text refers to the feelings that its author has towards a certain topic. Many existing approaches are either pattern-based or machine learning based. The former shows relatively high precision in classifying sentiments but suffers from data sparseness, i.e., a lack of patterns. The latter shows relatively lower precision but 100% recall. The approach presented here adopts the merits of both: it combines the pattern-based approach with the machine learning based approach so that relatively high precision and high recall can both be maintained. Our experiment shows that the hybrid approach improves the F-measure by more than 50% compared with the pattern-based approach and by around 1% compared with the machine learning based approach. The improvement over the machine learning based approach may not seem encouraging numerically, but the current approach can classify not only the sentiment or polarity of sentences but also additional information such as the target of the sentiment, which makes it promising.

Design and Implementation of Self-networking and Replaceable Structure in Mobile Vector Graphics

  • Jeong Gu-Min;Na Seung-Won;Jung Doo-Hee;Lee Yang-Sun
    • Journal of Korea Multimedia Society / v.8 no.6 / pp.827-835 / 2005
  • In this paper, a self-networking and replaceable structure for vector graphics content is presented for wireless Internet services. Wireless networks over 2G or 3G are limited in speed and cost. Given these characteristics, a self-networking method and a replaceable structure for downloaded content are introduced to reduce the amount of data and to allow variations in the content. While the content is displayed, certain data for it is downloaded from the server and managed as needed for the content's operation. The downloaded material is applied to the original content through the replaceable structure. Downloading and modification are independent of playback. In this implementation, the data consists of control data and resource data for images, sound, or text. Compared with conventional methods that download the whole data, the amount of transmitted data is very small because only the difference is downloaded. Changes are also applied immediately during playback. All functions are implemented on a wireless handset, and various applications are discussed.

Speaker Identification Based on Vowel Classification and Vector Quantization (모음 인식과 벡터 양자화를 이용한 화자 인식)

  • Lim, Chang-Heon;Lee, Hwang-Soo;Un, Chong-Kwan
    • The Journal of the Acoustical Society of Korea / v.8 no.4 / pp.65-73 / 1989
  • In this paper, we propose a text-independent speaker identification algorithm based on vector quantization (VQ) and vowel classification, and we study its performance in comparison with a conventional VQ-based speaker identification algorithm. The proposed algorithm consists of three processes: vowel segmentation, vowel recognition, and average distortion calculation. Vowel segmentation is performed automatically using RMS energy, BTR (Back-to-Total cavity volume Ratio), and SFBR (Signed Front-to-Back maximum area Ratio) extracted from the input speech signal. If the input speech signal is noisy, particularly when the SNR is around 20dB, the proposed algorithm performs better than the reference algorithm, provided that vowel segmentation is correct. The same result is obtained when noisy telephone speech is used as the input.
