• Title/Summary/Keyword: text vector

Search Result 284, Processing Time 0.021 seconds

Topic Classification for Suicidology

  • Read, Jonathon;Velldal, Erik;Ovrelid, Lilja
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.2
    • /
    • pp.143-150
    • /
    • 2012
  • Computational techniques for topic classification can support qualitative research by automatically applying labels in preparation for qualitative analyses. This paper presents an evaluation of supervised learning techniques applied to one such use case, namely, that of labeling emotions, instructions and information in suicide notes. We train a collection of one-versus-all binary support vector machine classifiers, using cost-sensitive learning to deal with class imbalance. The features investigated range from a simple bag-of-words and n-grams over stems, to information drawn from syntactic dependency analysis and WordNet synonym sets. The experimental results are complemented by an analysis of systematic errors in both the output of our system and the gold-standard annotations.

The bootstrap VQ model for automatic speaker recognition system (VQ 방식의 화자인식 시스템 성능 향상을 위한 부쓰트랩 방식 적용)

  • Kyung YounJeong;Lee Jin-Ick;Lee Hwang-Soo
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • spring
    • /
    • pp.39-42
    • /
    • 2000
  • A bootstrap and aggregating (bagging) vector quantization (VQ) classifier is proposed for speaker recognition. This method obtains multiple training data sets by resampling the original training data set, and then integrates the corresponding multiple classifiers into a single classifier. Experiments involving a closed set, text-independent and speaker identification system are carried out using the TIMIT database. The proposed bagging VQ classifier shows considerably improved performance over the conventional VQ classifier.

  • PDF

Robust Algorithms for Combining Multiple Term Weighting Vectors for Document Classification

  • Kim, Minyoung
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.16 no.2
    • /
    • pp.81-86
    • /
    • 2016
  • Term weighting is a popular technique that effectively weighs the term features to improve accuracy in document classification. While several successful term weighting algorithms have been suggested, none of them appears to perform well consistently across different data domains. In this paper we propose several reasonable methods to combine different term weight vectors to yield a robust document classifier that performs consistently well on diverse datasets. Specifically we suggest two approaches: i) learning a single weight vector that lies in a convex hull of the base vectors while minimizing the class prediction loss, and ii) a mini-max classifier that aims for robustness of the individual weight vectors by minimizing the loss of the worst-performing strategy among the base vectors. We provide efficient solution methods for these optimization problems. The effectiveness and robustness of the proposed approaches are demonstrated on several benchmark document datasets, significantly outperforming the existing term weighting methods.

Approach for visualizing research trends: three-dimensional visualization of documents and analysis of relative vitalization

  • Yea, Sang-Jun;Kim, Chul
    • International Journal of Contents
    • /
    • v.10 no.1
    • /
    • pp.29-35
    • /
    • 2014
  • This paper proposes an approach for visualizing research trends using theme maps and extra information. The proposed algorithm includes the following steps. First, text mining is used to construct a vector space of keywords. Second, correspondence analysis is employed to reduce high-dimensionality and to express relationships between documents and keywords. Third, kernel density estimation is applied in order to generate three-dimensional data that can show the concentration of the set of documents. Fourth, a cartographical concept is adapted for visualizing research trends. Finally, relative vitalization information is provided for more accurate research trend analysis. The algorithm of the proposed approach is tested using papers about Traditional Korean Medicine.

Finding approximate occurrence of a pattern that contains gaps by the bit-vector approach

  • Lee, In-Bok;Park, Kun-Soo
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2003.10a
    • /
    • pp.193-199
    • /
    • 2003
  • The application of finding occurrences of a pattern that contains gaps includes information retrieval, data mining, and computational biology. As the biological sequences may contain errors, it is important to find not only the exact occurrences of a pattern but also approximate ones. In this paper we present an O(mnk$_{max}$/w) time algorithm for the approximate gapped pattern matching problem, where m is the length of the text, H is the length of the pattern, w is the word size of the target machine, and k$_{max}$ is the greatest error bound for subpatterns.

  • PDF

Reputation Analysis of Document Using Probabilistic Latent Semantic Analysis Based on Weighting Distinctions (가중치 기반 PLSA를 이용한 문서 평가 분석)

  • Cho, Shi-Won;Lee, Dong-Wook
    • The Transactions of The Korean Institute of Electrical Engineers
    • /
    • v.58 no.3
    • /
    • pp.632-638
    • /
    • 2009
  • Probabilistic Latent Semantic Analysis has many applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. In this paper, we propose an algorithm using weighted Probabilistic Latent Semantic Analysis Model to find the contextual phrases and opinions from documents. The traditional keyword search is unable to find the semantic relations of phrases, Overcoming these obstacles requires the development of techniques for automatically classifying semantic relations of phrases. Through experiments, we show that the proposed algorithm works well to discover semantic relations of phrases and presents the semantic relations of phrases to the vector-space model. The proposed algorithm is able to perform a variety of analyses, including such as document classification, online reputation, and collaborative recommendation.

Hand Gesture based Manipulation of Meeting Data in Teleconference (핸드제스처를 이용한 원격미팅 자료 인터페이스)

  • Song, Je-Hoon;Choi, Ki-Ho;Kim, Jong-Won;Lee, Yong-Gu
    • Korean Journal of Computational Design and Engineering
    • /
    • v.12 no.2
    • /
    • pp.126-136
    • /
    • 2007
  • Teleconferences have been used in business sectors to reduce traveling costs. Traditionally, specialized telephones that enabled multiparty conversations were used. With the introduction of high speed networks, we now have high definition videos that add more realism in the presence of counterparts who could be thousands of miles away. This paper presents a new technology that adds even more realism by telecommunicating with hand gestures. This technology is part of a teleconference system named SMS (Smart Meeting Space). In SMS, a person can use hand gestures to manipulate meeting data that could be in the form of text, audio, video or 3D shapes. Fer detecting hand gestures, a machine learning algorithm called SVM (Support Vector Machine) has been used. For the prototype system, a 3D interaction environment has been implemented with $OpenGL^{TM}$, where a 3D human skull model can be grasped and moved in 6-DOF during a remote conversation between distant persons.

A Sentiment Classification System Using Feature Extraction from Seed Words and Support Vector Machine (종자 어휘를 이용한 자질 추출과 지지 벡터 기계(SVM)을 이용한 문서 감정 분류 시스템의 개발)

  • Hwang, Jae-Won;Jeon, Tae-Gyun;Ko, Young-Joong
    • 한국HCI학회:학술대회논문집
    • /
    • 2007.02a
    • /
    • pp.938-942
    • /
    • 2007
  • 신문 기사 및 상품 평은 특정 주제나 상품을 대상으로 하여 글쓴이의 감정과 의견이 잘 나타나 있는 대표적인 문서이다. 최근 여론 조사 및 상품 의견 조사 등 다양한 측면에서 대용량의 문서의 의미적 분류 및 분석이 요구되고 있다. 본 논문에서는 문서에 나타난 내용을 기준으로 문서가 나타내고 있는 감정을 긍정과 부정의 두 가지 범주로 분류하는 시스템을 구현한다. 문서 분류의 시작은 감정을 지닌 대표적인 종자 어휘(seed word)로부터 시작하며, 자질의 선정은 한국어 특징상 감정 및 감각을 표현하는 명사, 형용사, 부사, 동사를 대상으로 한다. 가중치 부여 방법은 한글 유의어 사전을 통해 종자 어휘의 의미를 확장하여 각각의 가중치를 책정한다. 단어 벡터로 표현된 입력 문서를 이진 분류기인 지지벡터 기계를 이용하여 문서에 나타난 감정을 판단하는 시스템을 구현하고 그 성능을 평가한다.

  • PDF

A Hangul Document Classification System using Case-based Reasoning (사례기반 추론을 이용한 한글 문서분류 시스템)

  • Lee, Jae-Sik;Lee, Jong-Woon
    • Asia pacific journal of information systems
    • /
    • v.12 no.2
    • /
    • pp.179-195
    • /
    • 2002
  • In this research, we developed an efficient Hangul document classification system for text mining. We mean 'efficient' by maintaining an acceptable classification performance while taking shorter computing time. In our system, given a query document, k documents are first retrieved from the document case base using the k-nearest neighbor technique, which is the main algorithm of case-based reasoning. Then, TFIDF method, which is the traditional vector model in information retrieval technique, is applied to the query document and the k retrieved documents to classify the query document. We call this procedure 'CB_TFIDF' method. The result of our research showed that the classification accuracy of CB_TFIDF was similar to that of traditional TFIDF method. However, the average time for classifying one document decreased remarkably.