• Title/Summary/Keyword: TF-IDF Model

Search Result 76, Processing Time 0.029 seconds

Style-Specific Language Model Adaptation using TF*IDF Similarity for Korean Conversational Speech Recognition

  • Park, Young-Hee;Chung, Min-Hwa
    • The Journal of the Acoustical Society of Korea
    • /
    • v.23 no.2E
    • /
    • pp.51-55
    • /
    • 2004
  • In this paper, we propose a style-specific language model adaptation scheme using n-gram based tf*idf similarity for Korean spontaneous speech recognition. Korean spontaneous speech shows especially different style-specific characteristics such as filled pauses, word omission, and contraction, which are related to function words and depend on preceding or following words. To reflect these style-specific characteristics and overcome insufficient data for training language model, we estimate in-domain dependent n-gram model by relevance weighting of out-of-domain text data according to their n-. gram based tf*idf similarity, in which in-domain language model include disfluency model. Recognition results show that n-gram based tf*idf similarity weighting effectively reflects style difference.

Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model (공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법)

  • Cho, Yoon-Ho;Lee, Sang-Keun
    • The Journal of the Korea Contents Association
    • /
    • v.9 no.2
    • /
    • pp.142-151
    • /
    • 2009
  • Previous document clustering method, NSTC measures similarities between two document pairs using TF-IDF during web document clustering. In this paper, we propose new similarity measure using common phrase-based relational graph, not TF-IDF. This method suggests that weighting common phrases by relational graph presenting relationship among common phrases in document collection. And experimental results indicate that proposed method is more effective in clustering document collection than NSTC.

Keyword Extraction from News Corpus using Modified TF-IDF (TF-IDF의 변형을 이용한 전자뉴스에서의 키워드 추출 기법)

  • Lee, Sung-Jick;Kim, Han-Joon
    • The Journal of Society for e-Business Studies
    • /
    • v.14 no.4
    • /
    • pp.59-73
    • /
    • 2009
  • Keyword extraction is an important and essential technique for text mining applications such as information retrieval, text categorization, summarization and topic detection. A set of keywords extracted from a large-scale electronic document data are used for significant features for text mining algorithms and they contribute to improve the performance of document browsing, topic detection, and automated text classification. This paper presents a keyword extraction technique that can be used to detect topics for each news domain from a large document collection of internet news portal sites. Basically, we have used six variants of traditional TF-IDF weighting model. On top of the TF-IDF model, we propose a word filtering technique called 'cross-domain comparison filtering'. To prove effectiveness of our method, we have analyzed usefulness of keywords extracted from Korean news articles and have presented changes of the keywords over time of each news domain.

  • PDF

Construction of Onion Sentiment Dictionary using Cluster Analysis (군집분석을 이용한 양파 감성사전 구축)

  • Oh, Seungwon;Kim, Min Soo
    • Journal of the Korean Data Analysis Society
    • /
    • v.20 no.6
    • /
    • pp.2917-2932
    • /
    • 2018
  • Many researches are accomplished as a result of the efforts of developing the production predicting model to solve the supply imbalance of onions which are vegetables very closely related to Korean food. But considering the possibility of storing onions, it is very difficult to solve the supply imbalance of onions only with predicting the production. So, this paper's purpose is trying to build a sentiment dictionary to predict the price of onions by using the internet articles which include the informations about the production of onions and various factors of the price, and these articles are very easy to access on our daily lives. Articles about onions are from 2012 to 2016, using TF-IDF for comparing with four kinds of TF-IDFs through the documents classification of wholesale prices of onions. As a result of classifying the positive/negative words for price by k-means clustering, DBSCAN (density based spatial cluster application with noise) clustering, GMM (Gaussian mixture model) clustering which are partitional clustering, GMM clustering is composed with three meaningful dictionaries. To compare the reasonability of these built dictionary, applying classified articles about the rise and drop of the price on logistic regression, and it shows 85.7% accuracy.

An Exploratory Study on Survey Data Categorization using DDI metadata (메타데이터를 활용한 조사자료의 문서범주화에 관한 연구)

  • Park, Ja-Hyun;Song, Min
    • Proceedings of the Korean Society for Information Management Conference
    • /
    • 2012.08a
    • /
    • pp.73-76
    • /
    • 2012
  • 본 연구는 DDI 메타데이터를 활용하여 귀납적 학습모델(supervised learning model)의 문서범주화 실험을 수행함으로써 조사자료의 체계적이고 효율적인 분류작업을 설계하는데 그 목적이 있다. 구체적으로 조사자료의 DDI 메타데이터를 대상으로 단순 TF 가중치, TF-IDF 가중치, Okapi TF 가중치에 따른 나이브 베이즈(Naive Bayes), kNN(k nearest neighbor), 결정트리(Decision tree) 분류기의 성능비교 실험을 하였다. 그 결과, 나이브 베이즈가 가장 좋은 성능을 보였으며, 단순 TF 가중치와 TF-IDF 가중치는 나이브 베이즈, kNN, 결정트리 분류기에서 동일한 성능을 보였으나, Okapi TF 가중치의 경우 나이브 베이즈에서 가장 좋은 성능을 보였다.

  • PDF

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services
    • /
    • v.20 no.1
    • /
    • pp.1-10
    • /
    • 2019
  • Since big-data text mining extracts many features and data, clustering and classification can result in high computational complexity and low reliability of the analysis results. In particular, a term document matrix obtained through text mining represents term-document features, but produces a sparse matrix. We designed an advanced genetic algorithm (GA) to extract features in text mining for detection model. Term frequency inverse document frequency (TF-IDF) is used to reflect the document-term relationships in feature extraction. Through a repetitive process, a predetermined number of features are selected. And, we used the sparsity score to improve the performance of detection model. If a spam mail data set has the high sparsity, detection model have low performance and is difficult to search the optimization detection model. In addition, we find a low sparsity model that have also high TF-IDF score by using s(F) where the numerator in fitness function. We also verified its performance by applying the proposed algorithm to text classification. As a result, we have found that our algorithm shows higher performance (speed and accuracy) in attack mail classification.

Analysis of whether the feeling of relative deprivation is shown in the comments of the Luxury Howl YouTube video - Focusing on modern sentiment analysis using TF-IDF, Word2vec, LDA and LSTM - (명품 하울 유튜브 영상 댓글에 나타난 상대적 박탈감 여부와 특징 분석 - TF-IDF, Word2vec, LDA, LSTM을 이용한 현대인의 감정 분석을 중심으로 -)

  • Choi, Jung Min;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.3
    • /
    • pp.355-360
    • /
    • 2021
  • Recently Youtube has been more popular. As many studies show the comparative deprivation of the Social Medeia, this study looks into whether the comparative deprivation is expressed on the YouTube comments. It focuses on the Luxury Haul contents, videos about huge amounts of luxurious products, of which Youtubers'economic feature are demonstrative. The comments of the videos are analyzed with LDA TF-IDF and Word2Vec. Additionally, the comments were classified into positive and negative groups by the LSTM model as well. As a result of the study, even though many comments turned out positive, the negative keywords were indicated related to comparative deprivation. Also it was found that the viewers compared themselves with Youtubers. In particular, some YouTubers are more criticized if they are younger or does not seem to afford the luxurious products themselves. This study suggests that the users express the comparative deprivation on YouTube as well like on the other Social Media.

Semantic Extention Search for Documents Using the Word2vec (Word2vec을 활용한 문서의 의미 확장 검색방법)

  • Kim, Woo-ju;Kim, Dong-he;Jang, Hee-won
    • The Journal of the Korea Contents Association
    • /
    • v.16 no.10
    • /
    • pp.687-692
    • /
    • 2016
  • Conventional way to search documents is keyword-based queries using vector space model, like tf-idf. Searching process of documents which is based on keywords can make some problems. it cannot recogize the difference of lexically different but semantically same words. This paper studies a scheme of document search based on document queries. In particular, it uses centrality vectors, instead of tf-idf vectors, to represent query documents, combined with the Word2vec method to capture the semantic similarity in contained words. This scheme improves the performance of document search and provides a way to find documents not only lexically, but semantically close to a query document.

Analysis of User Reviews for Webtoon Applications Using Text Mining (텍스트 마이닝을 활용한 웹툰 애플리케이션 사용자 리뷰 분석)

  • Shin, Hyorim;Choi, Junho
    • The Journal of the Convergence on Culture Technology
    • /
    • v.8 no.4
    • /
    • pp.457-468
    • /
    • 2022
  • With the rapid growth of the webtoon industry, a new model for webtoon applications has emerged. We have entered the era of webtoon application version 3.0 after ver 1.0 and ver 2.0. Despite these changes, research on user review analysis for webtoon applications is still insufficient. Therefore, this study aims to analyze user reviews for 'Kakao Webtoon (Daum Webtoon)' that presented the webtoon application 3.0 model. For analysis, 20,382 application reviews were collected and pre-processed, and TF-IDF, network analysis, topic modeling, and emotional analysis were conducted for each version. As a result, the user experience of the webtoon application for each version was analyzed and usability testing conducted.

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

  • Lee, Bo-Hui;Lee, Su-Jin;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.33 no.5
    • /
    • pp.615-625
    • /
    • 2020
  • The document-term frequency matrix is a term extracted from documents in which the group information exists in text mining. In this study, we generated the document-term frequency matrix for document classification according to research field. We applied the traditional term weighting function term frequency-inverse document frequency (TF-IDF) to the generated document-term frequency matrix. In addition, we applied term frequency-inverse gravity moment (TF-IGM). We also generated a document-keyword weighted matrix by extracting keywords to improve the document classification accuracy. Based on the keywords matrix extracted, we classify documents using a deep neural network. In order to find the optimal model in the deep neural network, the accuracy of document classification was verified by changing the number of hidden layers and hidden nodes. Consequently, the model with eight hidden layers showed the highest accuracy and all TF-IGM document classification accuracy (according to parameter changes) were higher than TF-IDF. In addition, the deep neural network was confirmed to have better accuracy than the support vector machine. Therefore, we propose a method to apply TF-IGM and a deep neural network in the document classification.