• Title/Summary/Keyword: TF-IDF 가중치

Search Result 63, Processing Time 0.031 seconds

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

  • Lee, Bo-Hui;Lee, Su-Jin;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.33 no.5
    • /
    • pp.615-625
    • /
    • 2020
  • The document-term frequency matrix is a term extracted from documents in which the group information exists in text mining. In this study, we generated the document-term frequency matrix for document classification according to research field. We applied the traditional term weighting function term frequency-inverse document frequency (TF-IDF) to the generated document-term frequency matrix. In addition, we applied term frequency-inverse gravity moment (TF-IGM). We also generated a document-keyword weighted matrix by extracting keywords to improve the document classification accuracy. Based on the keywords matrix extracted, we classify documents using a deep neural network. In order to find the optimal model in the deep neural network, the accuracy of document classification was verified by changing the number of hidden layers and hidden nodes. Consequently, the model with eight hidden layers showed the highest accuracy and all TF-IGM document classification accuracy (according to parameter changes) were higher than TF-IDF. In addition, the deep neural network was confirmed to have better accuracy than the support vector machine. Therefore, we propose a method to apply TF-IGM and a deep neural network in the document classification.

TF-IDF Based Association Rule Analysis System for Medical Data (의료 정보 추출을 위한 TF-IDF 기반의 연관규칙 분석 시스템)

  • Park, Hosik;Lee, Minsu;Hwang, Sungjin;Oh, Sangyoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.5 no.3
    • /
    • pp.145-154
    • /
    • 2016
  • Because of the recent interest in the u-Health and development of IT technology, a need of utilizing a medical information data has been increased. Among previous studies that utilize various data mining algorithms for processing medical information data, there are studies of association rule analysis. In the studies, an association between the symptoms with specified diseases is the target to discover, however, infrequent terms which can be important information for a disease diagnosis are not considered in most cases. In this paper, we proposed a new association rule mining system considering the importance of each term using TF-IDF weight to consider infrequent but important items. In addition, the proposed system can predict candidate diagnoses from medical text records using term similarity analysis based on medical ontology.

Study on Extraction of Keywords Using TF-IDF and Text Structure of Novels (TF-IDF와 소설 텍스트의 구조를 이용한 주제어 추출 연구)

  • You, Eun-Soon;Choi, Gun-Hee;Kim, Seung-Hoon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.2
    • /
    • pp.121-129
    • /
    • 2015
  • With the explosive growth of information about books, there is a growing number of customers who find it difficult to pick a book. Against the backdrop, the importance of a book recommendation system becomes greater, through which appropriate information about books could be offered then to encourage customers to buy a book in the end. However, existing recommendation systems based on the bibliographical information or user data reveal the reliability issue found in their recommendation results. This is why it is necessary to reflect semantic information extracted from the texts of a book's main body in a recommendation system. Accordingly, this paper suggests a method for extracting keywords from the main body of novels, as a preceding research, by using TF-IDF method as well as the text structure. To this end, the texts of 100 novels have been collected then to divide them into four structural elements of preface, dialogue, non-dialogue and closing. Then, the TF-IDF weight of each keyword has been calculated. The calculation results show that the extraction accuracy of keywords improves by 42.1% in performance when more weight is given to dialogue while including preface and closing instead of using just the main body.

A Study on Optimization of Support Vector Machine Classifier for Word Sense Disambiguation (단어 중의성 해소를 위한 SVM 분류기 최적화에 관한 연구)

  • Lee, Yong-Gu
    • Journal of Information Management
    • /
    • v.42 no.2
    • /
    • pp.193-210
    • /
    • 2011
  • The study was applied to context window sizes and weighting method to obtain the best performance of word sense disambiguation using support vector machine. The context window sizes were used to a 3-word, sentence, 50-bytes, and document window around the targeted word. The weighting methods were used to Binary, Term Frequency(TF), TF ${\times}$ Inverse Document Frequency(IDF), and Log TF ${\times}$ IDF. As a result, the performance of 50-bytes in the context window size was best. The Binary weighting method showed the best performance.

Relevance Feedback Experiments for Korean Information Retrieval Systems (한국어 정보검색 시스템을 위한 다양한 적합성 피드백 방법의 실험)

  • Park, Su-Hyeon;Gwon, Hyeok-Cheol
    • Journal of KIISE:Software and Applications
    • /
    • v.26 no.5
    • /
    • pp.682-691
    • /
    • 1999
  • 정보검색 시스템의 검색 효율 향상을 위해서 다양한 적합성 피드백 방법이 개발되었다. 그러나 한국어 정보검색 시스템을 위한 적합성 피드백에 대한 연구는 거의 이루어지지 않은 실정이다. 이 논문에서는 기존에 개발된 적합성 피드백 방법을 한국어 정보 시스템에 적용하여 검색 효율을 비교하고, 새로운 적합성 피드백 방법을 개발 적용하여 기존의 방법들과 검색 효율을 비교분석하였다. 적합성 피드백은 원질의문을 확장할 단어 선택과 선택된 단어 가중치 부여로 이루어진다. 원질의문이 입력되면 검색된 적합문서에서 원질의문을 단어와 밀접한 관계가 있는 단어를 선택하기 위하여 가중치를 부가한후, 원질의문에 추가하여 질의문을 확장한다. 이 논문에서는 원질의문 확장을 위한 단어 선택과 단어 가중치 부여를 위해 3가지 값을 사용한다. 첫째, TF는 적합문서 내의 단어 빈도의 총합이다. 둘째, idf는 해당 문서집단의 역문헌빈도이다. 셋째, r/R은 검색된 적합문서 중에서 해당단어가 있는 적합문서의 비율을 나타낸다. TF와 idf는 정보검색 시스템에서 일반적으로 사용되고있는 값이고 r/R은 이 논문에서 제안한 새로운 값이다.

A Validation of Effectiveness for Intrusion Detection Events Using TF-IDF (TF-IDF를 이용한 침입탐지이벤트 유효성 검증 기법)

  • Kim, Hyoseok;Kim, Yong-Min
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.28 no.6
    • /
    • pp.1489-1497
    • /
    • 2018
  • Web application services have diversified. At the same time, research on intrusion detection is continuing due to the surge of cyber threats. Also, As a single-defense system evolves into multi-level security, we are responding to specific intrusions by correlating security events that have become vast. However, it is difficult to check the OS, service, web application type and version of the target system in real time, and intrusion detection events occurring in network-based security devices can not confirm vulnerability of the target system and success of the attack A blind spot can occur for threats that are not analyzed for problems and associativity. In this paper, we propose the validation of effectiveness for intrusion detection events using TF-IDF. The proposed scheme extracts the response traffics by mapping the response of the target system corresponding to the attack. Then, Response traffics are divided into lines and weights each line with an TF-IDF weight. we checked the valid intrusion detection events by sequentially examining the lines with high weights.

An Improved Recommendation Algorithm Based on Two-layer Attention Mechanism

  • Kim, Hye-jin
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.10
    • /
    • pp.185-198
    • /
    • 2021
  • With the development of Internet technology, because traditional recommendation algorithms cannot learn the in-depth characteristics of users or items, this paper proposed a recommendation algorithm based on the AMITI(attention mechanism and improved TF-IDF) to solve this problem. By introducing the two-layer attention mechanism into the CNN, the feature extraction ability of the CNN is improved, and different preference weights are assigned to item features, recommendations that are more in line with user preferences are achieved. When recommending items to target users, the scoring data and item type data are combined with TF-IDF to complete the grouping of the recommendation results. In this paper, the experimental results on the MovieLens-1M data set show that the AMITI algorithm improves the accuracy of recommendation to a certain extent and enhances the orderliness and selectivity of presentation methods.

A Study on Improving Precision Rate in Security Events Using Cyber Attack Dictionary and TF-IDF (공격키워드 사전 및 TF-IDF를 적용한 침입탐지 정탐률 향상 연구)

  • Jongkwan Kim;Myongsoo Kim
    • Convergence Security Journal
    • /
    • v.22 no.2
    • /
    • pp.9-19
    • /
    • 2022
  • As the expansion of digital transformation, we are more exposed to the threat of cyber attacks, and many institution or company is operating a signature-based intrusion prevention system at the forefront of the network to prevent the inflow of attacks. However, in order to provide appropriate services to the related ICT system, strict blocking rules cannot be applied, causing many false events and lowering operational efficiency. Therefore, many research projects using artificial intelligence are being performed to improve attack detection accuracy. Most researches were performed using a specific research data set which cannot be seen in real network, so it was impossible to use in the actual system. In this paper, we propose a technique for classifying major attack keywords in the security event log collected from the actual system, assigning a weight to each key keyword, and then performing a similarity check using TF-IDF to determine whether an actual attack has occurred.

A Study on Negation Handling and Term Weighting Schemes and Their Effects on Mood-based Text Classification (감정 기반 블로그 문서 분류를 위한 부정어 처리 및 단어 가중치 적용 기법의 효과에 대한 연구)

  • Jung, Yu-Chul;Choi, Yoon-Jung;Myaeng, Sung-Hyon
    • Korean Journal of Cognitive Science
    • /
    • v.19 no.4
    • /
    • pp.477-497
    • /
    • 2008
  • Mood classification of blog text is an interesting problem, with a potential for a variety of services involving the Web. This paper introduces an approach to mood classification enhancements through the normalized negation n-grams which contain mood clues and corpus-specific term weighting(CSTW). We've done experiments on blog texts with two different classification methods: Enhanced Mood Flow Analysis(EMFA) and Support Vector Machine based Mood Classification(SVMMC). It proves that the normalized negation n-gram method is quite effective in dealing with negations and gave gradual improvements in mood classification with EMF A. From the selection of CSTW, we noticed that the appropriate weighting scheme is important for supporting adequate levels of mood classification performance because it outperforms the result of TF*IDF and TF.

  • PDF

A Study on the Performance Improvement of Rocchio Classifier with Term Weighting Methods (용어 가중치부여 기법을 이용한 로치오 분류기의 성능 향상에 관한 연구)

  • Kim, Pan-Jun
    • Journal of the Korean Society for information Management
    • /
    • v.25 no.1
    • /
    • pp.211-233
    • /
    • 2008
  • This study examines various weighting methods for improving the performance of automatic classification based on Rocchio algorithm on two collections(LISA, Reuters-21578). First, three factors for weighting are identified as document factor, document factor, category factor for each weighting schemes, the performance of each was investigated. Second, the performance of combined weighting methods between the single schemes were examined. As a result, for the single schemes based on each factor, category-factor-based schemes showed the best performance, document set-factor-based schemes the second, and document-factor-based schemes the worst. For the combined weighting schemes, the schemes(idf*cat) which combine document set factor with category factor show better performance than the combined schemes(tf*cat or ltf*cat) which combine document factor with category factor as well as the common schemes (tfidf or ltfidf) that combining document factor with document set factor. However, according to the results of comparing the single weighting schemes with combined weighting schemes in the view of the collections, while category-factor-based schemes(cat only) perform best on LISA, the combined schemes(idf*cat) which combine document set factor with category factor showed best performance on the Reuters-21578. Therefore for the practical application of the weighting methods, it needs careful consideration of the categories in a collection for automatic classification.