• Title/Summary/Keyword: inverse document frequency (역문서 빈도)


A Study on Korean Fake news Detection Model Using Word Embedding (워드 임베딩을 활용한 한국어 가짜뉴스 탐지 모델에 관한 연구)

  • Shim, Jae-Seung;Lee, Jaejun;Jeong, Ii Tae;Ahn, Hyunchul
    • Proceedings of the Korean Society of Computer Information Conference / 2020.07a / pp.199-202 / 2020
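  • This paper proposes a method for improving the performance of a fake news detection model by incorporating a word embedding technique. Existing research on Korean fake news detection has mainly relied on detection models that use term frequency-inverse document frequency (TF-IDF), a sparse representation. From the perspective of fake news detection, however, this approach is limited in capturing the linguistic characteristics of news; in particular, it cannot structurally reflect the linguistic characteristics revealed by context. We therefore built the proposed model, a fake news detection model that also reflects contextual information through text preprocessing with Word2vec, a word embedding technique based on dense representations. We then built a TF-IDF-based fake news detection model as the comparison model and verified performance by comparing the two. The results confirmed that the proposed Word2vec-based model performed better.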

Clustering of Web Document Exploiting with the Co-link in Hypertext (동시링크를 이용한 웹 문서 클러스터링 실험)

  • 김영기;이원희;권혁철
    • Journal of Korean Library and Information Science Society / v.34 no.2 / pp.233-253 / 2003
  • Knowledge organization is the way we humans understand the world. Two types of information organization mechanisms are studied in information retrieval: classification and clustering. Classification organizes entities by pigeonholing them into predefined categories, whereas clustering organizes information by grouping similar or related entities together. Internet information resource systems extract keywords from the words that appear in a web document and build an inverted file. Term clustering, which is based on grouping related terms, did not prove very successful, however, and was mostly abandoned for documents written in different languages or for doorway pages composed only of anchor text. This study examines the possibility of informetric analysis and clustering of web documents based on the co-link topology of web pages.

A Focused Crawler by Segmentation of Context Information (주변정보 분할을 이용한 주제 중심 웹 문서 수집기)

  • Cho, Chang-Hee;Lee, Nam-Yong;Kang, Jin-Bum;Yang, Jae-Young;Choi, Joong-Min
    • The KIPS Transactions:PartB / v.12B no.6 s.102 / pp.697-702 / 2005
  • The focused crawler is a topic-driven document-collecting crawler that was suggested as a promising alternative for maintaining up-to-date web document indices in search engines. A major problem inherent in previous focused crawlers is the risk of missing highly relevant documents that are linked from off-topic documents. This problem mainly originates from the lack of consideration of structural information in a document. Traditional weighting methods such as TFIDF, employed in document classification, can lead to this problem. To improve the performance of focused crawlers, this paper proposes a scheme of locality-based document segmentation to determine the relevance of a document to a specific topic. We segment a document into a set of sub-documents using contextual features around the hyperlinks. This information is used to determine whether the crawler should fetch the documents that are linked from hyperlinks in an off-topic document.

An Automatic Classification System of Official Documents in Middle Schools Using Term Weighting of Titles (제목의 단어 가중치를 이용한 중등학교 공문서 자동분류시스템)

  • Kang, Hyun-Hee;Jin, Min
    • Journal of The Korean Association of Information Education / v.7 no.2 / pp.219-226 / 2003
  • Classifying official documents in schools and educational institutions takes a lot of time. To reduce this overhead, this paper proposes an automatic document classification method that uses word information from document titles. First, meaningful words are extracted from the titles of existing documents, and Inverse Document Frequency (IDF) weights of the words are calculated for each category. Then a word weight dictionary is built. Using this dictionary, each document is automatically classified into the category for which the sum of the weights of its title words is highest. We also evaluate the performance of the proposed method using a real dataset from a middle school.
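
The title-based classification steps in this abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it assumes whitespace tokenization instead of real keyword extraction, and it assumes the per-category IDF treats each category as one "document", so a word that occurs in fewer categories gets a higher weight. The function names and the sample titles are hypothetical.

```python
import math
from collections import defaultdict

def build_weight_dictionary(titled_docs):
    """titled_docs: list of (category, title) pairs.
    Returns {category: {word: weight}} (assumed weighting, see lead-in)."""
    words_by_category = defaultdict(set)
    for category, title in titled_docs:
        words_by_category[category].update(title.lower().split())

    num_categories = len(words_by_category)
    # Category frequency: in how many categories does each word occur?
    category_freq = defaultdict(int)
    for words in words_by_category.values():
        for word in words:
            category_freq[word] += 1

    # IDF-style weight: rarer across categories means a larger weight.
    return {
        category: {w: math.log(num_categories / category_freq[w]) for w in words}
        for category, words in words_by_category.items()
    }

def classify(title, weights):
    """Return the category with the highest summed weight of title words."""
    tokens = title.lower().split()
    scores = {
        category: sum(word_weights.get(t, 0.0) for t in tokens)
        for category, word_weights in weights.items()
    }
    return max(scores, key=scores.get)

# Hypothetical training titles, for illustration only.
training = [
    ("personnel", "teacher transfer request"),
    ("personnel", "teacher leave notice"),
    ("budget", "budget request notice"),
    ("budget", "equipment purchase budget"),
    ("events", "sports day notice"),
]
weights = build_weight_dictionary(training)
predicted = classify("equipment budget plan", weights)
```

In this toy data the word "notice" occurs in every category, so its weight is zero and it does not influence the decision; words unique to one category dominate the score.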

Clustering Meta Information of K-Pop Girl Groups Using Term Frequency-inverse Document Frequency Vectorization (단어-역문서 빈도 벡터화를 통한 한국 걸그룹의 음반 메타 정보 군집화)

  • JoonSeo Hyeon;JaeHyuk Cho
    • Journal of Platform Technology / v.11 no.3 / pp.12-23 / 2023
  • In the 2020s, the K-Pop market has been dominated by girl groups over boy groups and by the fourth generation over the third generation. This paper presents methods and results for lyric clustering to investigate whether the generation of girl groups has started to change. We collected meta-information for 1469 songs by 47 groups released from 2013 to 2022, divided it into lyric information and non-lyric meta-information, and quantified each separately. The lyric information was preprocessed by applying term frequency-inverse document frequency (TF-IDF) vectorization, following previous studies, and then selecting only the top vector values. The non-lyric meta-information was preprocessed with One-Hot Encoding to reduce the bias of using only lyric information and to obtain better clustering results. On the preprocessed data, Spherical K-Means scores 129% and 45% higher than Hierarchical Clustering in Silhouette Score and Calinski-Harabasz Score, respectively. This paper is expected to contribute to the study of Korean popular song development and to girl group lyrics analysis and clustering.
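
The two preprocessing steps described in this abstract, TF-IDF vectorization that keeps only the top values and one-hot encoding of non-lyric metadata, might look like the sketch below. The tokenized lyrics, the `genre` field, and the choice of `k` are hypothetical; the abstract does not state the exact TF-IDF variant or which metadata fields were encoded.

```python
import math
from collections import Counter

def tfidf_top_k(tokenized_docs, k):
    """Per-document TF-IDF dictionaries keeping only the k highest-weighted terms."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
        # Sort by descending weight, breaking ties alphabetically.
        top = sorted(weights, key=lambda t: (-weights[t], t))[:k]
        vectors.append({t: weights[t] for t in top})
    return vectors

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 list over a fixed category order."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical tokenized lyrics and metadata, for illustration only.
lyrics = [
    ["love", "love", "night", "you"],
    ["dance", "night", "you"],
    ["love", "dream", "you"],
]
vectors = tfidf_top_k(lyrics, k=2)
genre_vector = one_hot("ballad", ["ballad", "dance"])
```

A term such as "you" that appears in every song gets a TF-IDF weight of zero, so it never survives the top-k selection.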

Analysis of User Reviews of Running Applications Using Text Mining: Focusing on Nike Run Club and Runkeeper (텍스트마이닝을 활용한 러닝 어플리케이션 사용자 리뷰 분석: Nike Run Club과 Runkeeper를 중심으로)

  • Gimun Ryu;Ilgwang Kim
    • Journal of Industrial Convergence / v.22 no.4 / pp.11-19 / 2024
  • The purpose of this study was to analyze user reviews of running applications using text mining. The analysis data were user reviews of Nike Run Club and Runkeeper on the Google Play Store, collected with the selenium package for Python 3; morphemes were separated with the OKT analyzer, keeping only Korean nouns. After morpheme separation, we created a rankNL dictionary to remove stopwords. To analyze the data, we used TF, TF-IDF, and LDA topic modeling. The results of this study are as follows. First, 'record', 'app', and 'workout' were identified as the top keywords in the user reviews of both Nike Run Club and Runkeeper, and their rankings differed between TF and TF-IDF. Second, LDA topic modeling identified the topics 'basic items', 'additional features', 'errors', and 'location-based data' for Nike Run Club, and the topics 'errors', 'voice function', 'running data', 'benefits', and 'motivation' for Runkeeper. Based on the results, it is recommended that errors be fixed and improvements made to strengthen the competitiveness of the applications.
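
The difference between TF and TF-IDF rankings reported in this abstract can be illustrated with a small sketch. It assumes a corpus-level score equal to total term frequency multiplied by log(N / document frequency), which is one common variant and not necessarily the one used in the study; the review tokens are made up.

```python
import math
from collections import Counter

def rank_keywords(documents):
    """Return (tf_ranking, tfidf_ranking) for a list of tokenized documents."""
    n_docs = len(documents)
    tf = Counter(token for doc in documents for token in doc)
    df = Counter(token for doc in documents for token in set(doc))
    # Corpus-level TF-IDF: total frequency scaled by inverse document frequency.
    tfidf = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    by_tf = sorted(tf, key=lambda t: (-tf[t], t))
    by_tfidf = sorted(tfidf, key=lambda t: (-tfidf[t], t))
    return by_tf, by_tfidf

# Hypothetical noun tokens from four reviews.
reviews = [
    ["app", "record", "error"],
    ["app", "record", "workout"],
    ["app", "error", "error"],
    ["app", "workout"],
]
by_tf, by_tfidf = rank_keywords(reviews)
```

Here "app" is the most frequent word but appears in every review, so it falls to the bottom of the TF-IDF ranking, while "error", which is concentrated in fewer reviews, rises to the top.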

Research of Term-Weighting Method in an Usenet Information Retrieval System (유즈넷 정보검색시스템에서 단어 가중치 적용방법에 관한연구)

  • 최재덕;최진석;박민식
    • Proceedings of the Korean Information Science Society Conference / 1998.10b / pp.339-341 / 1998
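  • Usenet, one of many means of exchanging information, holds a vast amount of information. Because users cannot easily find the information they need on Usenet, they recognize the need for information retrieval across entire newsgroups and article bodies. This paper aims to improve retrieval effectiveness by improving the term-weighting method applied when an information retrieval system is extended to Usenet. Using category frequency (cf), a factor other than tf and idf that affects the importance of a term in information retrieval, we present and validate a similarity calculation method that adds inverted category frequency (icf) to the tf*idf method. The experimental results show that the average number of relevant documents within the top 30 is 4.6% higher with the tf*$\sqrt{idf^2 + icf^2}$ method than with the tf*idf method.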
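
The two weighting schemes compared in this abstract can be written out as a short sketch. The combined form, tf multiplied by the square root of idf squared plus icf squared, comes from the abstract; the logarithmic definitions of idf and icf below are assumptions made by analogy with the usual idf formula, and the example counts are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency (assumed form: log of N over document frequency)."""
    return math.log(n_docs / doc_freq)

def icf(n_categories, category_freq):
    """Inverted category frequency (assumed analogous to idf, over categories)."""
    return math.log(n_categories / category_freq)

def combined_weight(tf, idf_value, icf_value):
    """tf * sqrt(idf^2 + icf^2), the combined weight named in the abstract."""
    return tf * math.sqrt(idf_value ** 2 + icf_value ** 2)

# Hypothetical term: occurs 3 times in an article, in 10 of 1000 articles,
# and in 2 of 50 newsgroups (treated here as categories).
n_docs, n_categories = 1000, 50
tf, doc_freq, category_freq = 3, 10, 2

baseline = tf * idf(n_docs, doc_freq)
combined = combined_weight(tf, idf(n_docs, doc_freq), icf(n_categories, category_freq))
```

Under these assumed definitions, a term that occurs in every category has icf equal to zero and the combined weight reduces to plain tf*idf; a term confined to few categories is boosted.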

Korea National College of Agriculture and Fisheries in Naver News by Web Crolling : Based on Keyword Analysis and Semantic Network Analysis (웹 크롤링에 의한 네이버 뉴스에서의 한국농수산대학 - 키워드 분석과 의미연결망분석 -)

  • Joo, J.S.;Lee, S.Y.;Kim, S.H.;Park, N.B.
    • Journal of Practical Agriculture & Fisheries Research / v.23 no.2 / pp.71-86 / 2021
  • This study was conducted to find information on the university's image from words related to 'Korea National College of Agriculture and Fisheries (KNCAF)' in Naver News. For this purpose, word frequency analysis, TF-IDF evaluation, and semantic network analysis were performed on articles collected by web crawling. In the word frequency analysis, 'agriculture', 'education', 'support', 'farmer', 'youth', 'university', 'business', 'rural', and 'CEO' were important words. In the TF-IDF evaluation, the key words were 'farmer', 'drone', 'Ministry of Agriculture, Food and Rural Affairs', 'Jeonbuk', 'young farmer', 'agriculture', 'Jeonju', 'university', 'device', and 'spreading'. In the semantic network analysis, the bigrams showed high correlations in the order 'youth' - 'farmer', 'digital' - 'agriculture', 'farming' - 'settlement', 'agriculture' - 'rural', 'digital' - 'turnover'. When the importance of keywords was evaluated with five centrality indices, 'agriculture' ranked first. The keywords in second place were 'farmers' (Cc, Cb), 'education' (Cd, Cp), and 'future' (Ce). Spearman's rank correlation coefficients between the centrality indices showed that Degree centrality and Pagerank centrality produced the most similar rankings. In terms of word frequency, the important words in the KNCAF articles of Naver News were 'agriculture', 'education', 'support', 'farmer', and 'youth'. In the evaluation that also considers document frequency, however, words such as 'farmer', 'drone', 'Ministry of Agriculture, Food and Rural Affairs', 'Jeonbuk', and 'young farmers' were found to be key words. For the centrality analysis, which considers the network connectivity between words, Cd and Cp were suitable for evaluation. The words with strong centrality were 'agriculture', 'education', 'future', 'farmer', 'digital', 'support', and 'utilization'.

Design of WWW IR System Based on Keyword Clustering Architecture (색인어 말뭉치 처리를 기반으로 한 웹 정보검색 시스템의 설계)

  • 송점동;이정현;최준혁
    • The Journal of Information Technology / v.1 no.1 / pp.13-26 / 1998
  • In general information retrieval systems, improper keywords are often extracted and the search results differ from the user's intent, because the systems use only term frequency information to select keywords and do not consider their meanings. This implies that precision can be improved only to a limited extent unless keyword semantics are considered, because recall and precision are inversely related. In this paper, a system that improves precision without decreasing recall is designed and implemented by introducing a client user module that sends feedback reflecting the user's intention to the server. For this purpose, keywords are selected using relative term frequency and inverse document frequency, and co-occurring words are extracted from the original documents. The keywords are then clustered by their semantics using the calculated mutual information. The system can reject inappropriate documents using segmented semantic information according to feedback from the client user module. Consequently, the precision of the system is improved without decreasing recall.
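
The mutual information step mentioned in this abstract could be approximated as in the sketch below. It uses pointwise mutual information over document-level co-occurrence, which is an assumption; the abstract does not say which mutual information formula or co-occurrence window the system uses. The sample documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pointwise_mutual_information(documents):
    """PMI (log base 2) for every pair of terms that co-occur in a document."""
    n_docs = len(documents)
    term_df = Counter()
    pair_df = Counter()
    for doc in documents:
        terms = sorted(set(doc))  # sorted so each pair has one canonical order
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): math.log2(
            (count / n_docs) / ((term_df[a] / n_docs) * (term_df[b] / n_docs))
        )
        for (a, b), count in pair_df.items()
    }

# Hypothetical keyword sets extracted from four documents.
docs = [
    ["apple", "fruit"],
    ["apple", "fruit"],
    ["apple", "computer"],
    ["computer", "keyboard"],
]
pmi = pointwise_mutual_information(docs)
```

Pairs with a high positive score, such as ("computer", "keyboard") here, co-occur more often than chance would predict and are candidates for the same semantic cluster; pairs with a negative score co-occur less often than chance.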

Convolutional Neural Network-based Malware Classification Method utilizing Local Feature-based Global Image (로컬 특징 기반 글로벌 이미지를 사용한 CNN 기반의 악성코드 분류 방법)

  • Jang, Sejun;Sung, Yunsick
    • Proceedings of the Korea Information Processing Society Conference / 2020.05a / pp.222-223 / 2020
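  • Damage caused by malware has been increasing recently. Because the appropriate response differs depending on the type of malware, research on classifying malware by type is also important. Existing approaches classify malware by type using a global image of the malware generated through a malware visualization process. The global image is generated from binary information extracted from the malware. However, when malware is classified using only the global image, classification accuracy drops because the features that are important for each malware type are not considered. This paper proposes a malware classification method that uses a local feature-based global image, which represents the type-specific features of the malware within its global image. First, a binary is extracted from the malware and used to generate a global image. Second, local features are extracted from the malware, and the key local features of each malware type are selected using the Term Frequency Inverse Document Frequency (TFIDF) algorithm. Third, the key features of each malware family are converted to pixels and applied to the generated global image. Fourth, a convolutional model is trained on the resulting local feature-based global images, and the trained model is used to classify malware by type.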