• Title/Summary/Keyword: TF-IDF keyword extraction

Design and Implementation of Paper Classification Systems based on Keyword Extraction and Clustering (키워드 추출과 군집화 기반의 논문 분류 시스템의 설계 및 구현)

  • Lee, Yun-Soo;Pheaktra, They;Lee, Jong-Hyuk;Gil, Joon-Min
    • Proceedings of the Korea Information Processing Society Conference / 2018.05a / pp.48-51 / 2018
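  • Driven by advances in computing and technology, a vast number of papers are published online as well as offline, and new fields continue to emerge, so users have great difficulty searching for or classifying the papers they need among the enormous volume available. To overcome this limitation, this paper proposes a method that classifies papers with similar content and clusters them. The proposed method uses TF-IDF to extract representative keywords from each paper's abstract, and then uses the K-means clustering algorithm to group papers with similar content based on the extracted TF-IDF values.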
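
The pipeline described in this abstract (TF-IDF weights per abstract, then K-means over those weights) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the smoothed IDF variant, the deterministic farthest-point initialization, the helper names, and the toy tokenized abstracts are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for non-empty tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; the abstract does not name one).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors


def top_terms(vocab, vector, n=2):
    """Pick the n highest-weighted terms as representative keywords."""
    ranked = sorted(zip(vocab, vector), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, weight in ranked[:n] if weight > 0]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-tokenized abstracts used only for illustration.
abstracts = [
    ["tfidf", "keyword", "extraction"],
    ["keyword", "extraction", "clustering"],
    ["music", "mood", "lyrics"],
    ["music", "lyrics", "recommendation"],
]
vocab, vectors = tfidf_vectors(abstracts)
labels = kmeans(vectors, k=2)
for doc_vector, label in zip(vectors, labels):
    print(label, top_terms(vocab, doc_vector))
```

A real system would first need a tokenizer suited to the language of the abstracts (for Korean, a morphological analyzer), and the number of clusters `k` would have to be chosen for the collection at hand.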

Music Recommendation based on Blog Keyword Extraction (블로그 키워드 추출을 통한 음악 추천 기법)

  • Choi, Hong-gu;Jun, Sanghoon;Hwang, Eenjun
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.701-704 / 2010
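  • This paper proposes a technique that extracts key keywords from blog posts, analyzes their similarity to song lyric data, and recommends music suited to each post. The tags that bloggers attach to each post are also used as key keywords. To this end, the TF-IDF technique is first used to extract the important keywords from the text of a post. Second, based on the post's tags and the extracted keywords, similar song lyrics are retrieved using LSA, and the music with the highest similarity is selected and recommended. A user satisfaction experiment verifies whether the proposed technique is suitable for actual recommendation.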
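
The two steps in this abstract (TF-IDF keyword extraction from a post, then matching keywords and tags against lyrics) might look roughly like the sketch below. Note one deliberate substitution: the paper retrieves lyrics with LSA, whereas this example uses plain bag-of-words cosine similarity as a simpler stand-in. The function names, the smoothed IDF variant, and the toy posts and lyrics are hypothetical.

```python
import math
from collections import Counter


def top_keywords(post, corpus, k=3):
    """Rank the terms of `post` by TF-IDF computed against `corpus`."""
    n_docs = len(corpus)
    # Document frequency over the whole collection of posts.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(post)
    scores = {
        term: (count / len(post)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def cosine(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(post, tags, corpus, lyrics):
    """Return the song title whose lyrics best match keywords plus tags.

    Stand-in for the LSA retrieval step: plain cosine similarity, not LSA.
    """
    query = Counter(top_keywords(post, corpus) + list(tags))
    return max(lyrics, key=lambda title: cosine(query, Counter(lyrics[title])))


# Hypothetical tokenized blog posts and lyrics, for illustration only.
post = ["rain", "rain", "window", "coffee", "today"]
corpus = [post, ["today", "work", "coffee"], ["today", "lunch", "work"]]
lyrics = {
    "Rainy Song": ["rain", "falls", "on", "my", "window", "sad"],
    "Party Song": ["dance", "all", "night", "today"],
}
print(top_keywords(post, corpus))
print(recommend(post, ["sad"], corpus, lyrics))
```

Replacing `cosine` over raw term counts with similarity in a reduced latent space (a truncated SVD of the term-document matrix) would bring the sketch closer to the LSA step the abstract names.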

Query Expansion based on Word Sense Community (유사 단어 커뮤니티 기반의 질의 확장)

  • Kwak, Chang-Uk;Yoon, Hee-Geun;Park, Seong-Bae
    • Journal of KIISE / v.41 no.12 / pp.1058-1065 / 2014
  • To assist users who are executing a search, a query expansion method suggests keywords related to the input query. Several recent studies suggest keywords by identifying domains with a clustering method applied to the retrieved documents. However, clustering is poorly suited to presenting various domains because the number of clusters must be fixed. This paper proposes a method that suggests keywords by finding the various domains related to an input query with a community detection algorithm. The proposed method extracts words from the top 30 retrieved documents and builds communities from the resulting word graph. Keywords representing each community are then derived and used for query expansion. To evaluate the proposed method, we compared its results with two baselines: searches performed by the Google search engine and keyword recommendation using TF-IDF over the search results. The evaluation indicates that the proposed method outperforms the baselines with respect to diversity.

Korea National College of Agriculture and Fisheries in Naver News by Web Crolling : Based on Keyword Analysis and Semantic Network Analysis (웹 크롤링에 의한 네이버 뉴스에서의 한국농수산대학 - 키워드 분석과 의미연결망분석 -)

  • Joo, J.S.;Lee, S.Y.;Kim, S.H.;Park, N.B.
    • Journal of Practical Agriculture & Fisheries Research / v.23 no.2 / pp.71-86 / 2021
  • This study was conducted to find information on the university's image from words related to 'Korea National College of Agriculture and Fisheries (KNCAF)' in Naver News. For this purpose, word frequency analysis, TF-IDF evaluation, and semantic network analysis were performed on data collected by web crawling. In the word frequency analysis, 'agriculture', 'education', 'support', 'farmer', 'youth', 'university', 'business', 'rural', and 'CEO' were important words. In the TF-IDF evaluation, the key words were 'farmer', 'dron', 'agricultural and livestock food department', 'Jeonbuk', 'young farmer', 'agriculture', 'Chonju', 'university', 'device', and 'spreading'. In the semantic network analysis, the bigrams showed high correlations in the order 'youth' - 'farmer', 'digital' - 'agriculture', 'farming' - 'settlement', 'agriculture' - 'rural', and 'digital' - 'turnover'. When keyword importance was evaluated with five centrality indices, 'agriculture' ranked first on all of them. The keywords in second place were 'farmers' (Cc, Cb), 'education' (Cd, Cp), and 'future' (Ce). Spearman's rank correlation coefficient between centrality indices showed that Degree centrality and PageRank centrality produced the most similar rankings. In terms of word frequency, words such as 'agriculture', 'education', 'support', 'farmer', and 'youth' were important in the KNCAF articles of Naver News. However, in the evaluation that also accounts for document frequency, words such as 'farmer', 'dron', 'Ministry of Agriculture, Food and Rural Affairs', 'Jeonbuk', and 'young farmers' were found to be key words. For the centrality analysis, which considers the network connectivity between words, Cd and Cp were suitable for evaluation. The words with strong centrality were 'agriculture', 'education', 'future', 'farmer', 'digital', 'support', and 'utilization'.
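
To make the comparison between Degree centrality and PageRank centrality concrete, the sketch below computes both on a tiny word graph and then measures how similar the two rankings are with Spearman's rank correlation. It is an illustration only. The edge list reuses the five bigram pairs named in the abstract as unweighted, undirected edges, which is an assumption; the study's actual network, weights, and software are not given here, and the damping factor of 0.85 is just the customary default.

```python
import math


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration on an undirected graph in which every node has an edge."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            for node in graph
        }
    return rank


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for position in range(i, j + 1):
            ranks[order[position]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks (undefined if either list is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Illustrative edges: the bigram pairs from the abstract, treated as unweighted.
edges = [("youth", "farmer"), ("digital", "agriculture"),
         ("farming", "settlement"), ("agriculture", "rural"),
         ("digital", "turnover")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

deg = degree_centrality(graph)
pr = pagerank(graph)
nodes = sorted(graph)
rho = spearman([deg[n] for n in nodes], [pr[n] for n in nodes])
print(round(rho, 3))
```

On an undirected graph, PageRank is driven largely by node degree, which is consistent with the abstract's observation that these two indices rank words most similarly.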

A Term Weight Mensuration based on Popularity for Search Query Expansion (검색 질의 확장을 위한 인기도 기반 단어 가중치 측정)

  • Lee, Jung-Hun;Cheon, Suh-Hyun
    • Journal of KIISE:Software and Applications / v.37 no.8 / pp.620-628 / 2010
  • With Internet use pervasive in everyday life, people can now retrieve a great deal of information through the web. However, the exponential growth in the quantity of information on the web has limited the performance of online search engines, which return large amounts of unwanted information. As a result, web users now need more time and effort than in the past to find the information they need. This paper suggests a query expansion method to bring the desired information to web users quickly. In experiments where the search subject did not change, Popularity based Term Weight Mensuration performed better than TF-IDF and Simple Popularity Term Weight Mensuration. When the subject changed during the search, the performance of Popularity based Term Weight Mensuration changed less than that of the other methods.

Analysis of Major COVID-19 Issues Using Unstructured Big Data (비정형 빅데이터를 이용한 COVID-19 주요 이슈 분석)

  • Kim, Jinsol;Shin, Donghoon;Kim, Heewoong
    • Knowledge Management Research / v.22 no.2 / pp.145-165 / 2021
  • In late December 2019, COVID-19 began to spread, putting the entire world in panic. To overcome the crisis and minimize subsequent damage, the government and its affiliated institutions must maximize the effects of existing policy support and introduce a holistic response plan that reflects the changing situation, which is why it is crucial to analyze social topics and people's interests. This study investigates people's major thoughts, attitudes, and topics surrounding the COVID-19 pandemic using social media and big data. To collect public opinion, the study segmented the time period according to government countermeasures. All data were collected from NAVER blogs from 31 December 2019 to 12 December 2020. The research applied TF-IDF keyword extraction and LDA topic modeling as text-mining techniques. As a result, eight major issues related to COVID-19 were derived, and policy strategies were presented based on these keywords. The significance of this study is that it provides baseline data that Korean government authorities can use to design countermeasures that meet people's needs during the COVID-19 pandemic.

An Exploratory Study of Happiness and Unhappiness Among Koreans based on Text Mining Techniques (텍스트마이닝 기법을 활용한 한국인의 행복과 불행 탐색연구)

  • Park, Sanghyeon;Do, Kanghyuk;Kim, Hakyeong;Park, Gaeun;Yun, Jinhyeok;Kim, Kyungil
    • The Journal of the Korea Contents Association / v.18 no.7 / pp.10-27 / 2018
  • The purpose of this study is to explore the meaning of happiness and unhappiness in Korean society through text mining analysis. Words similar to the keywords (happiness/unhappiness) are extracted from an online news portal using Word2Vec and TF-IDF. We also use the K-LIWC dictionary to perform sentiment analysis of the words associated with happiness and unhappiness. In the TF-IDF analysis, happiness and unhappiness are highly related to the social factors and social issues of each year. In the Word2Vec analysis, 'Hope' remained similar to happiness over six years. In the K-LIWC analysis, 'money/financial issues', 'school', and 'communication' are highly related to both happiness and unhappiness. In addition, 'physical condition and symptom' is highly related to unhappiness. Implications, limitations, and suggestions for future research are also discussed.

Metadata extraction using AI and advanced metadata research for web services (AI를 활용한 메타데이터 추출 및 웹서비스용 메타데이터 고도화 연구)

  • Sung Hwan Park
    • The Journal of the Convergence on Culture Technology / v.10 no.2 / pp.499-503 / 2024
  • Broadcast programs are delivered to various media such as Internet replay, OTT, and IPTV services in addition to the broadcaster's own channels. In this setting, it is very important to provide search keywords that represent the characteristics of the content well. Broadcasters mainly enter key terms manually during production and archiving. This approach does not yield enough core metadata, and it also limits content recommendation and reuse in other media services. This study supports securing a large amount of metadata by using closed caption data pre-archived through the DTV closed captioning server developed at EBS. First, core metadata was extracted automatically by applying Google's natural language AI technology. Next, as its main contribution, the study proposes a method of finding core metadata that reflects priorities and content characteristics. To obtain differentiated metadata weights, importance was classified by applying the TF-IDF calculation method. The experiment successfully produced weight data. Combined with future work on string similarity measurement, the string metadata obtained in this study can serve as the basis for securing sophisticated content recommendation metadata for content services provided to other media.

Multi-Modal Scheme for Music Mood Classification (멀티 모달 음악 무드 분류 기법)

  • Choi, Hong-Gu;Jun, Sang-Hoon;Hwang, Een-Jun
    • Proceedings of the Korean Information Science Society Conference / 2011.06a / pp.259-262 / 2011
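  • Recently, research on music mood classification based on various music signal features such as loudness, harmony, tempo, and rhythm has been actively pursued. To improve the accuracy of music mood classification, this paper proposes a multi-modal music mood classification scheme that considers song lyrics and user ratings on social networks together with music signal features. First, a fuzzy-inference-based mood extraction technique is applied to the music signal features to extract multiple candidate moods. Next, TF-IDF is applied to the lyrics to extract representative emotion keywords, and a lyrics mood classifier trained on them is used to extract the lyrics-based mood. Finally, a mood is extracted from user feedback such as user tags on social networks. For a given piece of music, the moods obtained through these different channels are cross-analyzed to determine the final mood. The effectiveness of the proposed scheme is verified through a user satisfaction experiment in which automatic music recommendation is performed on the basis of the music classification.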

Design and Implementation of PMSL for Information Retrieval (의미있는 정보 검색을 위한 개인화된 다중 전략 학습 모듈의 설계 및 구현)

  • 유수경;김교정
    • Proceedings of the Korean Information Science Society Conference / 2004.04b / pp.208-210 / 2004
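  • Today, the large amount of information on the Internet needs to be delivered as new knowledge tailored to the individual characteristics of diverse users. Previous studies extracted information with a single learning technique; to provide users with more meaningful information, this paper proposes the PMSL (Personalized Multi-Strategy Learning) module system, a multi-strategy learning approach. The PMSL module filters information from the Internet and extracts associated objects centered on the user's personalized keywords. To extract the associated objects, it applies the FP-Tree and FP-Growth algorithms, association mining techniques that are efficient in time and space on large data, in order to improve efficiency, and it applies the TF*IDF weighting technique to compensate for the shortcomings of association rules. Running the PMSL module extracted more meaningful associative knowledge than existing learning techniques.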
