• Title/Summary/Keyword: TF-IDF Keyword Extraction

Search Result 15, Processing Time 0.022 seconds

A Study on the Deduction of Social Issues Applying Word Embedding: With an Empasis on News Articles related to the Disables (단어 임베딩(Word Embedding) 기법을 적용한 키워드 중심의 사회적 이슈 도출 연구: 장애인 관련 뉴스 기사를 중심으로)

  • Choi, Garam;Choi, Sung-Pil
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.1
    • /
    • pp.231-250
    • /
    • 2018
  • In this paper, we propose a new methodology for extracting and formalizing subjective topics at a specific time using a set of keywords extracted automatically from online news articles. To do this, we first extracted a set of keywords by applying TF-IDF methods selected by a series of comparative experiments on various statistical weighting schemes that can measure the importance of individual words in a large set of texts. In order to effectively calculate the semantic relation between extracted keywords, a set of word embedding vectors was constructed by using about 1,000,000 news articles collected separately. Individual keywords extracted were quantified in the form of numerical vectors and clustered by K-means algorithm. As a result of qualitative in-depth analysis of each keyword cluster finally obtained, we witnessed that most of the clusters were evaluated as appropriate topics with sufficient semantic concentration for us to easily assign labels to them.

Analysis of Major COVID-19 Issues Using Unstructured Big Data (비정형 빅데이터를 이용한 COVID-19 주요 이슈 분석)

  • Kim, Jinsol;Shin, Donghoon;Kim, Heewoong
    • Knowledge Management Research
    • /
    • v.22 no.2
    • /
    • pp.145-165
    • /
    • 2021
  • As of late December 2019, the spread of COVID-19 pandemic began which put the entire world in panic. In order to overcome the crisis and minimize any subsequent damage, the government as well as its affiliated institutions must maximize effects of pre-existing policy support and introduce a holistic response plan that can reflect this changing situation- which is why it is crucial to analyze social topics and people's interests. This study investigates people's major thoughts, attitudes and topics surrounding COVID-19 pandemic through the use of social media and big data. In order to collect public opinion, this study segmented time period according to government countermeasures. All data were collected through NAVER blog from 31 December 2019 to 12 December 2020. This research applied TF-IDF keyword extraction and LDA topic modeling as text-mining techniques. As a result, eight major issues related to COVID-19 have been derived, and based on these keywords, this research presented policy strategies. The significance of this study is that it provides a baseline data for Korean government authorities in providing appropriate countermeasures that can satisfy needs of people in the midst of COVID-19 pandemic.

Performance Comparison of Keyword Extraction Methods for Web Document Cluster using Suffix Tree Clustering (Suffix Tree를 이용한 웹 문서 클러스터의 제목 생성 방법 성능 비교)

  • 염기종;권영식
    • Proceedings of the Korea Inteligent Information System Society Conference
    • /
    • 2002.11a
    • /
    • pp.328-335
    • /
    • 2002
  • 최근 들어 인터넷 기술의 발달로 웹 상에 많은 자료들이 산재해 있습니다. 사용자가 원하는 정보를 검색하기 위해서 키워드 검색을 이용하고 있는데 이러한 키워드 검색은 사용자들이 입력한 단편적인 정보에 바탕하여 검색하고 검색된 결과들을 자체적인 기준으로 순위를 매겨 나열식으로 제시하고 있다. 이러한 경우 사용자들의 생각과는 다르게 결과가 제시될 수 있다. 따라서 사용자들의 검색 시간을 줄이고 편리하게 검색하기 위한 환경의 필요성이 높아지고 있다. 본 논문에서는 Suffix Tree 알고리즘을 사용하여 관련있는 문서들을 분류하고 각각의 분류된 클러스터에 제목을 생성하기 위하여 문서 빈도수, 단어 빈도수와 역문서 빈도수, 카이 검정, 공통 정보, 엔트로피 방법을 비교 평가하여 제목을 생성하는데 어떠한 방법이 가장 효과적인지 알아보기 위해 비교 평가해본 결과 문서빈도수가 TF-IDF보다 약 10%정도 성능이 좋은 결과를 보여주었다.

  • PDF

Music Recommendation based on Blog Keyword Extraction (블로그 키워드 추출을 통한 음악 추천 기법)

  • Choi, Hong-gu;Jun, Sanghoon;Hwang, Eenjun
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2010.11a
    • /
    • pp.701-704
    • /
    • 2010
  • 본 논문에서는 블로그의 포스트로부터 주요 키워드를 추출하여 노래 가사 데이터와 유사도를 분석, 해당 블로그 포스트에 적합한 음악을 추천하는 기법을 제안한다. 또한, 블로거가 포스트마다 제시한 태그들도 주요한 키워드로서 활용한다. 이를 위해서, 첫째로 TF-IDF 기법을 사용하여 텍스트로 구성된 포스트의 중요 키워드를 추출한다. 둘째로 포스트의 태그와 추출된 키워드를 기반으로 유사한 노래 가사를 LSA 기법으로 검색하여 가장 높은 유사도를 갖는 음악을 선택, 적합한 음악으로써 추천한다. 사용자 만족도 평가 실험을 통해서 제안하는 기법이 실제 추천에 적합한지 검증한다.

Export Control System based on Case Based Reasoning: Design and Evaluation (사례 기반 지능형 수출통제 시스템 : 설계와 평가)

  • Hong, Woneui;Kim, Uihyun;Cho, Sinhee;Kim, Sansung;Yi, Mun Yong;Shin, Donghoon
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.109-131
    • /
    • 2014
  • As the demand of nuclear power plant equipment is continuously growing worldwide, the importance of handling nuclear strategic materials is also increasing. While the number of cases submitted for the exports of nuclear-power commodity and technology is dramatically increasing, preadjudication (or prescreening to be simple) of strategic materials has been done so far by experts of a long-time experience and extensive field knowledge. However, there is severe shortage of experts in this domain, not to mention that it takes a long time to develop an expert. Because human experts must manually evaluate all the documents submitted for export permission, the current practice of nuclear material export is neither time-efficient nor cost-effective. Toward alleviating the problem of relying on costly human experts only, our research proposes a new system designed to help field experts make their decisions more effectively and efficiently. The proposed system is built upon case-based reasoning, which in essence extracts key features from the existing cases, compares the features with the features of a new case, and derives a solution for the new case by referencing similar cases and their solutions. Our research proposes a framework of case-based reasoning system, designs a case-based reasoning system for the control of nuclear material exports, and evaluates the performance of alternative keyword extraction methods (full automatic, full manual, and semi-automatic). A keyword extraction method is an essential component of the case-based reasoning system as it is used to extract key features of the cases. The full automatic method was conducted using TF-IDF, which is a widely used de facto standard method for representative keyword extraction in text mining. TF (Term Frequency) is based on the frequency count of the term within a document, showing how important the term is within a document while IDF (Inverted Document Frequency) is based on the infrequency of the term within a document set, showing how uniquely the term represents the document. The results show that the semi-automatic approach, which is based on the collaboration of machine and human, is the most effective solution regardless of whether the human is a field expert or a student who majors in nuclear engineering. Moreover, we propose a new approach of computing nuclear document similarity along with a new framework of document analysis. The proposed algorithm of nuclear document similarity considers both document-to-document similarity (${\alpha}$) and document-to-nuclear system similarity (${\beta}$), in order to derive the final score (${\gamma}$) for the decision of whether the presented case is of strategic material or not. The final score (${\gamma}$) represents a document similarity between the past cases and the new case. The score is induced by not only exploiting conventional TF-IDF, but utilizing a nuclear system similarity score, which takes the context of nuclear system domain into account. Finally, the system retrieves top-3 documents stored in the case base that are considered as the most similar cases with regard to the new case, and provides them with the degree of credibility. With this final score and the credibility score, it becomes easier for a user to see which documents in the case base are more worthy of looking up so that the user can make a proper decision with relatively lower cost. The evaluation of the system has been conducted by developing a prototype and testing with field data. The system workflows and outcomes have been verified by the field experts. This research is expected to contribute the growth of knowledge service industry by proposing a new system that can effectively reduce the burden of relying on costly human experts for the export control of nuclear materials and that can be considered as a meaningful example of knowledge service application.