• Title/Summary/Keyword: Keyword extraction

Search Result 190, Processing Time 0.025 seconds

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.83-96 / 2018
  • Extracting keywords that represent documents is very important because they can be used for automated services such as document search, classification, and recommendation, as well as for conveying document information quickly. However, extracting keywords from web site documents using word frequency or graph algorithms based on word co-occurrence has two problems: the web page structure can contain many words unrelated to the topic, and the limited performance of Korean tokenizers makes semantic keywords difficult to extract. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing both the failure to extract semantic keywords and the poor accuracy of Korean tokenizer analysis. A final filtering step then removes inconsistent keywords to yield the final semantic keywords. Experiments on real web pages of small businesses show that the proposed method improves performance by 34.52% over a keyword selection technique based on statistical similarity. This confirms that considering semantic similarity between words and removing inconsistent keywords improves keyword extraction from documents.
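
The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are invented stand-ins for trained Word2Vec embeddings, and the function names and the 0.5 threshold are assumptions made for illustration. Candidates are kept only when their cosine similarity to a topic vector is high enough, which drops page-structure words such as "login".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_keywords(candidates, vectors, topic_vector, threshold=0.5):
    """Keep candidates whose similarity to the topic meets the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    scored = [(w, cosine(vectors[w], topic_vector)) for w in candidates if w in vectors]
    kept = [(w, s) for w, s in scored if s >= threshold]
    # Most similar candidates first.
    return [w for w, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]


# Toy embeddings invented for this example (assumption, not real Word2Vec output).
TOY_VECTORS = {
    "bakery": [0.9, 0.1, 0.0],
    "bread": [0.8, 0.2, 0.1],
    "login": [0.0, 0.1, 0.9],
    "copyright": [0.1, 0.0, 0.8],
}

if __name__ == "__main__":
    topic = TOY_VECTORS["bakery"]
    print(select_keywords(["bread", "login", "copyright", "bakery"], TOY_VECTORS, topic))
```

In practice the vectors would come from a model trained on the target corpus, and the threshold would need tuning against labeled keywords.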

Rotation and Translation Invariant Feature Extraction Using Angular Projection in Frequency Domain (주파수 영역에서 각도 투영법을 이용한 회전 및 천이 불변 특징추출)

  • Lee, Bum-Shik;Kim, Mun-Churl
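    • Proceedings of the HCI Society of Korea Conference (한국HCI학회 학술대회논문집) / 2006.02a / pp.699-704 / 2006
  • This paper introduces a new method for rotation- and translation-invariant image texture retrieval. We propose an angular projection method that projects along the angular direction at each fixed spatial frequency in the polar coordinate system of the frequency domain, and we use the sum and standard deviation of the Fourier coefficients obtained through this projection as the feature vector. To make the angular projection easy to implement, a Radon transform is performed in polar coordinates. Experiments on MPEG-7 data show that the features discriminate well when retrieving a variety of texture images. The proposed rotation- and translation-invariant feature extraction algorithm also achieves an effective retrieval rate for isotropic textures and for textures with local directionality.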

Thematic Word Extraction from Book Based on Keyword Weighting Method (키워드 가중치 방식에 근거한 도서 본문 주제어 추출)

  • Ahn, Hee-Jeong;Choi, Gun-Hee;Kim, Seung-Hoon
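    • Proceedings of the Korean Society of Computer Information Conference / 2015.01a / pp.19-22 / 2015
  • This paper proposes a method for extracting thematic words from the body text of books, based on weights that reflect the role a keyword plays within sentences and paragraphs. Existing thematic word extraction methods target newspapers or academic papers rather than book text, so they are difficult to apply directly to books. We therefore propose assigning each candidate keyword not only its frequency but also a weight for important elements within a sentence and a weight for important sentences. In experiments on non-fiction books, the proposed method extracted thematic words more accurately than the existing approach that relies on frequency alone.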
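
As a rough sketch of weighting by sentence importance rather than raw frequency: the rule below, which counts the first sentence of each paragraph double, is a hypothetical stand-in, since the abstract does not say which sentences or sentence elements the paper treats as important. The function name and weight values are likewise assumptions.

```python
import re
from collections import Counter


def thematic_scores(paragraphs, lead_weight=2.0, body_weight=1.0):
    """Score each token by summing the weight of every sentence it occurs in.

    Assumption: the first sentence of a paragraph is the "important" one.
    """
    scores = Counter()
    for paragraph in paragraphs:
        # Naive sentence split on terminal punctuation; empty pieces are dropped.
        sentences = [s for s in re.split(r"[.!?]+\s*", paragraph) if s]
        for index, sentence in enumerate(sentences):
            weight = lead_weight if index == 0 else body_weight
            for token in re.findall(r"[a-z]+", sentence.lower()):
                scores[token] += weight
    return scores


if __name__ == "__main__":
    book = [
        "Soil matters. Water helps soil.",
        "Water is scarce. Plants need water.",
    ]
    for word, score in thematic_scores(book).most_common(3):
        print(word, score)
```

With plain frequency, "water" would score 3 and "soil" 2; the sentence weights raise them to 4 and 3. The sketch applies no stop-word filtering and its tokenizer handles only lowercase ASCII letters, so Korean text would need a morphological analyzer instead.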

A Design of KP AGENT for Intelligent Information Retrieval (지능형 정보검색을 위한 KP AGENT의 설계)

  • 박경우;배상현
    • Journal of the Korea Institute of Information and Communication Engineering / v.4 no.2 / pp.443-451 / 2000
  • Until now, various science information databases have compiled science and technology information, but they have not met users' expectations. From the users' standpoint, this paper therefore suggests the technology information space as a new paradigm that supplements the function of science information databases. ICPIS, which takes as input papers described with keywords, offers itemized summaries of their contents and a visual display and comparison of similar papers. It also supplies abundant summary and survey information for more than ten volumes of information and communication papers, starting with causal relation extraction. The component that plays a significant role in ICPIS is called KP, a package of domain knowledge that unifies the extraction and structural description of technology information. ICPIS uses natural language processing to extract technology information from papers according to the itemized KP keywords, and forms the summary structure prescribed in KP.

Representative Keyword Extraction from Few Documents through Fuzzy Inference (퍼지추론을 이용한 소수 문서의 대표 키워드 추출)

  • 노순억;김병만;허남철
    • Journal of the Korean Institute of Intelligent Systems / v.11 no.9 / pp.837-843 / 2001
  • In this work, we propose a new method of extracting and weighting representative keywords (RKs) from a few documents that might interest a user. To extract RKs, we first extract candidate terms and then choose from them, through fuzzy inference, a number of terms called initial representative keywords (IRKs). The final RKs are then obtained by expanding and reweighting the IRKs using term co-occurrence similarity. The performance of our approach depends heavily on how effectively the IRKs are selected, so we choose fuzzy inference because it handles the uncertainty inherent in selecting representative keywords of documents more effectively. The problem addressed in this paper can be viewed as one of calculating the center of document vectors. To show the usefulness of our approach, we therefore compare it with two well-known methods, Rocchio and Widrow-Hoff, on a number of document collections. The results show that our approach outperforms the other approaches.
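
The "center of document vectors" framing can be made concrete with a small sketch. This is only a plain centroid of length-normalized term-frequency vectors, in the spirit of the positive-example part of Rocchio. It is not the fuzzy-inference method the paper proposes, and the function name and normalization are assumptions made for illustration.

```python
from collections import Counter


def centroid(docs):
    """Average the length-normalized term frequencies over tokenized documents."""
    usable = [tokens for tokens in docs if tokens]  # skip empty documents
    if not usable:
        return {}
    total = Counter()
    for tokens in usable:
        for term, count in Counter(tokens).items():
            total[term] += count / len(tokens)
    return {term: weight / len(usable) for term, weight in total.items()}


if __name__ == "__main__":
    docs = [["fuzzy", "logic", "fuzzy"], ["fuzzy", "sets"]]
    center = centroid(docs)
    # Terms with the largest centroid weight serve as representative keywords.
    for term in sorted(center, key=center.get, reverse=True):
        print(term, round(center[term], 3))
```

The highest-weighted terms of the centroid act as a baseline set of representative keywords, which is the kind of result the abstract's comparison methods produce.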

Comparative Analysis of Work-Life Balance Issues between Korea and the United States (워라밸 이슈 비교 분석: 한국과 미국)

  • Lee, So-Hyun;Kim, Minsu;Kim, Hee-Woong
    • The Journal of Information Systems / v.28 no.2 / pp.153-179 / 2019
  • Purpose: This study collects work-life balance issues in Korea and the United States and, by comparing and analyzing them, suggests specific plans for work-life balance. Its objective is to contribute to improving people's quality of life by clarifying the concept of work-life balance, which has recently become an issue, and by offering detailed plans to be considered at the individual, corporate, and governmental levels for a society with work-life balance. Design/methodology/approach: This study collects work-life balance issues from recruiting sites in Korea and the United States, then compares and analyzes the collected data using three text mining techniques: LDA topic modeling, term frequency analysis, and keyword extraction analysis. Findings: The text mining results show that, especially in Korea, it is important to build a corporate culture that supports work-life balance in a free organizational atmosphere. Whether work-life balance can be achieved, and how it is perceived and how satisfied people are with it, also appear to differ by type of company and type of work. In the United States, the results show the importance of working more efficiently by raising the level of teamwork among team members, as well as the role of the leaders who lead teams in the organization. It is also significant for companies to provide employees with education and training opportunities that improve their individual capabilities and skills. Finally, the study suggests roles for individuals, companies, and government, along with specific plans, based on the text mining results in both countries.

KR-WordRank : An Unsupervised Korean Word Extraction Method Based on WordRank (KR-WordRank : WordRank를 개선한 비지도학습 기반 한국어 단어 추출 방법)

  • Kim, Hyun-Joong;Cho, Sungzoon;Kang, Pilsung
    • Journal of Korean Institute of Industrial Engineers / v.40 no.1 / pp.18-33 / 2014
  • A word is the smallest unit for text analysis, and the premise behind most text-mining algorithms is that the words in given documents can be perfectly recognized. However, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining enough training data to cover every situation is both unrealistic and inefficient. An automatic word extraction method that requires no training process is therefore badly needed. WordRank, the most widely used unsupervised word extraction algorithm for Chinese and Japanese, performs poorly on Korean because of differences in language structure. In this paper, we first discuss why WordRank performs poorly on Korean, and then propose a WordRank algorithm customized for Korean, named KR-WordRank, which takes Korean linguistic characteristics into account and is more robust to noise in text documents. Experimental results show that KR-WordRank performs significantly better than the original WordRank on Korean. In addition, the proposed algorithm not only extracts proper words but also identifies candidate keywords for effective document summarization.

Keyword Weight based Paragraph Extraction Algorithm (문단 가중치 분석 기반 본문 영역 선정 알고리즘)

  • Lee, Jongwon;Yu, Seongjong;Kim, Doan;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2018.05a / pp.462-463 / 2018
  • Traditional document analysis systems perform word-based analysis using a morphological analyzer or the TF-IDF technique. These systems have the advantage of deriving core keywords by calculating keyword weights. However, their structural limitations make them unsuitable for analyzing the contents of documents. To solve this problem, the proposed algorithm calculates weights within the document and divides the paragraphs into areas. It then calculates the importance of each area and shows the user the area containing the most important paragraphs in the document. The user is therefore expected to receive a service better suited to analyzing documents than existing document analysis systems provide.

Contents Analysis and Synthesis Scheme for Music Album Cover Art

  • Moon, Dae-Jin;Rho, Seung-Min;Hwang, Een-Jun
    • Journal of IKEEE / v.14 no.4 / pp.305-311 / 2010
  • Most recent web search engines perform effective keyword-based multimedia contents retrieval by investigating keywords associated with multimedia contents on the Web and comparing them with query keywords. On the other hand, most music and compilation albums provide professional artwork as cover art that will be displayed when the music is played. If the cover art is not available, then the music player just displays some dummy or random images, but this has been a source of dissatisfaction. In this paper, in order to automatically create cover art that is matched with music contents, we propose a music album cover art creation scheme based on music contents analysis and result synthesis. We first (i) analyze music contents and their lyrics and extract representative keywords, (ii) expand the keywords using WordNet and generate various queries, (iii) retrieve related images from the Web using those queries, and finally (iv) synthesize them according to the user preference for album cover art. To show the effectiveness of our scheme, we developed a prototype system and reported some results.

Proposal of keyword extraction method based on morphological analysis and PageRank in Tweeter (트위터에서 형태소 분석과 PageRank 기반 화제단어 추출 방법 제안)

  • Lee, Won-Hyung;Cho, Sung-Il;Kim, Dong-Hoi
    • Journal of Digital Contents Society / v.19 no.1 / pp.157-163 / 2018
  • People who use SNS publish diverse ideas there every day, so the data posted on SNS contains many people's thoughts and opinions. In particular, the popular keywords served on Twitter are produced by counting frequently appearing words in user posts and ranking them. However, because this method simply tallies repeated words, it is sensitive to unnecessary data. The proposed method ranks words by their topical importance using a graph of the relationships between words, so it is less influenced by unnecessary data and can extract the main words stably. To compare performance in terms of the descending keyword rank and the ratio of meaningless keywords among the top 20 keywords, we compare the proposed scheme, which is based on morphological analysis and PageRank, with the existing scheme, which is based on the number of appearances. As a result, meaningless keywords made up 55% of the top 20 keywords under the proposed scheme and 70% under the existing scheme, an improvement of about 15 percentage points.
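
The graph-based ranking described here can be sketched as follows. This is a minimal illustration, not the authors' system. It assumes the posts have already been reduced to content-word tokens by a morphological analyzer, links words that co-occur in the same post, and runs a standard PageRank power iteration. The function names, the damping factor 0.85, and the iteration count are conventional or hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(posts):
    """Link every pair of distinct words that appear in the same post."""
    graph = defaultdict(set)
    for tokens in posts:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected word graph."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


if __name__ == "__main__":
    posts = [["election", "vote"], ["election", "debate"], ["election", "poll"]]
    ranks = pagerank(build_graph(posts))
    for word in sorted(ranks, key=ranks.get, reverse=True):
        print(word, round(ranks[word], 3))
```

A word that co-occurs with many different words accumulates rank from all of them, so it rises above words that are merely repeated within a single post, which is the weakness of pure frequency counting that the abstract points out.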