• Title/Summary/Keyword: keyword extraction

Search Result 189, Processing Time 0.025 seconds

An Extraction Method of Bibliographic Information from the US Patents: Using an HTML Parsing Technique (미국 특허 서지정보 추출 방법에 대한 연구: HTML 파싱 기법의 활용을 중심으로)

  • Han, Yoo-Jin;Oh, Seung-Woo
    • Journal of the Korean Society for information Management
    • /
    • v.27 no.2
    • /
    • pp.7-20
    • /
    • 2010
  • This study aims to provide a method of extracting the most recent information on US patent documents. An HTML paring technique that can directly connect to the US Patent and Trademark Office (USPTO) Web page is adopted. After obtaining a list of 50 documents through a keyword searching method, this study suggested an algorithm, using HTML parsing techniques, which can extract a patent number, an applicant, and the US patent class information. The study also revealed an algorithm by which we can extract both patents and subsequent patents using their closely connected relationship, that is a very distinctive characteristic of US patent documents. Although the proposed method has several limitations, it can supplement existing databases effectively in terms of timeliness and comprehensiveness.

Document Analysis based Main Requisite Extraction System (문서 분석 기반 주요 요소 추출 시스템)

  • Lee, Jongwon;Yeo, Ilyeon;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.23 no.4
    • /
    • pp.401-406
    • /
    • 2019
  • In this paper, we propose a system for analyzing documents in XML format and in reports. The system extracts the paper or reports of keywords, shows them to the user, and then extracts the paragraphs containing the keywords by inputting the keywords that the user wants to search within the document. The system checks the frequency of keywords entered by the user, calculates weights, and removes paragraphs containing only keywords with the lowest weight. Also, we divide the refined paragraphs into 10 regions, calculate the importance of the paragraphs per region, compare the importance of each region, and inform the user of the main region having the highest importance. With these features, the proposed system can provide the main paragraphs with higher compression ratio than analyzing the papers or reports using the existing document analysis system. This will reduce the time required to understand the document.

An Automatically Extracting Formal Information from Unstructured Security Intelligence Report (비정형 Security Intelligence Report의 정형 정보 자동 추출)

  • Hur, Yuna;Lee, Chanhee;Kim, Gyeongmin;Jo, Jaechoon;Lim, Heuiseok
    • Journal of Digital Convergence
    • /
    • v.17 no.11
    • /
    • pp.233-240
    • /
    • 2019
  • In order to predict and respond to cyber attacks, a number of security companies quickly identify the methods, types and characteristics of attack techniques and are publishing Security Intelligence Reports(SIRs) on them. However, the SIRs distributed by each company are huge and unstructured. In this paper, we propose a framework that uses five analytic techniques to formulate a report and extract key information in order to reduce the time required to extract information on large unstructured SIRs efficiently. Since the SIRs data do not have the correct answer label, we propose four analysis techniques, Keyword Extraction, Topic Modeling, Summarization, and Document Similarity, through Unsupervised Learning. Finally, has built the data to extract threat information from SIRs, analysis applies to the Named Entity Recognition (NER) technology to recognize the words belonging to the IP, Domain/URL, Hash, Malware and determine if the word belongs to which type We propose a framework that applies a total of five analysis techniques, including technology.

A Study on Keywords Extraction from Entertainment News using Bigdata Processing (빅데이터 처리를 통한 연예 뉴스에서의 키워드 추출에 관한 연구)

  • Yoo, Sang-Hyun;Lee, Sang-Jun
    • Jounal of The Korea Society of Information Technology Policy & Management
    • /
    • v.11 no.6
    • /
    • pp.1503-1507
    • /
    • 2019
  • With the softness of online entertainment news articles and the increasing number of quick-reporting articles in the entertainment sector, many people have access to entertainment front-page articles and are now able to make reviews of celebrities. It is not easy to systematically analyze which news articles are about which celebrities in a real-time environment, although their reputation is a key factor in the entertainment agency's business strategy, which should make the most of its affiliated celebrity resources. Based on the amount of celebrity references mentioned in entertainment news data, this paper proposes an entertainment news keyword analysis system, which extracts celebrities that are the subject of the article and associates them with the celebrity entertainment agency in question. Through the system proposed in this paper, advertisers or entertainment agencies can judge the value of the celebrity as reference material for the business. In addition, it can lay the groundwork for an investment strategy by predicting the outlook for the entertainment company for brokerages and investors.

Implementation of Web Based Video Learning Evaluation System Using User Profiles (사용자 프로파일을 이용한 웹 기반 비디오 학습 평가 시스템의 구현)

  • Shin Seong-Yoon;Kang Il-Ko;Lee Yang-Won
    • Journal of the Korea Society of Computer and Information
    • /
    • v.10 no.6 s.38
    • /
    • pp.137-152
    • /
    • 2005
  • In this Paper, we Propose an efficient web-based video learning evaluation system that is tailored to individual student's characteristics through the use of user profile-based information filtering. As a means of giving video-based questions, keyframes are extracted based on the location, size, and color information, and question-making intervals are extracted by means of differences in gray-level histograms as well as time windows. In addition, through a combination of the category-based system and the keyword-based system, questions for examination are given in order to ensure efficient evaluation. Therefore, students can enhance school achievement by making up for weak areas while continuing to identify their areas of interest.

  • PDF

Design and Implementation of the Extraction Mashup for Reported Disaster Information on SNSs (SNS에 제보되는 재해정보 추출 매시업 설계 및 구현)

  • Seo, Tae-Woong;Park, Man-Gon;Kim, Chang-Soo
    • Journal of Korea Multimedia Society
    • /
    • v.16 no.11
    • /
    • pp.1297-1304
    • /
    • 2013
  • The quick report and propagate information are increasingly important because nowadays it is hard to predict the damages of flooding by unexpected heavy rain. In addition, there are not many ways to receive disaster information in real time. Accordingly, we designed the system which can earn information from a lot of messages on twitter. Above all, our system can extract and deploy disaster information by comparison with erstwhile social network service mash-up system as only broadcast media. Significant objective of this paper is to design the fastest extract disaster information system of mass media.

A Method of Predicting Service Time Based on Voice of Customer Data (고객의 소리(VOC) 데이터를 활용한 서비스 처리 시간 예측방법)

  • Kim, Jeonghun;Kwon, Ohbyung
    • Journal of Information Technology Services
    • /
    • v.15 no.1
    • /
    • pp.197-210
    • /
    • 2016
  • With the advent of text analytics, VOC (Voice of Customer) data become an important resource which provides the managers and marketing practitioners with consumer's veiled opinion and requirements. In other words, making relevant use of VOC data potentially improves the customer responsiveness and satisfaction, each of which eventually improves business performance. However, unstructured data set such as customers' complaints in VOC data have seldom used in marketing practices such as predicting service time as an index of service quality. Because the VOC data which contains unstructured data is too complicated form. Also that needs convert unstructured data from structure data which difficult process. Hence, this study aims to propose a prediction model to improve the estimation accuracy of the level of customer satisfaction by combining unstructured from textmining with structured data features in VOC. Also the relationship between the unstructured, structured data and service processing time through the regression analysis. Text mining techniques, sentiment analysis, keyword extraction, classification algorithms, decision tree and multiple regression are considered and compared. For the experiment, we used actual VOC data in a company.

A Normalization Method of Distorted Korean SMS Sentences for Spam Message Filtering (스팸 문자 필터링을 위한 변형된 한글 SMS 문장의 정규화 기법)

  • Kang, Seung-Shik
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.7
    • /
    • pp.271-276
    • /
    • 2014
  • Short message service(SMS) in a mobile communication environment is a very convenient method. However, it caused a serious side effect of generating spam messages for advertisement. Those who send spam messages distort or deform SMS sentences to avoid the messages being filtered by automatic filtering system. In order to increase the performance of spam filtering system, we need to recover the distorted sentences into normal sentences. This paper proposes a method of normalizing the various types of distorted sentence and extracting keywords through automatic word spacing and compound noun decomposition.

Performance Evaluations of Text Ranking Algorithms

  • Kim, Myung-Hwi;Jang, Beakcheol
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.2
    • /
    • pp.123-131
    • /
    • 2020
  • The text ranking algorithm is a representative method for keyword extraction, and its importance is emphasized highly. In this paper, we compare the performance of recent research and experiments with TF-IDF, SMART, INQUERY and CCA algorithms, which are used in text ranking algorithm.. After explaining each algorithm, we compare the performance of each algorithm based on the data collected from news and Twitter. Experimental results show that all of four algorithms can extract specific words from news data equally. However, in the case of Twitter, CCA has the best performance to extract specific words, and INQUERY shows the worst performance. We also analyze the accuracy of the algorithm through six comparison metrics. The experimental results present that CCA shows the best accuracy in the news data. In case of Twitter, TF-IDF and CCA show similar performance and demonstrate good performance.

Multi-Topic Meeting Summarization using Lexical Co-occurrence Frequency and Distribution (어휘의 동시 발생 빈도와 분포를 이용한 다중 주제 회의록 요약)

  • Lee, Byung-Soo;Lee, Jee-Hyong
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2015.07a
    • /
    • pp.13-16
    • /
    • 2015
  • 본 논문에서는 어휘의 동시 발생 (co-occurrence) 빈도와 분포를 이용한 회의록 요약방법을 제안한다. 회의록은 일반 문서와 달리 문서에 여러 세부적인 주제들이 나타나며, 잘못된 형식의 문장, 불필요한 잡담들을 포함하고 있기 때문에 이러한 특징들이 문서요약 과정에서 고려되어야 한다. 기존의 일반적인 문서요약 방법은 하나의 주제를 기반으로 문서 전체에서 가장 중요한 문장으로 요약하기 때문에 다중 주제 회의록 요약에는 적합하지 않다. 제안한 방법은 먼저 어휘의 동시 발생 (co-occurrence) 빈도를 이용하여 회의록 분할 (segmentation) 과정을 수행한다. 다음으로 주제의 구분에 따라 분할된 각 영역 (block)의 중요 단어 집합 생성, 중요 문장 추출 과정을 통해 회의록의 중요 문장들을 선별한다. 마지막으로 추출된 중요 문장들의 위치, 종속 관계를 고려하여 최종적으로 회의록을 요약한다. AMI meeting corpus를 대상으로 실험한 결과, 제안한 방법이 baseline 요약 방법들보다 요약 비율에 따른 평가 및 요약문의 세부 주제별 평가에서 우수한 요약 성능을 보임을 확인하였다.

  • PDF