• Title/Abstract/Keyword: Document summarization

Search Result 114, Processing Time 0.025 seconds

Multi-document Summarization using Non-negative Matrix Factorization and NMF Clustering Method (비음수 행렬 인수분해와 NMF 군집방법을 이용한 다중문서요약)

  • Park, Sun;Lee, Ju-Hong;Kim, Chul-Won
    • Proceedings of the Korea Information Processing Society Conference / 2008.05a / pp.427-430 / 2008
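  • This paper proposes a new method for summarizing multiple documents using non-negative matrix factorization (NMF) and an NMF clustering method. The semantic features computed by NMF reflect the inherent structure of the documents, so extracting sentences with them can improve summary quality. Clustering sentences by their semantic variables takes both the similarity and the diversity among sentences into account, which makes it easy to remove redundant information when summarizing.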
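
As a rough illustration of the NMF step described in this abstract, the sketch below factors a toy term-by-sentence matrix A into W and H with the standard multiplicative update rules, then picks one sentence per row of H. It is a minimal sketch, not the authors' method: the matrix values, the `nmf` and `pick_sentences` helpers, and the reading of W's columns as semantic features and H's rows as semantic variables are assumptions made for illustration, and the paper's clustering step is not reproduced.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nmf(a, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix a (terms x sentences) into w @ h
    using multiplicative updates for the Frobenius-norm objective."""
    rng = random.Random(seed)
    m, n = len(a), len(a[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


def reconstruction_error(a, w, h):
    """Frobenius norm of a - w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for ra, rb in zip(a, approx)
               for x, y in zip(ra, rb)) ** 0.5


def pick_sentences(h):
    """Assumed selection rule: for each row of h, take the sentence
    (column) with the largest weight; return the distinct indices."""
    return sorted({max(range(len(row)), key=row.__getitem__) for row in h})


# Hypothetical term-by-sentence counts: rows are terms, columns are sentences.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
]
W, H = nmf(A, rank=2)
selected = pick_sentences(H)
```

Here `selected` holds the indices of the sentences that would go into the extract; a fuller implementation would also deduplicate near-identical sentences, which is the role the abstract assigns to clustering.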

Domain-Adaptive Pre-training for Korean Document Summarization (도메인 적응 사전 훈련 (Domain-Adaptive Pre-training, DAPT) 한국어 문서 요약)

  • Hyungkuk Jang;Hyuncheol, Jang
    • Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.843-845 / 2024
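  • This study on Korean document summarization applies domain-adaptive pre-training (DAPT) to improve the understanding and summarization of documents from specific domains. A pre-trained language model undergoes additional pre-training on domain-specific datasets so that it achieves performance optimized for a particular domain, beyond general language understanding. Specifically, the model is fine-tuned on Korean text collected from domains such as medicine, law, and technology, and the resulting model is shown to handle domain-specific terminology and context effectively. The evaluation compares an existing pre-trained model with the DAPT model to verify the effect of DAPT. The DAPT model showed improved performance on domain-specific document summarization, and it is expected to be useful in practical domain-specific applications as well.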

Study on Designing and Implementing Online Customer Analysis System based on Relational and Multi-dimensional Model (관계형 다차원모델에 기반한 온라인 고객리뷰 분석시스템의 설계 및 구현)

  • Kim, Keun-Hyung;Song, Wang-Chul
    • The Journal of the Korea Contents Association / v.12 no.4 / pp.76-85 / 2012
  • Opinion mining lets us analyze how positively or negatively customers feel about important entities or attributes in online customer reviews. However, existing opinion mining techniques provide only simple functions for analyzing reviews. In this paper, we propose a novel technique that analyzes online customer reviews multi-dimensionally by modifying existing OLAP techniques so that they can be applied to text data. The resulting multi-dimensional analytic model consists of noun, adjective, and document axes, which are converted into four relational tables in a relational database. The model can serve as a new framework that brings together existing opinion mining, information summarization, and clustering algorithms. We implemented the multi-dimensional analysis model and its algorithms, and found that the system enables online customer reviews to be analyzed in more complex ways.

An Improved Automatic Text Summarization Based on Lexical Chaining Using Semantical Word Relatedness (단어 간 의미적 연관성을 고려한 어휘 체인 기반의 개선된 자동 문서요약 방법)

  • Cha, Jun Seok;Kim, Jeong In;Kim, Jung Min
    • Smart Media Journal / v.6 no.1 / pp.22-29 / 2017
  • With the recent rapid advancement and spread of smart devices, document data on the Internet is increasing sharply. The growth of information on the Web, including a massive number of documents, makes it increasingly difficult for users to understand the data. Various studies on automatic summarization are under way to summarize documents efficiently. This study uses the TextRank algorithm. TextRank represents sentences or keywords as a graph and estimates the importance of sentences from the graph's vertices and edges, which capture semantic relations between words and sentences. It extracts high-ranking keywords and, based on those keywords, extracts important sentences. To extract important sentences, the algorithm first groups vocabulary using a specific weighting scale. The program picks out sentences with higher scores on that scale and, from the selected sentences, extracts the important sentences that summarize the document. The study reports improved performance over the summarization methods of previous studies and concludes that the algorithm can summarize documents more efficiently.
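
The graph-ranking idea in this abstract can be sketched with plain sentence-level TextRank. The code below is an illustrative sketch only: it uses the word-overlap similarity from the original TextRank formulation (computed over sets of distinct words for simplicity) and a weighted PageRank iteration, and it does not include the lexical chaining or vocabulary grouping the paper adds. The function names, the example sentences, and the parameter defaults are assumptions.

```python
import math
import re


def tokenize(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by log sentence lengths."""
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_n=2):
    """Return the top_n highest-scoring sentences in document order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


# Hypothetical input: three related sentences and one unrelated sentence.
example = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "The mouse hid under the mat.",
    "Stocks fell sharply today.",
]
summary = summarize(example, top_n=2)
```

A sentence that shares no words with the others receives no incoming weight and keeps only the base score of `1 - damping`, so it is never preferred over connected sentences.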

An Automatically Extracting Formal Information from Unstructured Security Intelligence Report (비정형 Security Intelligence Report의 정형 정보 자동 추출)

  • Hur, Yuna;Lee, Chanhee;Kim, Gyeongmin;Jo, Jaechoon;Lim, Heuiseok
    • Journal of Digital Convergence / v.17 no.11 / pp.233-240 / 2019
  • To predict and respond to cyber attacks, many security companies quickly identify the methods, types, and characteristics of attack techniques and publish Security Intelligence Reports (SIRs) on them. However, the SIRs distributed by each company are large and unstructured. In this paper, we propose a framework that applies five analysis techniques to structure a report and extract its key information, reducing the time needed to extract information from large unstructured SIRs. Because SIR data have no ground-truth labels, we apply four of the techniques through unsupervised learning: keyword extraction, topic modeling, summarization, and document similarity. Finally, we build data for extracting threat information from SIRs and apply Named Entity Recognition (NER) to recognize words belonging to the IP, Domain/URL, Hash, and Malware types and to determine which type each word belongs to. The framework thus applies a total of five analysis techniques, including NER.

Cross-Lingual Style-Based Title Generation Using Multiple Adapters (다중 어댑터를 이용한 교차 언어 및 스타일 기반의 제목 생성)

  • Yo-Han Park;Yong-Seok Choi;Kong Joo Lee
    • KIPS Transactions on Software and Data Engineering / v.12 no.8 / pp.341-354 / 2023
  • The title of a document is a brief summary of the document. Readers can understand a document more easily if its title is provided in their preferred style and language. In this research, we propose a cross-lingual and style-based title generation model using multiple adapters. To train the model, we need a parallel corpus in several languages with different styles. This kind of parallel corpus is quite difficult to construct; however, a monolingual title generation corpus of a single style can be built easily. Therefore, we apply a zero-shot strategy to generate a title in a different language and with a different style for an input document. The baseline model is a Transformer consisting of an encoder and a decoder, pre-trained on several languages. The model is then equipped with multiple adapters for translation, languages, and styles. After the model learns a translation task from a parallel corpus, it learns a title generation task from a monolingual title generation corpus. When training the model on a task, we activate only the adapter that corresponds to that task. When generating a cross-lingual and style-based title, we activate only the adapters that correspond to the target language and the target style. Experimental results show that our proposed model is only as good as a pipeline model that first translates into the target language and then generates a title. The emergence of large-scale language models has brought significant changes to natural language generation. However, research on improving natural language generation with limited resources and limited data needs to continue. In this regard, this study seeks to explore the significance of such research.

Comments Classification System using Topic Signature (Topic Signature를 이용한 댓글 분류 시스템)

  • Bae, Min-Young;Cha, Jeong-Won
    • Journal of KIISE:Software and Applications / v.35 no.12 / pp.774-779 / 2008
  • In this work, we describe a comment classification system using topic signatures. Topic signatures are widely used for feature selection in document classification and summarization. Comments are short and contain many word-spacing errors and special characters. We first convert each comment into 7-grams and treat each 7-gram as a sentence. We then convert each 7-gram into 3-grams and treat each 3-gram as a word. We select key features using topic signatures and classify new inputs with the Naive Bayes method. The experimental results show that the proposed method outperforms previous methods.
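
The two-level n-gram conversion described in this abstract can be sketched as follows. This is a guess at the preprocessing under stated assumptions: n-grams are taken over characters with a sliding window, and whitespace is removed first because the abstract says word spacing in comments is unreliable. Neither detail is confirmed by the abstract, and the helper names are hypothetical. Topic-signature feature selection and the Naive Bayes classifier are not shown.

```python
def char_ngrams(text, n):
    """Sliding-window character n-grams; a short text yields itself."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def comment_to_units(comment, sentence_n=7, word_n=3):
    """Turn a comment into pseudo-sentences (7-grams), each represented
    as a list of pseudo-words (3-grams)."""
    # Assumption: drop all whitespace, since spacing errors are common.
    compact = "".join(comment.split())
    return [char_ngrams(s, word_n) for s in char_ngrams(compact, sentence_n)]


units = comment_to_units("abc defghi")
# units[0] == ["abc", "bcd", "cde", "def", "efg"]
```

Each inner list can then be treated like a tokenized sentence by any downstream feature selector or classifier.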

News Clustering and Multi-Document Summarization for Real-time Issue Analysis (실시간 이슈 분석을 위한 뉴스 군집화 및 다중 문서 요약)

  • Yu, Hongyeon;Lee, Seungwoo;Ko, Youngjoong
    • Annual Conference on Human and Language Technology / 2018.10a / pp.132-137 / 2018
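  • News-based real-time issue analysis requires technology that takes the set of news articles generated in real time as input, clusters them incrementally, and automatically summarizes the information in each cluster. Although clustering and summarization of static data have each been studied actively, research on incremental clustering and summarization for large volumes of data arriving in real time is very scarce. This paper therefore proposes an incremental, hierarchical news clustering and multi-document summarization method for analyzing large sets of news articles arriving in real time. For evaluation, we used real data from the two months of October and November 2016, and professionally trained researchers carried out a qualitative evaluation based on Precision at k. As a result, over 12 automatically generated clusters, clustering performance averaged 66% (upper level $l_1$: 82%, lower level $l_2$: 43%) and summarization performance averaged 92%.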
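
For readers unfamiliar with the metric named in this abstract, Precision at k is the fraction of the top k ranked items that a judge marks as correct. The sketch below shows the standard definition only; the judgment list is made up for illustration and is not data from the paper, and how the authors chose k or aggregated across clusters is not stated in the abstract.

```python
def precision_at_k(judgments, k):
    """Fraction of the first k ranked items judged correct.

    judgments: list of booleans in rank order (True = judged correct).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for ok in judgments[:k] if ok) / k


# Hypothetical human judgments for five ranked items in one cluster.
judged = [True, True, False, True, False]
p_at_3 = precision_at_k(judged, 3)   # two of the top three are correct
p_at_5 = precision_at_k(judged, 5)   # three of the top five are correct
```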

A Dependency Graph-Based Keyphrase Extraction Method Using Anti-patterns

  • Batsuren, Khuyagbaatar;Batbaatar, Erdenebileg;Munkhdalai, Tsendsuren;Li, Meijing;Namsrai, Oyun-Erdene;Ryu, Keun Ho
    • Journal of Information Processing Systems / v.14 no.5 / pp.1254-1271 / 2018
  • Keyphrase extraction is one of the fundamental natural language processing (NLP) tools for improving many text-mining applications such as document summarization and clustering. In this paper, we propose two novel techniques that build on state-of-the-art keyphrase extraction methods. The first is anti-patterns, which aim to recognize non-keyphrase candidates. State-of-the-art methods often use a rich feature set to identify keyphrases, but such feature sets cover only some keyphrases, because keyphrases share very few patterns and stylistic features whereas non-keyphrase candidates often share many. The second is to use a dependency graph instead of a word co-occurrence graph: a co-occurrence graph cannot connect two words that are syntactically related but far apart in a sentence, while a dependency graph can. In experiments, we compared performance across graph settings (co-occurrence and dependency) and against the results of existing methods. We found that the combination of the dependency graph and anti-patterns outperforms the state-of-the-art methods.

Sentence Extraction Using Adapting Method in Multi-Document Summarization (다중문서 요약에서 적응 기법을 이용한 문장 추출)

  • Lim, Jung-Min;Kang, In-Su;Bae, Jae-Hak J.;Lee, Jong-Hyeok
    • Annual Conference on Human and Language Technology / 2004.10d / pp.12-19 / 2004
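  • Conventional multi-document summarization produces a summary from all target documents at once. This paper instead proposes an adaptive method: it selects the document carrying the core content of the target set as the base document and extracts provisional summary sentences from it, then takes the remaining documents one at a time, extracts their important sentences, and selects the sentences needed for the summary by comparing the newly extracted sentences with the summary built so far. The system implementing the method was evaluated on the 29 multi-document sets used in NTCIR TSC 3. The adaptive system outperformed Lead, the TSC 3 baseline system, but did not show a clear performance advantage over the systems that participated in TSC 3.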
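
The sequential selection loop in this abstract can be sketched roughly as below. This is a minimal sketch under several assumptions that the abstract does not specify: each document is given as a list of already-extracted important sentences with the base document first, sentences are compared by Jaccard overlap of their words, and a candidate is kept only if it is not too similar to any sentence already in the summary. The threshold, the length cap, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adaptive_summary(documents, threshold=0.5, max_sentences=5):
    """Build a summary incrementally, document by document.

    documents: list of lists of important sentences; documents[0] is
    assumed to be the base document.
    """
    summary = []
    for doc in documents:
        for sentence in doc:
            if len(summary) >= max_sentences:
                return summary
            words = set(sentence.lower().split())
            # Keep the sentence only if it adds information that the
            # summary built so far does not already cover.
            if all(jaccard(words, set(s.lower().split())) < threshold
                   for s in summary):
                summary.append(sentence)
    return summary


# Hypothetical input: the second document repeats one base sentence.
docs = [
    ["the storm hit the coast", "thousands lost power"],
    ["the storm hit the coast", "schools will close on monday"],
]
result = adaptive_summary(docs)
```

In this toy run the repeated sentence from the second document is skipped and only its new sentence is appended to the summary.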
