• Title/Summary/Keyword: topic model (토픽모델)

Search Result 176, Processing Time 0.028 seconds

Topic Expansion based on Infinite Vocabulary Online LDA Topic Model using Semantic Correlation Information (무한 사전 온라인 LDA 토픽 모델에서 의미적 연관성을 사용한 토픽 확장)

  • Kwak, Chang-Uk;Kim, Sun-Joong;Park, Seong-Bae;Kim, Kweon Yang
    • KIISE Transactions on Computing Practices / v.22 no.9 / pp.461-466 / 2016
  • Topic expansion improves the quality of a learned topic model by incorporating external data. Conventional online topic models are ill-suited to this, because they cannot incorporate words unseen during training into the learned model. This study proposes a topic expansion method based on infinite vocabulary online LDA. When an unseen word appears during learning, the proposed method computes the semantic correlation between the word and each topic and then assigns the word to a topic. We evaluated the method against an existing topic expansion method. By reflecting external documents, the proposed method captured additional information not contained in the broadcast scripts, and it also achieved higher topic coherence.

Topic-based Knowledge Graph-BERT (토픽 기반의 지식그래프를 이용한 BERT 모델)

  • Min, Chan-Wook;Ahn, Jin-Hyun;Im, Dong-Hyuk
    • Proceedings of the Korea Information Processing Society Conference / 2022.05a / pp.557-559 / 2022
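  • With recent advances in deep learning, natural language processing research on tasks such as Q&A, sentence recommendation, and named entity recognition is active. Research on improving the performance of the Transformer-based BERT model, which performs well in deep-learning-based natural language processing, is also under way. In this paper, we train the K-BERT model by classifying a knowledge graph by topic using latent Dirichlet allocation (LDA), a topic model, and by inferring the topic of each input sentence. Using the topic-classified knowledge graphs and the inferred topics, we propose an efficient way to use a large-scale knowledge graph in the K-BERT model.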

Representing the views of product data using extended Topic Maps (확장된 토픽맵을 이용한 제품 데이터에서의 관점의 표현)

  • 채희권;최영환;김광수
    • Proceedings of the Korean Operations and Management Science Society Conference / 2003.05a / pp.1157-1164 / 2003
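  • The product information model generated during product development is a UDM (Under Defined Model): it changes continually over time and contains undetermined information. In an information model, the viewpoint is an important element for representing and managing a UDM. An information model based on Topic Maps makes viewpoints easy to represent and helps people understand and manipulate information according to viewpoint. However, although Topic Maps can represent a UDM such as the information model of a product development process, they are not well suited to it. In this paper, we therefore extended the Topic Maps syntax to suit UDMs. We also discuss the process by which a UDM changes into an FDM (Fully Defined Model) applicable to e-commerce. As the UDM with viewpoints applied, we used the product model generated during product development. For the product model after mass production, or after decisions have been made during development, we used an FDM or a "definite UDM", whose semantics are more definite than those of a UDM. Using the product information model of a washing machine as an implementation example, we explain how a UDM changes into an FDM or a definite UDM.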

A case study of a broadcast script by using topic model (토픽 모델을 이용한 방송 대본 분석 사례 연구)

  • Noh, Yunseok;Kwak, Chang-Uk;Kim, Sun-Joong;Park, Seong-Bae;Lee, Sang-Jo
    • Annual Conference on Human and Language Technology / 2015.10a / pp.228-230 / 2015
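  • Broadcast scripts are among the most important text data available for broadcast content. In this paper, we analyze broadcast scripts with a topic model and present the results. To train the topic model, we construct documents at the scene level to analyze the scenes of a script, and at the character level to analyze the characters, and we examine the resulting characteristics. In the course of this analysis, we identify the characteristics of broadcast scripts and discuss directions for future research.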

A Prestigious University Students' Perceptions of their Educational Attainment by a Topic model (토픽모델을 활용한 명문대 재학생의 학벌에 관한 인식 분석)

  • Young Son Jung;Seung-Yun Lee
    • The Journal of the Convergence on Culture Technology / v.10 no.3 / pp.503-512 / 2024
  • This study examines essays on academic background written by students at a university regarded as prestigious in Korean society. Using Latent Dirichlet Allocation, 172 essays were analyzed to explore the students' perspectives on academic fractionalism. The analysis identified five topics: functional aspects (Topic 1), double-edged nature (Topic 2), power communities (Topic 3), symbols of victory (Topic 4), and dysfunctional aspects (Topic 5). The most frequent keywords are 'individual,' 'status,' and 'means' in Topic 1, 'definition,' 'school,' and 'meaning' in Topic 2, 'people,' 'origin,' and 'power' in Topic 3, 'university,' 'ability,' and 'effort' in Topic 4, and 'academic achievement,' 'South Korea,' and 'origin' in Topic 5. By exploring the topics, we found that students regarded class reproduction through education as an important social issue and showed little interest in other factors influencing academic fractionalism, such as race or ethnicity. These findings suggest that professors who teach the impact of education on academic fractionalism should also address the influence of diverse other factors.

Automatic Generating Stopword Methods for Improving Topic Model (토픽모델의 성능 향상을 위한 불용어 자동 생성 기법)

  • Lee, Jung-Been;In, Hoh Peter
    • Proceedings of the Korea Information Processing Society Conference / 2017.04a / pp.869-872 / 2017
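  • Stopword removal, one of the preprocessing steps for unstructured natural-language data collected for information retrieval and text analysis, is an easy and effective way to improve model quality. In topic modeling in particular, a technique for extracting the latent topics in a collection of text documents, removing stopwords from a list that is outdated or unrelated to the domain or nature of the collected documents reduces the coherence of the topic words the model learns. Analysts then have great difficulty interpreting the resulting topics correctly. To solve this problem, this paper proposes a technique that automatically generates stopwords using pointwise mutual information (PMI) extracted from documents in the relevant domain, instead of the commonly used standard stopwords. When topic model quality was measured by perplexity for the generated and the standard stopwords, the 30 stopwords generated by the proposed technique yielded higher model performance than the 421 standard stopwords.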
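
The abstract above does not say how PMI scores are turned into a stopword list, so the following is only a minimal sketch of the underlying quantity: document-level PMI, log2(p(a,b) / (p(a) p(b))), estimated from co-occurrence counts. The function names `pmi_table` and `candidate_stopwords`, and the selection rule (words whose mean absolute PMI is closest to zero), are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def pmi_table(docs):
    """Return document-level PMI (base 2) for every co-occurring word pair."""
    n_docs = len(docs)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing each word pair
    for doc in docs:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    pmi = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def candidate_stopwords(docs, k):
    """Hypothetical rule: pick the k words whose mean |PMI| is smallest,
    i.e. words whose presence says little about which other words appear."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (a, b), score in pmi_table(docs).items():
        for word in (a, b):
            totals[word] += abs(score)
            counts[word] += 1
    mean_abs = {w: totals[w] / counts[w] for w in counts}
    return sorted(mean_abs, key=lambda w: (mean_abs[w], w))[:k]


if __name__ == "__main__":
    docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
    print(candidate_stopwords(docs, 1))
```

A word that occurs in every document has PMI 0 with every partner, which is why a rule of this kind would flag it first; a real system would also need tokenization and a frequency threshold.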

Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling (언어 자원과 토픽 모델의 순차 매칭을 이용한 유사 문장 계산 기반의 위키피디아 한국어-영어 병렬 말뭉치 구축)

  • Cheon, JuRyong;Ko, YoungJoong
    • Journal of KIISE / v.42 no.7 / pp.901-909 / 2015
  • To build a Korean-English parallel corpus from Wikipedia, this paper proposes a method for finding similar sentences based on language resources and topic modeling. We first applied language resources (a Wiki-dictionary, numbers, and the Daum online dictionary) to match words sequentially. We constructed the Wiki-dictionary from Wikipedia titles. To take advantage of Wikipedia, we used translation probabilities in the Wiki-dictionary for word matching. In addition, we improved the accuracy of the sentence similarity measure by using word distributions from topic modeling. In the experiments, a previous study achieved an F1-score of 48.4% using only language resources with linear combination, and 51.6% when additionally using topic modeling over entire word distributions. Our proposed sequential matching, which adds translation probability to the language resources, achieved 58.3%, which is 9.9 percentage points better than the previous study. When sequential matching of language resources was combined with topic modeling restricted to important word distributions, the proposed system achieved 59.1%, which is 7.5 percentage points better than the previous study.

Analysis System for SNS Issues per Country based on Topic Model (토픽 모델 기반의 국가 별 SNS 관심 이슈 분석 시스템)

  • Kim, Seong Hoon;Yoon, Ji Won
    • Journal of KIISE / v.43 no.11 / pp.1201-1209 / 2016
  • As the use of SNS continues to increase, various related studies have been conducted. Because topic models are effective for theme extraction, many studies on topic-model-based analysis have been introduced. In this research, we propose an automated system that analyzes the topics of each country and their distribution on Twitter by combining world map visualization with an issue matching method. The system consists of three core modules: 1) collection of tweets and classification by nation, 2) extraction of topics and their distribution by country using a topic model algorithm, and 3) visualization of topics and their distribution using Google GeoChart. In experiments with the USA and the UK, we identified the issues in the two nations and how they changed. Based on these results, we analyzed the differences in each nation's position on the ISIS problem.

Topic maps Matching and Merging Techniques based on Partitioning of Topics (토픽 분할에 의한 토픽맵 매칭 및 통합 기법)

  • Kim, Jung-Min;Chung, Hyun-Sook
    • The KIPS Transactions:PartD / v.14D no.7 / pp.819-828 / 2007
  • In this paper, we propose a topic maps matching and merging approach based on the syntactic and semantic characteristics and constraints of topic maps. Previous schema matching approaches were developed to enhance the effectiveness and generality of matching techniques. However, they are inefficient because they must transform the input ontologies into graphs and consider all nodes and edges of those graphs, which requires a great amount of processing time. Currently, the standard languages for developing ontologies are RDF/OWL and Topic Maps. We propose an enhanced matching and merging technique based on topic partitioning, several matching operations, and merge conflict detection.

Semantic Dependency Link Topic Model for Biomedical Acronym Disambiguation (의미적 의존 링크 토픽 모델을 이용한 생물학 약어 중의성 해소)

  • Kim, Seonho;Yoon, Juntae;Seo, Jungyun
    • Journal of KIISE / v.41 no.9 / pp.652-665 / 2014
  • Many important terms in biomedical text are expressed as abbreviations or acronyms. We propose a semantic link topic model, based on the concepts of topic and dependency link, to disambiguate biomedical abbreviations and to cluster long-form variants of abbreviations that refer to the same sense. The model is a generative model inspired by the latent Dirichlet allocation (LDA) topic model, in which each document is viewed as a mixture of topics and each topic is characterized by a distribution over words. The words of a document are thus generated from the hidden topic structure of the document, and the topic structure is inferred from the observable word sequences of the document collection. In this study, we allow two distinct modes of word generation in order to incorporate semantic dependencies between words, particularly between expansions (long forms) of abbreviations and the words that co-occur with them in a sentence. In addition to topic information, the semantic dependency between words is defined as a link, and a new random parameter for link presence is assigned to each word. As a result, the most probable expansions of the abbreviations in a given abstract are determined by the word-topic, document-topic, and word-link distributions estimated from the document collection through the semantic dependency link topic model. The data set consisted of abstracts retrieved from the MEDLINE Entrez interface using queries for 22 abbreviations and their 186 expansions. The link topic model predicted the expansions of abbreviations with an accuracy of 98.30%.
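
The LDA generative story this abstract builds on (each document is a mixture of topics, each topic a distribution over words) can be sketched as below. This is plain LDA only, not the paper's link model: the dependency-link variable is omitted, and the function names, the toy topics, and the Dirichlet parameter values are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under plain LDA.

    topics: list of dicts mapping word -> probability (one dict per topic)
    alpha:  Dirichlet parameters for the document-topic mixture
    Returns the topic mixture theta and a list of (topic index, word) pairs.
    """
    theta = sample_dirichlet(alpha, rng)        # document-topic distribution
    tokens = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab = list(topics[z])
        weights = [topics[z][w] for w in vocab]
        word = rng.choices(vocab, weights=weights)[0]          # pick a word
        tokens.append((z, word))
    return theta, tokens


if __name__ == "__main__":
    # Toy topics, invented for illustration only.
    toy_topics = [
        {"gene": 0.5, "protein": 0.5},
        {"court": 0.5, "law": 0.5},
    ]
    theta, tokens = generate_document(toy_topics, [0.5, 0.5], 10, random.Random(0))
    print(theta, tokens)
```

Inference runs this story in reverse: given only the words, it estimates the document-topic and word-topic distributions that the abstract says are used to rank candidate expansions.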