• Title/Summary/Keyword: topic model

Search Result 870, Processing Time 0.021 seconds

Research on Multi-facted News Article Classification Models Classifying Subjects, Geographies and Genres (심층 주제, 지역, 장르를 모두 분류할 수 있는 다면적 뉴스 기사 자동 분류 모델 연구)

  • Hyojin Lee;SungPil Choi
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.58 no.3
    • /
    • pp.65-89
    • /
    • 2024
  • This study developed a model to classify news articles into categories of topic, genre, and region using a Korean Pre-trained Language model. To achieve this, a new news article classification system was designed by referring to the classification systems of domestic media outlets. The topic and genre classification models were implemented as hierarchical classification models that link the main categories and subcategories, and their performance was compared with that of an integrated category model. The evaluation results showed that the hierarchical structure classification model had the advantage of providing more precise categorization in ambiguous or overlapping categories compared to the integrated category model. For regional classification of news articles, a model was built to classify into 18 categories, and for regional news articles, the regional characteristics were clearly reflected in the text, resulting in high performance. This study demonstrated the effectiveness of classifying news articles from multiple perspectives-topic, genre, and region-and emphasized the significance of suggesting the potential for a multi-dimensional news article classification service that meets user needs.

Tweets analysis using a Dynamic Topic Modeling : Focusing on the 2019 Koreas-US DMZ Summit (트윗의 타임 시퀀스를 활용한 DTM 분석 : 2019 남북미정상회동 이벤트를 중심으로)

  • Ko, EunJi;Choi, SunYoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.2
    • /
    • pp.308-313
    • /
    • 2021
  • In this study, tweets about the 2019 Koreas-US DMZ Summit were collected along with a time sequence and analyzed by a sequential topic modeling method, Dynamic Topic Modeling(DTM). In microblogging services such as Twitter, unstructured data that mixes news and an opinion about a single event occurs at the same time on a large scale, and information and reactions are produced in the same message format. Therefore, to grasp a topic trend, the contextual meaning can be found only by performing pattern analysis reflecting the characteristics of sequential data. As a result of calculating the DTM after obtaining the topic coherence score and evaluating the Latent Dirichlet Allocation(LDA), 30 topics related to news reports and opinions were derived, and the probability of occurrence of each topic and keywords were dynamically evolving. In conclusion, the study found that DTM is a suitable model for analyzing the trend of integrated topics in a specific event over time.

XTM based Knowledge Exchanges for Product Configuration Modeling (XML Topic Map을 이용한 Product Configuration 지식 교환에 관한 연구)

  • Cho J.;Kwak H.W.;Kim H.;Kim H.S.;Lee J.H.;Cho J.M.;Hong C.S.;Do N.
    • Korean Journal of Computational Design and Engineering
    • /
    • v.11 no.1
    • /
    • pp.57-66
    • /
    • 2006
  • Modeling product configurations needs large amounts of knowledge about technical and marketing restrictions on the product. Previous attempts to automate product configurations concentrate on representations and management of the knowledge for specific domains in fixed and isolated computing environments. Since the knowledge about product configurations is subject to continuous change and hard to express, these attempts often failed to efficiently manage and exchange the knowledge in collaborative product development. In this paper, XML Topic Map (XTM) is introduced to represent and exchange the knowledge about product configurations in collaborative product development. A product configuration model based on XTM along with its merger and inference facilities enables configuration engineers In collaborative product development to manage and exchange their knowledge efficiently. An implementation of the proposed product configuration model is presented to demonstrate that the proposed approach enables enterprises to exchange the knowledge about product configurations during their collaborative product development.

A New Fine-grain SMS Corpus and Its Corresponding Classifier Using Probabilistic Topic Model

  • Ma, Jialin;Zhang, Yongjun;Wang, Zhijian;Chen, Bolun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.2
    • /
    • pp.604-625
    • /
    • 2018
  • Nowadays, SMS spam has been overflowing in many countries. In fact, the standards of filtering SMS spam are different from country to country. However, the current technologies and researches about SMS spam filtering all focus on dividing SMS message into two classes: legitimate and illegitimate. It does not conform to the actual situation and need. Furthermore, they are facing several difficulties, such as: (1) High quality and large-scale SMS spam corpus is very scarce, fine categorized SMS spam corpus is even none at all. This seriously handicaps the researchers' studies. (2) The limited length of SMS messages lead to lack of enough features. These factors seriously degrade the performance of the traditional classifiers (such as SVM, K-NN, and Bayes). In this paper, we present a new fine categorized SMS spam corpus which is unique and the largest one as far as we know. In addition, we propose a classifier, which is based on the probability topic model. The classifier can alleviate feature sparse problem in the task of SMS spam filtering. Moreover, we compare the approach with three typical classifiers on the new SMS spam corpus. The experimental results show that the proposed approach is more effective for the task of SMS spam filtering.

A Study on Design and Analysis of Metadata and Ontology based on Humanities and Social Sciences (기초학문자료 메타데이터 설계 분석 및 온톨로지 적용 방안 연구)

  • Lee, Jung-Yeoun;Kim, Jung-Min;Choi, Suk-Doo;Kim, Lee-Kyum
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.41 no.2
    • /
    • pp.291-316
    • /
    • 2007
  • The purpose of this study is to design metadata model for describing different kinds of concepts, properties, and semantic relationships of result materials of researches. We examine our metadata model to evaluate correctness and efficiency of the model through contents analysis of a constructed database. From the results of examination, we suggest more effective structure of metadata schema. Domain ontology could constructed by the enlarged thesaurus in order to overcome the limitation of the keyword search, therefore we design a philosophy and religion ontology based on subject classification to improve information retrieval and implement it using XML/Topic Maps to improve retrieval functionality of our database.

Falling Accidents Analysis in Construction Sites by Using Topic Modeling (토픽 모델링을 이용한 건설현장 추락재해 분석)

  • Ryu, Hanguk
    • Journal of the Korea Convergence Society
    • /
    • v.10 no.7
    • /
    • pp.175-182
    • /
    • 2019
  • We classify topics on fall incidents occurring in construction sites using topic modeling among machine learning techniques and analyze the causes of the accidents according to each topic. In order to apply topic modeling based on latent dirichlet allocation, text data was preprocessed and evaluated with Perplexity score to improve the reliability of the model. The most common falling accidents happened to the daily workers belonging to small construction site. Most of the causes were not operated properly due to lack of safety equipment, inadequacy of arrangement and wearing, and low performance of safety equipment. In order to prevent and reduce the falling accidents, it is important to educate the daily workers of small construction site, arrange the workplace, and check the wearing of personal safety equipment and device.

Research on Community Knowledge Modeling of Readers Based on Interest Labels

  • Kai, Wang;Wei, Pan;Xingzhi, Chen
    • Journal of Information Processing Systems
    • /
    • v.19 no.1
    • /
    • pp.55-66
    • /
    • 2023
  • Community portraits can deeply explore the characteristics of community structures and describe the personalized knowledge needs of community users, which is of great practical significance for improving community recommendation services, as well as the accuracy of resource push. The current community portraits generally have the problems of weak perception of interest characteristics and low degree of integration of topic information. To resolve this problem, the reader community portrait method based on the thematic and timeliness characteristics of interest labels (UIT) is proposed. First, community opinion leaders are identified based on multi-feature calculations, and then the topic features of their texts are identified based on the LDA topic model. On this basis, a semantic mapping including "reader community-opinion leader-text content" was established. Second, the readers' interest similarity of the labels was dynamically updated, and two kinds of tag parameters were integrated, namely, the intensity of interest labels and the stability of interest labels. Finally, the similarity distance between the opinion leader and the topic of interest was calculated to obtain the dynamic interest set of the opinion leaders. Experimental analysis was conducted on real data from the Douban reading community. The experimental results show that the UIT has the highest average F value (0.551) compared to the state-of-the-art approaches, which indicates that the UIT has better performance in the smooth time dimension.

Topic Modeling of Korean Newspaper Articles on Aging via Latent Dirichlet Allocation

  • Lee, So Chung
    • Asian Journal for Public Opinion Research
    • /
    • v.10 no.1
    • /
    • pp.4-22
    • /
    • 2022
  • The purpose of this study is to explore the structure of social discourse on aging in Korea by analyzing newspaper articles on aging. The analysis is composed of three steps: first, data collection and preprocessing; second, identifying the latent topics; and third, observing yearly dynamics of topics. In total, 1,472 newspaper articles that included the word "aging" within the title were collected from 10 major newspapers between 2006 and 2019. The underlying topic structure was analyzed using Latent Dirichlet Allocation (LDA), a topic modeling method widely adopted by text mining academics and researchers. Seven latent topics were generated from the LDA model, defined as social issues, death, private insurance, economic growth, national debt, labor market innovation, and income security. The topic loadings demonstrated a clear increase in public interest on topics such as national debt and labor market innovation in recent years. This study concludes that media discourse on aging has shifted towards more productivity and efficiency related issues, requiring older people to be productive citizens. Such subjectivation connotes a decreased role of the government and society by shifting the responsibility to individuals not being able to adapt successfully as productive citizens within the labor market.

WV-BTM: A Technique on Improving Accuracy of Topic Model for Short Texts in SNS (WV-BTM: SNS 단문의 주제 분석을 위한 토픽 모델 정확도 개선 기법)

  • Song, Ae-Rin;Park, Young-Ho
    • Journal of Digital Contents Society
    • /
    • v.19 no.1
    • /
    • pp.51-58
    • /
    • 2018
  • As the amount of users and data of NS explosively increased, research based on SNS Big data became active. In social mining, Latent Dirichlet Allocation(LDA), which is a typical topic model technique, is used to identify the similarity of each text from non-classified large-volume SNS text big data and to extract trends therefrom. However, LDA has the limitation that it is difficult to deduce a high-level topic due to the semantic sparsity of non-frequent word occurrence in the short sentence data. The BTM study improved the limitations of this LDA through a combination of two words. However, BTM also has a limitation that it is impossible to calculate the weight considering the relation with each subject because it is influenced more by the high frequency word among the combined words. In this paper, we propose a technique to improve the accuracy of existing BTM by reflecting semantic relation between words.

The Arms Race on the Road: Exploring Factors of SUVs' Popularity by LDA Topic Model (도로 위의 군비경쟁: LDA 토픽모델을 활용한 SUV의 인기 요인 탐구)

  • Jeon, Seung-Bong;Goh, Taekyeong
    • Journal of Digital Convergence
    • /
    • v.18 no.10
    • /
    • pp.239-252
    • /
    • 2020
  • By using text mining, we explore the factors responsible for an increase in SUV preference. We collected 32,679 posts related to SUVs from "Bobaedream," the largest online automobile community in South Korea, and applied the LDA topic model. While previous studies have explained the SUV boom as an individual's risk aversion strategy from crime, the result shows that the topic of 'Safety' appears to be an important factor in the SUV discourse in the context of a car accident and high-speed driving situation. To conclude, the consumption of SUVs in Korean society serves as a mean to prevent anxiety and danger to individuals when driving. We insist that decreasing social trust, caused by an increase in inequality, underlies the perception of risk on the road.