• Title/Summary/Keyword: Topic analysis

Search Result 2,029, Processing Time 0.036 seconds

Efficient Topic Modeling by Mapping Global and Local Topics (전역 토픽의 지역 매핑을 통한 효율적 토픽 모델링 방안)

  • Choi, Hochang;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.3
    • /
    • pp.69-94
    • /
    • 2017
  • Recently, increase of demand for big data analysis has been driving the vigorous development of related technologies and tools. In addition, development of IT and increased penetration rate of smart devices are producing a large amount of data. According to this phenomenon, data analysis technology is rapidly becoming popular. Also, attempts to acquire insights through data analysis have been continuously increasing. It means that the big data analysis will be more important in various industries for the foreseeable future. Big data analysis is generally performed by a small number of experts and delivered to each demander of analysis. However, increase of interest about big data analysis arouses activation of computer programming education and development of many programs for data analysis. Accordingly, the entry barriers of big data analysis are gradually lowering and data analysis technology being spread out. As the result, big data analysis is expected to be performed by demanders of analysis themselves. Along with this, interest about various unstructured data is continually increasing. Especially, a lot of attention is focused on using text data. Emergence of new platforms and techniques using the web bring about mass production of text data and active attempt to analyze text data. Furthermore, result of text analysis has been utilized in various fields. Text mining is a concept that embraces various theories and techniques for text analysis. Many text mining techniques are utilized in this field for various research purposes, topic modeling is one of the most widely used and studied. Topic modeling is a technique that extracts the major issues from a lot of documents, identifies the documents that correspond to each issue and provides identified documents as a cluster. It is evaluated as a very useful technique in that reflect the semantic elements of the document. Traditional topic modeling is based on the distribution of key terms across the entire document. Thus, it is essential to analyze the entire document at once to identify topic of each document. This condition causes a long time in analysis process when topic modeling is applied to a lot of documents. In addition, it has a scalability problem that is an exponential increase in the processing time with the increase of analysis objects. This problem is particularly noticeable when the documents are distributed across multiple systems or regions. To overcome these problems, divide and conquer approach can be applied to topic modeling. It means dividing a large number of documents into sub-units and deriving topics through repetition of topic modeling to each unit. This method can be used for topic modeling on a large number of documents with limited system resources, and can improve processing speed of topic modeling. It also can significantly reduce analysis time and cost through ability to analyze documents in each location or place without combining analysis object documents. However, despite many advantages, this method has two major problems. First, the relationship between local topics derived from each unit and global topics derived from entire document is unclear. It means that in each document, local topics can be identified, but global topics cannot be identified. Second, a method for measuring the accuracy of the proposed methodology should be established. That is to say, assuming that global topic is ideal answer, the difference in a local topic on a global topic needs to be measured. By those difficulties, the study in this method is not performed sufficiently, compare with other studies dealing with topic modeling. In this paper, we propose a topic modeling approach to solve the above two problems. First of all, we divide the entire document cluster(Global set) into sub-clusters(Local set), and generate the reduced entire document cluster(RGS, Reduced global set) that consist of delegated documents extracted from each local set. We try to solve the first problem by mapping RGS topics and local topics. Along with this, we verify the accuracy of the proposed methodology by detecting documents, whether to be discerned as the same topic at result of global and local set. Using 24,000 news articles, we conduct experiments to evaluate practical applicability of the proposed methodology. In addition, through additional experiment, we confirmed that the proposed methodology can provide similar results to the entire topic modeling. We also proposed a reasonable method for comparing the result of both methods.

Comparative Analysis of the Keywords in Taekwondo News Articles by Year: Applying Topic Modeling Method (태권도 뉴스기사의 연도별 주제어 비교분석: 토픽모델링 적용)

  • Jeon, Minsoo;Lim, Hyosung
    • Journal of Digital Convergence
    • /
    • v.19 no.11
    • /
    • pp.575-583
    • /
    • 2021
  • This study aims to analyze Taekwondo trends according to news articles by year by applying topic modeling. In order to examine the Taekwondo trend through media reports, articles including news articles and Taekwondo specialized media articles were collected through Big Kinds of the Korea Press Foundation. The search period was divided into three sections: before 2000, 2001~2010, and 2011~2020. A total of 12,124 items were selected as research data. For topic analysis, pre-processing was performed, and topic analysis was performed using the LDA algorithm. In this case, python 3 was applied for all analysis. First, as a result of analyzing the topics of media articles by year, 'World' was the most common keyword before 2000. 'South and North Korea' was next common and 'Olympic' was the third commonest topic. From 2001 to 2010, 'World' was the most common topic, followed by 'Association' and 'World Taekwondo'. From 2011 to 2020, 'World', 'Demonstration', and 'Kukkiwon' was the most common topic in that order. Second, as a result of analyzing news articles before 2000 by topic modeling, topics were divided into two categories. Specifically, Topic 1 was selected as 'South-North Korea sports exchange' and Topic 2 was selected as 'Adoption of Olympic demonstration events'. Third, as a result of analyzing news articles from 2001 to 2010 by topic modeling, three topics were selected. Topic 1 was selected as 'Taekwondo Demonstration Performance and Corruption', Topic 2 was selected as 'Muju Taekwondo Park Creation', and Topic 3 was selected as 'World Taekwondo Festival'. Fourth, as a result of analyzing news articles from 2011 to 2020 by topic modeling, three topics were selected. Topic 1 was selected as 'Successful Hosting of the 2018 Pyeongchang Winter Olympics', Topic 2 was selected as 'North-South Korea Taekwondo Joint Demonstration Performance', and Topic 3 was selected as '2017 Muju World Taekwondo Championships'.

Research Trend on Diabetes Mobile Applications: Text Network Analysis and Topic Modeling (당뇨병 모바일 앱 관련 연구동향: 텍스트 네트워크 분석 및 토픽 모델링)

  • Park, Seungmi;Kwak, Eunju;Kim, Youngji
    • Journal of Korean Biological Nursing Science
    • /
    • v.23 no.3
    • /
    • pp.170-179
    • /
    • 2021
  • Purpose: The aim of this study was to identify core keywords and topic groups in the 'Diabetes mellitus and mobile applications' field of research for better understanding research trends in the past 20 years. Methods: This study was a text-mining and topic modeling study including four steps such as 'collecting abstracts', 'extracting and cleaning semantic morphemes', 'building a co-occurrence matrix', and 'analyzing network features and clustering topic groups'. Results: A total of 789 papers published between 2002 and 2021 were found in databases (Springer). Among them, 435 words were extracted from 118 articles selected according to the conditions: 'analyzed by text network analysis and topic modeling'. The core keywords were 'self-management', 'intervention', 'health', 'support', 'technique' and 'system'. Through the topic modeling analysis, four themes were derived: 'intervention', 'blood glucose level control', 'self-management' and 'mobile health'. The main topic of this study was 'self-management'. Conclusion: While more recent work has investigated mobile applications, the highest feature was related to self-management in the diabetes care and prevention. Nursing interventions utilizing mobile application are expected to not only effective and powerful glycemic control and self-management tools, but can be also used for patient-driven lifestyle modification.

Topic Modeling Analysis of Beauty Industry using BERTopic and LDA

  • YANG, Hoe-Chang;LEE, Won-Dong
    • The Journal of Economics, Marketing and Management
    • /
    • v.10 no.6
    • /
    • pp.1-7
    • /
    • 2022
  • Purpose: The purpose of this study is identifying the research trends of degree papers related to the beauty industry and providing information which can contribute to the development of the domestic beauty industry and the direction of various research about beauty industry. Research design, data and methodology: This study used 154 academic papers and 189 academic papers with English abstracts out of 299 academic papers. All of these papers were found by searching for the keyword "beauty industry" in ScienceON on August 15, 2022. For the analysis, BERTopic and LDA (Latent Dirichlet Allocation) analysis were conducted using Python 3.7. Also, OLS regression analysis was conducted to understand the annual increase and decrease trend of each topic derived with trend analysis. Results: As a result of word frequency analysis, the frequency of satisfaction, management, behavior, and service was found to be high. In addition, it was found that 'service', 'satisfaction' and 'customer' were frequently associated with program and relationship in the word co-occurrence frequency analysis. As a result of topic modeling, six topics were derived: 'Beauty shop', 'Health education', 'Cosmetics', 'Customer satisfaction', 'Beauty education', and 'Beauty business'. The trend analysis result of each topic confirmed that 'Beauty education' and 'Health education' are getting more attention as time goes by. Conclusions: The future studies must resolve the extreme polarization between the structure of the small beauty industry and beauty stores. Furthermore, the researches have to direct various ways to create the performance of internal personnel. The ways to maximize product capabilities such as competitive cosmetics and brands are also needed attentions.

Keywords and Topic Analysis of Social Issues on Twitter Based on Text Mining and Topic Modeling (텍스트 마이닝과 토픽 모델링을 기반으로 한 트위터에 나타난 사회적 이슈의 키워드 및 주제 분석)

  • Kwak, Soo Jeong;Kim, Hyon Hee
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.8 no.1
    • /
    • pp.13-18
    • /
    • 2019
  • In this study, we investigate important keywords and their relationships among the keywords for social issues, and analyze topics to find subjects of the social issues. In particular, we collected twitter data with the keyword 'metoo' which has attracted much attention in these days, and perform keyword analysis and topic modeling. First, we preprocess the twitter data, identified important keywords, and analyzed the relatedness of the keywords. After then, topic modeling is performed to find subjects related to 'metoo'. Our experimental results showed that relatedness of keywords and subjects on social issues in twitter are well identified based on keyword analysis and topic modeling.

The Impacts of Online Game Reviews' Characteristics on Review Helpfulness: Based on Topic Modeling Analysis (온라인 게임 리뷰의 특성이 리뷰 유용성에 미치는 영향: 토픽모델링을 활용하여)

  • Bae, Sung Hun;Kim, Hyun Mook;Lee, Ui Jun;Lee, Sae Rom
    • The Journal of Information Systems
    • /
    • v.31 no.4
    • /
    • pp.161-187
    • /
    • 2022
  • Purpose This study analyzed the topic of game review contents and how the characteristics of game reviews affect the reviews helpfulness. In addition, this study explore the content of game reviews according to the game's sales strategy such as early access strategy and releasing without early access. Design/methodology/approach We collected a list of 3,572 action genre games released in 2020. 58,336 online reviews were collected by random sampling 50 reviews in each games, and topic modeling was performed on those reviews. We dynamized the results of topic modeling and analyzed the effect on review helpfulness with multiple regression analysis. Findings The results of analysis indicate that the longer the review is or the shorter the time it is written, the more helpful the review is. In addition the topic with positive and negative review has a significant effect on the review helpfulness. As a result of exploratory analysis, games from early access had relatively fewer reviews of story-related topics than games that were released without early access. These findings can present direct guidelines for collecting specific opinions from customers in the game industry when releasing games.

Seasonal analysis of Beach-related Issues using Local Newspaper Articles and Topic Modeling (지역신문기사 자료와 토픽모델링을 이용한 해변 관련 계절별 현안분석)

  • Yoo, Mu-Sang;Jeong, Su-Yeon;Kim, Geon-Hu;Sohn, Chul
    • Journal of the Korean Regional Science Association
    • /
    • v.34 no.4
    • /
    • pp.19-34
    • /
    • 2018
  • The purpose of this study is to analyze the seasonal issues using the local newspaper articles with the keyword beach from 2004 to 2017. Topic modeling and Time series regression analysis based on open source programs were performed for analysis. Topic modeling results showed 35 topics in spring, 47 topics in summer, 36 topics in autumn and 35 topics in winter. The common themes were 'beaches', 'festivals and events', 'accident and environmental issues', 'tourism', 'development and sale', 'administration and policy' and 'weather'. Time series regression analysis showed in the spring, 5 Hot-Topics and 2 Cold-Topic were found out of the 35 topics. In the summer, 6 Hot-Topics and 3 Cold-Topic were found out of the 47 topics. In the autumn, 4 Hot-Topics and 3 Cold-Topic were found out of the 36 topics. In the winter, 3 Hot-Topics and 3 Cold-Topic were found out of the 35 topics. And for each season, topics that do not fall into the Hot-Topic and Cold-Topic are classified as Neutral-Topic. In this study if seasonal uses are different such as beaches are deemed that seasonal topic modeling for analysis of regional issues will yield more useful results and enable detailed diagnosis.

Generative probabilistic model with Dirichlet prior distribution for similarity analysis of research topic

  • Milyahilu, John;Kim, Jong Nam
    • Journal of Korea Multimedia Society
    • /
    • v.23 no.4
    • /
    • pp.595-602
    • /
    • 2020
  • We propose a generative probabilistic model with Dirichlet prior distribution for topic modeling and text similarity analysis. It assigns a topic and calculates text correlation between documents within a corpus. It also provides posterior probabilities that are assigned to each topic of a document based on the prior distribution in the corpus. We then present a Gibbs sampling algorithm for inference about the posterior distribution and compute text correlation among 50 abstracts from the papers published by IEEE. We also conduct a supervised learning to set a benchmark that justifies the performance of the LDA (Latent Dirichlet Allocation). The experiments show that the accuracy for topic assignment to a certain document is 76% for LDA. The results for supervised learning show the accuracy of 61%, the precision of 93% and the f1-score of 96%. A discussion for experimental results indicates a thorough justification based on probabilities, distributions, evaluation metrics and correlation coefficients with respect to topic assignment.

Topics and Trends in Metadata Research

  • Oh, Jung Sun;Park, Ok Nam
    • Journal of Information Science Theory and Practice
    • /
    • v.6 no.4
    • /
    • pp.39-53
    • /
    • 2018
  • While the body of research on metadata has grown substantially, there has been a lack of systematic analysis of the field of metadata. In this study, we attempt to fill this gap by examining metadata literature spanning the past 20 years. With the combination of a text mining technique, topic modeling, and network analysis, we analyzed 2,713 scholarly papers on metadata published between 1995 and 2014 and identified main topics and trends in metadata research. As the result of topic modeling, 20 topics were discovered and, among those, the most prominent topics were reviewed in detail. In addition, the changes over time in the topic composition, in terms of both the relative topic proportions and the structure of topic networks, were traced to find past and emerging trends in research. The results show that a number of core themes in metadata research have been established over the past decades and the field has advanced, embracing and responding to the dynamic changes in information environments as well as new developments in the professional field.

Representing Topic-Comment Structures in Chinese

  • Pan, Haihua;Hu, Jianhua
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2002.02a
    • /
    • pp.382-390
    • /
    • 2002
  • Shi (2000) claims that topics must be related to a syntactic position in the comment, thus denying the existence of dangling topics in Chinese. Under Shi's analysis, the dangling topic sentences in Chinese are not topic-comment but subject-predicate sentences. However, Shi's arguments are not without problems. In this paper we argue that topics in Chinese can be licensed not only by a syntactic gap but also by a semantic gap/variable without syntactic realization. Under our analysis, all the dangling topics discussed in Shi (2000) are, in fact, not subjects but topics licensed by a semantic gap/variable that can turn the relevant comment into an open predicate, thus licensing dangling topics and deriving well-formed topic-comment constructions. Our analysis fares better than Shi's in not only unifying the licensing mechanism of a topic to an open predicate without considering how the open predicate is derived, but also unifying the treatment of normal and dangling topics in Chinese,

  • PDF