• Title/Summary/Keyword: topic extraction

News Topic Extraction based on Word Similarity (단어 유사도를 이용한 뉴스 토픽 추출)

  • Jin, Dongxu;Lee, Soowon
    • Journal of KIISE / v.44 no.11 / pp.1138-1148 / 2017
  • Topic extraction is a technology that automatically extracts a set of topics from a set of documents, and it has been a major research topic in natural language processing. Representative topic extraction methods include Latent Dirichlet Allocation (LDA) and word clustering-based methods. However, these methods suffer from problems such as repeated topics and mixed topics. The problem of repeated topics is one in which a specific topic is extracted as several topics, while the problem of mixed topics is one in which several topics are mixed in a single extracted topic. To solve these problems, this study proposes a method that extracts topics using an LDA that is robust against the repeated-topic problem, and then corrects the extracted topics by separating and merging them using the similarity between words. In the experiments, the proposed method performed better than the conventional LDA method.
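
The merging step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which word-similarity measure the authors use, so the Jaccard overlap of each topic's top words stands in for it here; the function names, the 0.5 threshold, and the sample topics are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_repeated_topics(topics, threshold=0.5):
    """Greedily merge topics whose top-word sets overlap at or above threshold.

    `topics` is a list of top-word lists, for example taken from an LDA model.
    The overlap measure is an assumption, not the paper's similarity function.
    """
    merged = []
    for words in topics:
        for group in merged:
            if jaccard(group, words) >= threshold:
                group.update(words)  # fold the repeated topic into this group
                break
        else:
            merged.append(set(words))  # no similar group yet: start a new one
    return [sorted(group) for group in merged]


topics = [
    ["election", "vote", "party", "candidate"],
    ["vote", "election", "candidate", "poll"],
    ["stock", "market", "price", "investor"],
]
result = merge_repeated_topics(topics)
```

Here the first two topics share three of five distinct words, so they collapse into one, while the finance topic stays separate.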

Topic Extraction and Classification Method Based on Comment Sets

  • Tan, Xiaodong
    • Journal of Information Processing Systems / v.16 no.2 / pp.329-342 / 2020
  • In recent years, emotional text classification has become one of the essential research topics in the field of natural language processing. It has been widely used in the sentiment analysis of reviews of commodities such as hotels, and of other commentary corpora. This paper proposes an improved W-LDA (weighted latent Dirichlet allocation) topic model to address the shortcomings of traditional LDA topic models. During Gibbs sampling of topic words and the calculation of the expected word distribution in the W-LDA topic model, an average weighted value is adopted to prevent topic-related words from being submerged by high-frequency words and thereby improve the distinction between topics. The method further integrates a support vector machine classification algorithm based on the extracted high-quality document-topic distribution and topic-word vectors. Finally, an efficient integration method is constructed for the analysis and extraction of emotional words, topic distribution calculations, and sentiment classification. Tests on real teaching evaluation data and on a test set from a public comment set show that the proposed method has distinct advantages over two other typical algorithms in terms of subject differentiation, classification precision, and F1-measure.

Analysis of trends in deep learning and reinforcement learning

  • Dong-In Choi;Chungsoo Lim
    • Journal of the Korea Society of Computer and Information / v.28 no.10 / pp.55-65 / 2023
  • In this paper, we apply topic extraction driven by the KeyBERT (Keyword extraction with Bidirectional Encoder Representations of Transformers) algorithm, together with topic frequency analysis, to deep learning and reinforcement learning research in order to discover the rapidly changing trends in these fields. First, we crawled abstracts of research papers on deep learning and reinforcement learning and divided them temporally into two groups. After pre-processing the crawled data, we extracted topics using the KeyBERT algorithm and then analyzed the extracted topics in terms of topic occurrence frequency. This analysis reveals distinct trends in research on all of the analyzed algorithms and applications, and it clearly shows which topics are gaining more interest. The analysis also demonstrates the effectiveness of the topic extraction and topic frequency analysis used here for research trend analysis, and this trend analysis scheme is expected to be applicable to other research fields. In addition, the analysis can provide insight into how deep learning will evolve in the near future, and it can guide the selection of research topics and methodologies by informing researchers of those that have recently been attracting attention.
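
The frequency-analysis step over two time periods might look like the sketch below. It assumes keyword lists have already been extracted per abstract (for instance with KeyBERT, whose call is not shown); the function names and sample keywords are hypothetical, and the share-difference score is an illustrative choice rather than the authors' metric.

```python
from collections import Counter


def topic_shares(docs):
    """Fraction of documents in which each topic keyword appears."""
    if not docs:
        return {}
    counts = Counter(topic for doc in docs for topic in set(doc))
    return {topic: count / len(docs) for topic, count in counts.items()}


def topic_frequency_shift(early_docs, late_docs):
    """Change in document share per topic between the two periods."""
    early, late = topic_shares(early_docs), topic_shares(late_docs)
    return {t: late.get(t, 0.0) - early.get(t, 0.0) for t in set(early) | set(late)}


# Hypothetical keyword lists, one list per abstract.
early = [["cnn", "image classification"], ["cnn", "lstm"]]
late = [["transformer", "cnn"], ["transformer", "policy gradient"]]

shift = topic_frequency_shift(early, late)
rising = max(shift, key=shift.get)  # topic with the largest gain in share
```

A positive value marks a topic that appears in a larger share of the later abstracts; a negative value marks one that is fading.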

Research trends in the Korean Journal of Women Health Nursing from 2011 to 2021: a quantitative content analysis

  • Ju-Hee Nho;Sookkyoung Park
    • Women's Health Nursing / v.29 no.2 / pp.128-136 / 2023
  • Purpose: Topic modeling is a text mining technique that extracts concepts from textual data and uncovers semantic structures and potential knowledge frameworks within context. This study aimed to identify major keywords and network structures for each major topic to discern research trends in women's health nursing published in the Korean Journal of Women Health Nursing (KJWHN) using text network analysis and topic modeling. Methods: The study targeted papers with English abstracts among 373 articles published in KJWHN from January 2011 to December 2021. Text network analysis and topic modeling were employed, and the analysis consisted of five steps: (1) data collection, (2) word extraction and refinement, (3) extraction of keywords and creation of networks, (4) network centrality analysis and key topic selection, and (5) topic modeling. Results: Six major keywords, each corresponding to a topic, were extracted through topic modeling analysis: "gynecologic neoplasms," "menopausal health," "health behavior," "infertility," "women's health in transition," and "nursing education for women." Conclusion: The latent topics from the target studies primarily focused on the health of women across all age groups. Research related to women's health is evolving with changing times and warrants further progress in the future. Future research on women's health nursing should explore various topics that reflect changes in social trends, and research methods should be diversified accordingly.
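
Steps (3) and (4) of the method, building a keyword network and ranking words by centrality, can be sketched as follows. The abstract does not name the centrality measure, so degree centrality over a co-occurrence network is assumed here; the function name and the sample keyword lists are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_degree_centrality(docs):
    """Degree centrality of each word in a document-level co-occurrence network.

    Two words are linked when they appear in the same document. Centrality is
    the number of distinct neighbors divided by (number of nodes - 1).
    """
    neighbors = defaultdict(set)
    for words in docs:
        for word in words:
            neighbors[word]  # register the node even if it has no neighbors
        for a, b in combinations(sorted(set(words)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {word: 0.0 for word in neighbors}
    return {word: len(linked) / (n - 1) for word, linked in neighbors.items()}


# Hypothetical refined keyword lists, one per abstract.
docs = [
    ["menopause", "health", "women"],
    ["infertility", "women"],
    ["health", "behavior", "women"],
]
centrality = cooccurrence_degree_centrality(docs)
```

Words with the highest centrality are natural candidates for the key-topic selection in step (4).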

A Design on Informal Big Data Topic Extraction System Based on Spark Framework (Spark 프레임워크 기반 비정형 빅데이터 토픽 추출 시스템 설계)

  • Park, Kiejin
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.521-526 / 2016
  • Because online informal text data are massive in volume and unstructured in nature, there are limitations in applying traditional relational data model technologies to data storage and data analysis. Moreover, it is hard to analyze social users' real-time reactions using massive, dynamically generated social data. In this paper, to easily capture the semantics of massive and informal online documents with an unsupervised learning mechanism, we design and implement an automatic topic extraction system based on the mass of the words that make up a document. The input data set for the proposed system is generated first, using an N-gram algorithm to build multi-word units that capture the meaning of sentences precisely, and Hadoop and Spark (an in-memory distributed computing framework) are adopted to run the topic model. In the experiments, TB-level input data are preprocessed and the proposed topic extraction steps are applied. We conclude that the proposed system performs well in extracting meaningful topics in time, because intermediate results come directly from main memory instead of being read from an HDD.
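
The N-gram preprocessing step can be shown with a small single-machine sketch. The paper runs on Hadoop and Spark, which is not reproduced here; whitespace tokenization, lowercasing, and the function name are assumptions made for illustration.

```python
def word_ngrams(text, n=2):
    """Return the word-level n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Spark runs topic models in memory", n=2)
```

Each multi-word unit can then be treated as a single term when the topic model is trained.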

Extraction of Latent Topic-based Communities in Blogspace (블로그 월드에서 주제 중심의 잠재적 커뮤니티 추출 방안)

  • Shin, Jung-Hwan;Yoon, Seok-Ho;Kim, Sang-Wook;Park, Sun-Ju
    • Journal of KIISE:Databases / v.37 no.1 / pp.56-69 / 2010
  • In blogspace, there are posts that deal with a common topic and bloggers who are interested in these posts. In this paper, we define a blog community as a group of these bloggers and posts. With a blog community, we can establish various business policies for target marketing, sharing high-quality data, and mobilizing activities in the blogspace. Unlike in internet cafes, bloggers participate in blog communities without explicit membership, so it is not easy to identify the members of a community. In this paper, we propose an effective approach for extracting a blog community that is related to a given topic. First, we choose seed posts that are highly related to the given topic and use them to select bloggers who are related to the topic. Then, we use the selected bloggers to select further posts that are related to the topic. By repeating this, we find all the posts and bloggers that are members of the community related to the given topic in blogspace. We verify the superiority of the proposed approach by analyzing the extracted blog communities.
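
The alternating expansion from seed posts to bloggers and back can be sketched as a fixed-point loop. The abstract does not define how relatedness is scored, so a boolean predicate `is_on_topic` is assumed, and the interaction maps (`post_bloggers`, `blogger_posts`) and sample data are hypothetical.

```python
def extract_community(seed_posts, post_bloggers, blogger_posts, is_on_topic):
    """Grow a community from seed posts until no new member is found.

    post_bloggers: post id -> bloggers who interacted with that post
    blogger_posts: blogger id -> posts that blogger interacted with
    is_on_topic:   predicate deciding whether a post relates to the topic
    """
    posts, bloggers = set(seed_posts), set()
    while True:
        # Bloggers reachable from the current posts.
        new_bloggers = {b for p in posts for b in post_bloggers.get(p, ())} - bloggers
        bloggers |= new_bloggers
        # On-topic posts reachable from the current bloggers.
        new_posts = {
            p for b in bloggers for p in blogger_posts.get(b, ()) if is_on_topic(p)
        } - posts
        posts |= new_posts
        if not new_bloggers and not new_posts:
            return posts, bloggers


post_bloggers = {"p1": {"alice", "bob"}, "p2": {"bob"}, "p3": {"carol"}, "p4": {"bob", "carol"}}
blogger_posts = {"alice": {"p1"}, "bob": {"p1", "p2", "p4"}, "carol": {"p3", "p4"}}
on_topic = {"p1", "p2", "p3"}

posts, bloggers = extract_community({"p1"}, post_bloggers, blogger_posts, on_topic.__contains__)
```

In this toy data, carol is only reachable through the off-topic post p4, so neither she nor p3 joins the community.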

Company Name Discrimination in Tweets using Topic Signatures Extracted from News Corpus

  • Hong, Beomseok;Kim, Yanggon;Lee, Sang Ho
    • Journal of Computing Science and Engineering / v.10 no.4 / pp.128-136 / 2016
  • It is impossible for any human being to analyze the more than 500 million tweets that are generated per day. Lexical ambiguities on Twitter make it difficult to retrieve the desired data and relevant topics. Most of the solutions for the word sense disambiguation problem rely on knowledge base systems. Unfortunately, it is expensive and time-consuming to manually create a knowledge base system, resulting in a knowledge acquisition bottleneck. To solve the knowledge-acquisition bottleneck, a topic signature is used to disambiguate words. In this paper, we evaluate the effectiveness of various features of newspapers on the topic signature extraction for word sense discrimination in tweets. Based on our results, topic signatures obtained from a snippet feature exhibit higher accuracy in discriminating company names than those from the article body. We conclude that topic signatures extracted from news articles improve the accuracy of word sense discrimination in the automated analysis of tweets.

A Development Method of Framework for Collecting, Extracting, and Classifying Social Contents

  • Cho, Eun-Sook
    • Journal of the Korea Society of Computer and Information / v.26 no.1 / pp.163-170 / 2021
  • As big data is used in various industries, the big data market is expanding from hardware to infrastructure software to service software. In particular, it is expanding into a huge platform market that provides applications for holistic and intuitive visualization, such as interpreting the meaning of big data and making analysis results easier to understand. Demand for big data extraction and analysis using social media such as SNS is very active, not only among companies but also among individuals. However, despite such high demand for the collection and analysis of social media data for user trend analysis and marketing, there is a lack of research addressing the difficulty of dynamic interlocking and the complexity of building and operating software platforms caused by the heterogeneity of the various social media service interfaces. In this paper, we propose a method for developing a framework that handles the process from collection to extraction and classification of social media data. The proposed framework solves the problem of heterogeneous social media data collection channels through the adapter pattern, and improves the accuracy of social topic extraction and classification through semantic association-based extraction techniques and topic association-based classification techniques.
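
The use of the adapter pattern for heterogeneous collection channels can be illustrated with a minimal sketch. Every class name, field name, and sample record below is hypothetical; no real social media API is called, and the framework's actual interfaces are not given in the abstract.

```python
from abc import ABC, abstractmethod


class SocialSourceAdapter(ABC):
    """Common interface that hides each channel's own record format."""

    @abstractmethod
    def fetch(self):
        """Return posts as dicts with 'author' and 'text' keys."""


class FeedAAdapter(SocialSourceAdapter):
    """Adapts a hypothetical channel that uses 'user' and 'body' fields."""

    def __init__(self, raw_items):
        self._raw = raw_items

    def fetch(self):
        return [{"author": i["user"], "text": i["body"]} for i in self._raw]


class FeedBAdapter(SocialSourceAdapter):
    """Adapts a hypothetical channel that uses 'handle' and 'message' fields."""

    def __init__(self, raw_items):
        self._raw = raw_items

    def fetch(self):
        return [{"author": i["handle"], "text": i["message"]} for i in self._raw]


def collect(adapters):
    """Gather posts from every channel through the shared interface."""
    return [post for adapter in adapters for post in adapter.fetch()]


collected = collect([
    FeedAAdapter([{"user": "kim", "body": "new phone review"}]),
    FeedBAdapter([{"handle": "lee", "message": "election debate tonight"}]),
])
```

Downstream extraction and classification code then depends only on the shared record shape, so adding a channel means adding one adapter.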

Query Expansion based on Knowledge Extraction and Latent Dirichlet Allocation for Clinical Decision Support (의학 문서 검색을 위한 지식 추출 및 LDA 기반 질의 확장)

  • Jo, Seung-Hyeon;Lee, Kyung-Soon
    • Annual Conference on Human and Language Technology / 2015.10a / pp.31-34 / 2015
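  • This paper proposes an LDA-based query expansion method for clinical decision support that extracts knowledge using UMLS and Wikipedia and makes use of query type information. The query consists of the symptoms that the patient is experiencing. Using UMLS and Wikipedia, we extract disease names together with the symptoms, diagnostic tests, and treatments related to each disease. From this extracted medical information, we identify the disease names related to the query. Using those disease names, we select additional symptoms, tests, and treatments as expansion terms. In addition, after running LDA, we extract the clusters related to the query from the Word-Topic clusters and the clusters highly related to the initial retrieval results from the Document-Topic clusters. Among the extracted Word-Topic and Document-Topic clusters, we find those that share the same cluster number. We then extract medical terms from those Word-Topic clusters and select them as expansion terms. To validate the proposed method, we carry out a comparative evaluation on the TREC Clinical Decision Support (CDS) 2014 test collection.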
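
The cluster-matching step can be sketched as a set intersection over cluster numbers. This sketch assumes that a Word-Topic cluster counts as related to the query when it contains a query term, and that each document has a single dominant cluster; both are simplifications, and all names and sample data are hypothetical.

```python
def select_expansion_terms(query_terms, word_topics, doc_topics, top_docs, medical_terms):
    """Pick medical terms from clusters shared by the query and the top documents.

    word_topics: cluster number -> top words of that cluster
    doc_topics:  document id -> dominant cluster number (simplifying assumption)
    top_docs:    ids of the initially retrieved documents
    """
    query = set(query_terms)
    # Word-Topic clusters that mention at least one query term.
    query_clusters = {k for k, words in word_topics.items() if query & set(words)}
    # Document-Topic clusters of the initial retrieval results.
    doc_clusters = {doc_topics[d] for d in top_docs if d in doc_topics}
    shared = query_clusters & doc_clusters  # same cluster number on both sides
    return sorted(
        {w for k in shared for w in word_topics[k] if w in medical_terms and w not in query}
    )


word_topics = {
    0: ["fever", "cough", "pneumonia", "antibiotic"],
    1: ["fracture", "x-ray", "cast"],
    2: ["fever", "rash", "measles"],
}
doc_topics = {"d1": 0, "d2": 0, "d3": 1}
medical_terms = {"fever", "cough", "pneumonia", "antibiotic", "measles", "x-ray"}

expansion = select_expansion_terms(
    ["fever", "cough"], word_topics, doc_topics, ["d1", "d2", "d3"], medical_terms
)
```

Only cluster 0 appears on both sides, so the expansion terms come from that cluster alone.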

Keyword Extraction from News Corpus using Modified TF-IDF (TF-IDF의 변형을 이용한 전자뉴스에서의 키워드 추출 기법)

  • Lee, Sung-Jick;Kim, Han-Joon
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.59-73 / 2009
  • Keyword extraction is an essential technique for text mining applications such as information retrieval, text categorization, summarization, and topic detection. A set of keywords extracted from large-scale electronic document data serves as significant features for text mining algorithms and contributes to improving the performance of document browsing, topic detection, and automated text classification. This paper presents a keyword extraction technique that can be used to detect topics for each news domain from a large document collection drawn from internet news portal sites. We use six variants of the traditional TF-IDF weighting model. On top of the TF-IDF model, we propose a word filtering technique called 'cross-domain comparison filtering'. To demonstrate the effectiveness of our method, we analyze the usefulness of keywords extracted from Korean news articles and present how the keywords of each news domain change over time.
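
For orientation, the baseline TF-IDF weighting that the paper builds on can be sketched as follows. This is the textbook form (relative term frequency times the log of inverse document frequency), not one of the paper's six variants, and the cross-domain comparison filter is not implemented because the abstract does not describe it. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    ranked = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return ranked


docs = [
    ["election", "vote", "election", "news"],
    ["stock", "market", "news", "stock"],
    ["match", "goal", "news", "goal"],
]
keywords = tfidf_keywords(docs)
```

The word "news" occurs in every document, so its inverse document frequency is zero and it never ranks as a keyword.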
