• Title/Summary/Keyword: news articles clustering


Hierarchical and Incremental Clustering for Semi Real-time Issue Analysis on News Articles (준 실시간 뉴스 이슈 분석을 위한 계층적·점증적 군집화)

  • Kim, Hoyong;Lee, SeungWoo;Jang, Hong-Jun;Seo, DongMin
    • The Journal of the Korea Contents Association / v.20 no.6 / pp.556-578 / 2020
  • Many studies analyze issues from real-time news streams, but few analyze issues hierarchically, and a previous hierarchical approach clusters more slowly as the number of news articles grows. In this paper, we propose a hierarchical and incremental clustering method for semi real-time issue analysis on news articles. We trained a weighted cosine similarity model based on a Siamese neural network, applied it to the k-means algorithm to form word clusters, and used these word clusters to convert news articles into document vectors. Finally, we initialized an issue cluster tree from the document vectors, updated the tree whenever new articles arrived, and analyzed issues in semi real-time. Experiments showed an improvement of up to about 0.26 in NMI, and incremental clustering ran about 10 times faster than the previous method.
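
The abstract above reports its gain in NMI (normalized mutual information). As a reminder of what that score measures, here is a minimal sketch that compares a predicted clustering with reference labels. The arithmetic-mean normalization is an assumption; the abstract does not say which variant the authors used.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items.

    Assumption: MI is normalized by the arithmetic mean of the two entropies.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information summed over label pairs that actually co-occur.
    mi = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    denom = (entropy(count_true) + entropy(count_pred)) / 2
    # Two single-cluster labelings have zero entropy; treat them as identical.
    return 1.0 if denom == 0 else mi / denom
```

The score depends only on how items are grouped, not on the label names, so relabelling the clusters leaves it unchanged.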

User Oriented clustering of news articles using Tweets Heterogeneous Information Network (트위트 이형 정보 망을 이용한 뉴스 기사의 사용자 지향적 클러스터링)

  • Shoaib, Muhammad;Song, Wang-Cheol
    • Journal of Internet Computing and Services / v.14 no.6 / pp.85-94 / 2013
  • With the emergence of the World Wide Web, and Web 2.0 in particular, the rapidly growing volume of news articles makes it hard for users to select articles that meet their needs. Various clustering mechanisms have been proposed to categorize news articles broadly. However, these techniques are entirely machine-oriented and leave users out of the decisions about cluster membership. To address this lack of participation, we propose a framework that clusters news articles by combining the articles with the judgments users post about them on Twitter, using Twitter hashtags for this purpose. Furthermore, we compute each user's credibility from how frequently their tweets are retweeted, in order to improve the accuracy of the cluster membership function. To test the proposed methodology, we performed experiments on tweets posted during the 2013 general election in Pakistan. Our results confirmed our claim that using users' input yields better outcomes than ordinary clustering algorithms.
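
One way to picture the hashtag-plus-credibility idea in the abstract above is the sketch below. It is only an illustration: the credibility formula (retweets divided by the maximum retweet count) and the rule of assigning each article to its highest-scoring hashtag are assumptions, not the authors' membership function.

```python
from collections import defaultdict


def user_credibility(retweet_counts):
    """Scale each user's retweet count to [0, 1] (hypothetical normalisation)."""
    top = max(retweet_counts.values(), default=0)
    return {u: (c / top if top else 0.0) for u, c in retweet_counts.items()}


def cluster_by_hashtag(tweets, retweet_counts):
    """Group articles by the hashtag with the largest credibility-weighted vote.

    tweets: iterable of (user, article_id, hashtag) triples.
    """
    cred = user_credibility(retweet_counts)
    scores = defaultdict(lambda: defaultdict(float))
    for user, article, tag in tweets:
        scores[article][tag] += cred.get(user, 0.0)

    clusters = defaultdict(list)
    for article, tag_scores in scores.items():
        # Sorting first makes ties resolve alphabetically, so runs are repeatable.
        best = max(sorted(tag_scores), key=tag_scores.get)
        clusters[best].append(article)
    return dict(clusters)
```

A tweet from a frequently retweeted user therefore counts for more than one from a user whose tweets are rarely shared.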

Analysis of News Articles on Child Welfare Policies in South Korea: K-Means Clustering (대한민국 정권별 아동복지정책 관련 뉴스 기사 분석: K-평균 군집 분석)

  • Kim, Eun Joo;Kim, Seong Kwang;Park, Bit Na
    • Journal of East-West Nursing Research / v.29 no.2 / pp.185-195 / 2023
  • Purpose: This study analyzes changes in child welfare policies and provides insights based on the collection and classification of newspaper articles. Methods: Articles related to child welfare policies were collected from 1990, during the Kim, Young-sam administration, to May 9, 2022, under the Moon, Jae-in administration. K-Means clustering and keyword Term Frequency-Inverse Document Frequency (TF-IDF) analysis were used to cluster and analyze newspaper articles with similar themes. Results: The administrations of Kim, Young-sam, Kim, Dae-jung, Roh, Moo-hyun, and Park, Geun-hye were classified into two clusters, and the Lee, Myung-bak and Moon, Jae-in administrations were classified into three clusters. Conclusion: South Korea's child welfare policies have focused on ensuring the safety and healthy development of children through diverse policy initiatives over the years. However, challenges related to child protection and child abuse persist, which requires additional resources and budget allocation. It is important to establish a comprehensive support system for children and families, including comprehensive nursing support.
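
The keyword-weighting half of the method above can be sketched in a few lines. This is a plain TF-IDF (relative term frequency times log of inverse document frequency), which is one common variant and not necessarily the one the study used; the K-Means step that would consume these weights is left out.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. The weight is (count / doc length) * ln(N / df),
    an assumed variant; other smoothing schemes exist.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weighted
```

A term that appears in every article gets weight zero, which is why such weights help surface the keywords that distinguish one group of articles from another.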

A Study on an Effective Event Detection Method for Event-Focused News Summarization (사건중심 뉴스기사 자동요약을 위한 사건탐지 기법에 관한 연구)

  • Chung, Young-Mee;Kim, Yong-Kwang
    • Journal of the Korean Society for Information Management / v.25 no.4 / pp.227-243 / 2008
  • This study investigates an event detection method with the aim of generating an event-focused news summary from a set of news articles on a certain event using a multi-document summarization technique. The event detection method first classifies news articles into event-related topic categories with an SVM classifier, and then creates event clusters, each containing the news articles on one event, with a modified single-pass clustering algorithm. The clustering algorithm applies a time penalty function as well as cluster partitioning to enhance clustering performance. The proposed event detection method showed satisfactory performance in terms of both the F-measure and the detection cost.
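
To make the single-pass idea with a time penalty concrete, here is a minimal sketch. The linear penalty, the 7-day window, and the 0.5 threshold are all assumptions chosen for illustration; the abstract does not give the paper's penalty function, and cluster partitioning is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold=0.5, window_days=7.0):
    """Assign each document to a cluster in one pass over the stream.

    docs: list of (day, {term: weight}) sorted by day.
    Returns the cluster index chosen for each document.
    """
    clusters = []  # each cluster: {"sum": term-weight sums, "last_day": float}
    labels = []
    for day, vec in docs:
        best, best_sim = None, threshold
        for i, c in enumerate(clusters):
            # Hypothetical penalty: similarity decays linearly to zero
            # as the gap since the cluster's last article approaches the window.
            penalty = max(0.0, 1.0 - (day - c["last_day"]) / window_days)
            sim = cosine(vec, c["sum"]) * penalty
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # Nothing is similar and recent enough: start a new event cluster.
            clusters.append({"sum": dict(vec), "last_day": day})
            best = len(clusters) - 1
        else:
            c = clusters[best]
            for t, w in vec.items():
                c["sum"][t] = c["sum"].get(t, 0.0) + w
            c["last_day"] = day
        labels.append(best)
    return labels
```

With the penalty in place, two articles with identical wording but published a month apart land in different event clusters.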

Contextual Advertisement System based on Document Clustering (문서 클러스터링을 이용한 문맥 광고 시스템)

  • Lee, Dong-Kwang;Kang, In-Ho;An, Dong-Un
    • The KIPS Transactions:PartB / v.15B no.1 / pp.73-80 / 2008
  • In this paper, we propose an advertisement-keyword finding method that uses document clustering to address problems caused by ambiguous words and incorrect identification of main keywords. News articles that have similar content and the same advertisement-keywords are clustered to construct the contextual information of those advertisement-keywords. In addition to news articles, the web page and summary of a product are also used to construct the contextual information. A given document is classified into one of the news article clusters, and the advertisement-keywords relevant to that cluster are then used to identify keywords in the document. The proposed method achieved a 21% improvement in precision.

A Heuristic Method of In-situ Drought Using Mass Media Information

  • Lee, Jiwan;Kim, Seong-Joon
    • Proceedings of the Korea Water Resources Association Conference / 2020.06a / pp.168-168 / 2020
  • This study evaluates the characteristics of drought-related big data published in South Korea by developing a crawler. Drought-related articles posted over 5 years (2013 ~ 2017) were collected from the Korean internet search engine 'NAVER', which covers 13 main and 81 local daily newspapers. Over this period, the crawler found a total of 40,219 news articles containing the word 'drought'. Quality control was applied to filter out homonymous uses of 'drought' that are common in South Korea, such as a goal drought in soccer, a money drought in economics, and a policy drought in politics; 47.8 % of the articles were filtered out. The remaining 20,999 (52.2 %) drought news articles were classified into four categories, water deficit (WD), water security and support (WSS), economic damage and impact (EDI), and environmental and sanitation impact (ESI), using 27, 15, 13, and 18 drought-related keywords respectively. WD, WSS, EDI, and ESI accounted for 41.4 %, 34.5 %, 14.8 %, and 9.3 % respectively. Drought articles were posted most often in June 2015 and June 2017, with 22.7 % (15,097) and 15.9 % (10,619) respectively. The drought news articles were compared spatiotemporally with the calculated SPI (Standardized Precipitation Index) and RDI (Reservoir Drought Index). They were grouped by the administrative boundaries of 8 main cities and 9 provinces in South Korea, because drought response operates at the local government level. Space-time clustering was used to examine how strongly the news articles (WD, WSS, EDI, and ESI) correlate with the indices (SPI and RDI). Spatiotemporal cluster detection was performed with the SaTScan software (Kulldorff, 2015). Retrospective and prospective cluster analyses were conducted for past and present periods to assess how intense the clusters are. News articles in the WD, WSS, and EDI categories showed strong clusters in provinces, and ESI articles in cities.
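
The keyword-based sorting into the four categories could look roughly like the sketch below. The keyword sets are invented placeholders (the abstract gives only their sizes, 27, 15, 13, and 18), and picking the category with the most keyword hits is an assumed rule.

```python
# Hypothetical keyword lists; the study's actual keywords are not in the abstract.
CATEGORY_KEYWORDS = {
    "WD": {"rainfall", "reservoir", "shortage"},
    "WSS": {"supply", "relief", "support"},
    "EDI": {"crop", "damage", "price"},
    "ESI": {"algae", "sanitation", "odor"},
}


def classify(tokens):
    """Return the category with the most keyword hits, or None if there are none."""
    hits = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)  # ties resolve in dictionary order
    return best if hits[best] > 0 else None
```

An article that mentions 'drought' only in a sports sense matches none of the lists and returns `None`, which mirrors the quality-control filtering step in spirit.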


User-Perspective Issue Clustering Using Multi-Layered Two-Mode Network Analysis (다계층 이원 네트워크를 활용한 사용자 관점의 이슈 클러스터링)

  • Kim, Jieun;Kim, Namgyu;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.93-107 / 2014
  • In this paper, we report what we have observed with regard to user-perspective issue clustering based on multi-layered two-mode network analysis. This work is significant in the context of data collection by companies about customer needs. Most companies have failed to uncover such needs for products or services properly in terms of demographic data such as age, income levels, and purchase history. Because of excessive reliance on limited internal data, most recommendation systems do not provide decision makers with appropriate business information for current business circumstances. However, part of the problem is the increasing regulation of personal data gathering and privacy. This makes demographic or transaction data collection more difficult, and is a significant hurdle for traditional recommendation approaches because these systems demand a great deal of personal data or transaction logs. Our motivation for presenting this paper to academia is our strong belief, and evidence, that most customers' requirements for products can be effectively and efficiently analyzed from unstructured textual data such as Internet news text. In order to derive users' requirements from textual data obtained online, the proposed approach in this paper attempts to construct double two-mode networks, such as a user-news network and news-issue network, and to integrate these into one quasi-network as the input for issue clustering. One of the contributions of this research is the development of a methodology utilizing enormous amounts of unstructured textual data for user-oriented issue clustering by leveraging existing text mining and social network analysis. In order to build multi-layered two-mode networks of news logs, we need some tools such as text mining and topic analysis. We used not only SAS Enterprise Miner 12.1, which provides a text miner module and cluster module for textual data analysis, but also NetMiner 4 for network visualization and analysis. 
Our approach for user-perspective issue clustering is composed of six main phases: crawling, topic analysis, access pattern analysis, network merging, network conversion, and clustering. In the first phase, we collect visit logs for news sites with a crawler. After gathering unstructured news article data, the topic analysis phase extracts issues from each news article in order to build an article-issue network. For simplicity, 100 topics are extracted from 13,652 articles. In the third phase, a user-article network is constructed with access patterns derived from web transaction logs. The double two-mode networks are then merged into a user-issue quasi-network. Finally, in the user-oriented issue-clustering phase, we classify issues through structural equivalence, and compare these with the clustering results from statistical tools and network analysis. An experiment with a large dataset was performed to build a multi-layer two-mode network. After that, we compared the results of issue clustering from SAS with those of network analysis. The experimental dataset came from a web site ranking site and from the biggest portal site in Korea. The sample dataset contains 150 million transaction logs and 13,652 news articles from 5,000 panel members over one year. User-article and article-issue networks are constructed and merged into a user-issue quasi-network using NetMiner. Our issue-clustering results, obtained with the Partitioning Around Medoids (PAM) algorithm and Multidimensional Scaling (MDS), are consistent with the results from SAS clustering. In spite of extensive efforts to provide user information with recommendation systems, most projects are successful only when companies have sufficient data about users and transactions. Our proposed methodology, user-perspective issue clustering, can provide practical support to decision-making in companies because it enhances user-related data from unstructured textual data.
To overcome the problem of insufficient data from traditional approaches, our methodology infers customers' real interests by utilizing web transaction logs. In addition, we suggest topic analysis and issue clustering as a practical means of issue identification.
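
The network-merging phase described above can be pictured as combining a user-article network with an article-issue network through their shared article nodes. The sketch below does this as a sparse matrix product; that choice is an assumption, since the abstract does not say how NetMiner performs the merge.

```python
from collections import defaultdict


def merge_two_mode(user_article, article_issue):
    """Merge two two-mode networks into a user-issue quasi-network.

    user_article: {user: {article: visit_count}}
    article_issue: {article: {issue: topic_weight}}
    Each user-issue tie is the sum of visits * topic weight over shared articles.
    """
    user_issue = defaultdict(lambda: defaultdict(float))
    for user, articles in user_article.items():
        for article, visits in articles.items():
            # Articles with no extracted issue contribute nothing.
            for issue, weight in article_issue.get(article, {}).items():
                user_issue[user][issue] += visits * weight
    return {user: dict(issues) for user, issues in user_issue.items()}
```

The resulting user-issue weights are the kind of input that a method such as PAM could then cluster.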

Table based Single Pass Algorithm for Clustering News Articles

  • Jo, Tae-Ho
    • International Journal of Fuzzy Logic and Intelligent Systems / v.8 no.3 / pp.231-237 / 2008
  • This research proposes a modified version of the single-pass algorithm specialized for text clustering. Encoding documents as numerical vectors for the traditional single-pass algorithm causes two main problems: huge dimensionality and sparse distribution. To address these problems, this research modifies the algorithm so that documents are encoded not as numerical vectors but in another form: documents are mapped into tables, and an operation on two tables is defined for use by the single-pass algorithm. The goal of this research is to improve the performance of the single-pass algorithm for text clustering through this specialized version.
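
As a rough picture of what encoding documents as tables might mean, the sketch below stores each document as a word-to-frequency table and compares two tables directly, without building a fixed-length vector. The similarity formula is a hypothetical stand-in; the abstract does not define the paper's table operation.

```python
from collections import Counter


def to_table(tokens):
    """Encode a document as a table mapping each word to its frequency."""
    return Counter(tokens)


def table_similarity(t1, t2):
    """Weight of the shared words divided by the total weight of both tables.

    Hypothetical operation for illustration; returns a value in [0, 1].
    """
    total = sum(t1.values()) + sum(t2.values())
    if total == 0:
        return 0.0
    shared = sum(t1[w] + t2[w] for w in t1.keys() & t2.keys())
    return shared / total
```

Because only the words present in a document are stored, the representation sidesteps the dimensionality and sparsity problems the abstract attributes to numerical vectors.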

Arabic Stock News Sentiments Using the Bidirectional Encoder Representations from Transformers Model

  • Eman Alasmari;Mohamed Hamdy;Khaled H. Alyoubi;Fahd Saleh Alotaibi
    • International Journal of Computer Science & Network Security / v.24 no.2 / pp.113-123 / 2024
  • Stock market news sentiment analysis (SA) aims to identify the attitudes toward companies' stocks expressed in stock news on official platforms. It supports sound investment decisions and analysts' evaluations. However, research on Arabic SA is limited compared to that on English SA because of the complexity of the Arabic language and its limited corpora. This paper develops a sentiment classification model to predict the polarity of Arabic stock news in microblogs. It also aims to extract the reasons behind the polarity categorization, as the main economic causes or aspects, based on semantic unity. To this end, the paper presents an Arabic SA approach based on the logistic regression model and the Bidirectional Encoder Representations from Transformers (BERT) model. The proposed model classifies articles as positive, negative, or neutral. It was trained on data collected from an official Saudi stock market article platform, which was then preprocessed and labeled. Moreover, the economic reasons for the articles based on semantic unit, divided into seven economic aspects to highlight the polarity of the articles, were investigated. The supervised BERT model obtained 88% article classification accuracy based on SA, and the unsupervised mean Word2Vec encoder obtained 80% economic-aspect clustering accuracy. Predicting the polarity of Arabic stock market news and its economic reasons would provide valuable benefits to the stock SA field.

TRIB : A Clustering and Visualization System for Responding Comments on Blogs (TRIB: 블로그 댓글 분류 및 시각화 시스템)

  • Lee, Yun-Jung;Ji, Jung-Hoon;Woo, Gyun;Cho, Hwan-Gue
    • The KIPS Transactions:PartD / v.16D no.5 / pp.817-824 / 2009
  • In recent years, weblogs have become one of the most typical social media for citizens to share their opinions, and many weblogs reflect social issues. Many internet users actively express their opinions on internet news or weblog articles through reply comments in online communities, so it is easy to find blog posts with more than 10 thousand reply comments. Because most weblog systems display articles and their comments as a sequential list, it is hard to search and explore useful messages. In this paper, we propose a visualizing and clustering system called TRIB (Telescope for Responding comments for Internet Blog) for large sets of comments responding to a weblog article. TRIB clusters and visualizes the comments according to their content using a pre-defined user dictionary, and provides various personalized views that reflect users' interests. To show the usefulness of TRIB, we conducted experiments on its clustering and visualizing capabilities with articles that have more than 1,000 comments.