• Title/Summary/Keyword: Text clustering

A Study on Differences of Contents and Tones of Arguments among Newspapers Using Text Mining Analysis (텍스트 마이닝을 활용한 신문사에 따른 내용 및 논조 차이점 분석)

  • Kam, Miah;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.18 no.3
    • /
    • pp.53-77
    • /
    • 2012
  • This study analyses the differences in contents and tones of arguments among three major Korean newspapers: the Kyunghyang Shinmun, the HanKyoreh, and the Dong-A Ilbo. It is commonly accepted that newspapers in Korea explicitly deliver their own tone of argument when they cover sensitive issues and topics. This can be problematic if readers consume the news without being aware of a newspaper's tone of argument, because the contents and tones of arguments can easily influence readers. Thus it is very desirable to have a new tool that can inform readers of what tone of argument a newspaper has. This study presents the results of clustering and classification techniques as part of a text mining analysis. We focus on six main subjects, Culture, Politics, International, Editorial-opinion, Eco-business and National issues, and attempt to identify differences and similarities among the newspapers. The basic unit of the text mining analysis is a paragraph of a news article. The study uses a keyword-network analysis tool and visualizes relationships among keywords to make the differences easier to see. Newspaper articles were gathered from KINDS, the Korean integrated news database system, which preserves news articles of the Kyunghyang Shinmun, the HanKyoreh and the Dong-A Ilbo and makes them open to the public. About 3,030 articles from 2008 to 2012 were used. The International, National issues and Politics sections were gathered around specific issues: the International section with the keyword 'Nuclear weapon of North Korea,' the National issues section with the keyword '4-major-river,' and the Politics section with the keyword 'Tonghap-Jinbo Dang.' All of the articles from April 2012 to May 2012 in the Eco-business, Culture and Editorial-opinion sections were also collected. All of the collected data were edited into paragraphs. We removed stop-words using the Lucene Korean Module, calculated keyword co-occurrence counts from the paired co-occurrence list of keywords in each paragraph, and built a co-occurrence matrix from the list. Once the co-occurrence matrix was built, we used the cosine coefficient matrix as input for PFNet (Pathfinder Network). In order to analyze the three newspapers and find the significant keywords in each paper, we analyzed the list of the 10 highest-frequency keywords and the keyword-networks of the 20 highest-frequency keywords to closely examine the relationships and draw a detailed network map among keywords. We used the NodeXL software to visualize the PFNet. After drawing all the networks, we compared the results with the classification results. Classification was first performed to identify how the tone of argument of each newspaper differs from the others. To analyze tones of arguments, all the paragraphs were divided into two types of tones, Positive and Negative. To identify and classify the tones of all the paragraphs and articles we had collected, a supervised learning technique was used: the Naïve Bayesian classifier provided in the MALLET package classified all the paragraphs in the articles. After classification, Precision, Recall and F-value were used to evaluate the classification results. 
Based on the results of this study, three subjects, Culture, Eco-business and Politics, showed differences in contents and tones of arguments among the three newspapers. In addition, for the National issues, the tones of arguments on the 4-major-rivers project differed from each other, which suggests the three newspapers each have their own specific tone of argument in those sections. The keyword-networks also showed different shapes from each other for the same period in the same section, meaning that the frequently appearing keywords differ and the contents are composed of different keywords. The Positive-Negative classification showed that a newspaper's tone of argument can be distinguished from the others'. These results indicate that the approach in this study is a promising basis for a new tool to identify the different tones of arguments of newspapers.
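
As a rough illustration of the co-occurrence step described above (paragraph-level keyword pairs counted into a matrix, then converted to a cosine coefficient matrix before Pathfinder network scaling), a minimal Python sketch follows. The sample paragraphs and keywords are hypothetical, and tokenization, Korean stop-word removal via the Lucene Korean Module, and the PFNet/NodeXL steps are omitted.

```python
from itertools import combinations
import numpy as np

def cooccurrence_matrix(paragraphs, vocab):
    """Count pairwise keyword co-occurrences within each paragraph."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for tokens in paragraphs:                      # a paragraph = a list of keywords
        present = sorted({t for t in tokens if t in idx})
        for a, b in combinations(present, 2):
            M[idx[a], idx[b]] += 1
            M[idx[b], idx[a]] += 1
    return M

def cosine_matrix(M):
    """Convert raw co-occurrence counts into a cosine coefficient matrix."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # avoid division by zero
    return (M @ M.T) / (norms * norms.T)

# Toy usage with hypothetical, pre-tokenized paragraphs.
paragraphs = [["nuclear", "north_korea", "sanction"],
              ["river", "project", "budget"],
              ["nuclear", "sanction", "talks"]]
vocab = sorted({w for p in paragraphs for w in p})
C = cosine_matrix(cooccurrence_matrix(paragraphs, vocab))
print(np.round(C, 2))
```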

Citizen Sentiment Analysis of the Social Disaster by Using Opinion Mining (오피니언 마이닝 기법을 이용한 사회적 재난의 시민 감성도 분석)

  • Seo, Min Song;Yoo, Hwan Hee
    • Journal of Korean Society for Geospatial Information Science
    • /
    • v.25 no.1
    • /
    • pp.37-46
    • /
    • 2017
  • Recently, disasters caused by social factors have been occurring frequently in Korea. It is difficult to predict what crisis may happen, which raises citizens' concern. In this study, we developed a program to acquire tweet data using the Python-based Tweepy library, for social disasters such as the 'Nonspecific motive crimes' and the 'Oxy' products case. After natural language processing, these data were used to evaluate the psychological trauma and anxiety of citizens through text clustering analysis and opinion mining analysis in R Studio. In the analysis of the 'Oxy' case, the Sewol ferry accident and the continued sale of Oxy products had the highest similarity; in the 'Nonspecific motive crimes' case, the government's coping measures against unexpected incidents such as the screen-door incident, the Sewol ferry accident, and the nonspecific-motive crime due to misogyny in Busan had the highest similarity. In addition, the average citizen sentiment score for the nonspecific motive crimes was 11.61 percentage points more negative than that for the Oxy case. Therefore, it is expected that the findings will be utilized to predict the mental health of citizens and to prevent future accidents.
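
A minimal sketch of the dictionary-based sentiment scoring that underlies this kind of opinion mining is given below. It is written in Python for illustration (the study itself used R Studio for the analysis), and the polarity lexicon and the tokenized tweets are hypothetical.

```python
# Hypothetical polarity lexicon; a real analysis would use a Korean sentiment dictionary.
POLARITY = {"fear": -1, "anger": -1, "unsafe": -1, "relief": 1, "trust": 1}

def sentiment_score(tokens, lexicon=POLARITY):
    """Average polarity of the lexicon words appearing in one tokenized tweet."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def average_index(tweets):
    """Mean sentiment over a collection of tokenized tweets, on a percent scale."""
    scores = [sentiment_score(t) for t in tweets]
    return 100.0 * sum(scores) / len(scores)

oxy_tweets = [["anger", "unsafe"], ["relief"]]          # toy tokenized tweets
crime_tweets = [["fear", "anger"], ["fear", "unsafe"]]
diff = average_index(crime_tweets) - average_index(oxy_tweets)
print(f"difference: {diff:.2f} percentage points")
```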

A Study on the Cerber-Type Ransomware Detection Model Using Opcode and API Frequency and Correlation Coefficient (Opcode와 API의 빈도수와 상관계수를 활용한 Cerber형 랜섬웨어 탐지모델에 관한 연구)

  • Lee, Gye-Hyeok;Hwang, Min-Chae;Hyun, Dong-Yeop;Ku, Young-In;Yoo, Dong-Young
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.10
    • /
    • pp.363-372
    • /
    • 2022
  • Since the recent COVID-19 pandemic, the spread of ransomware has intensified along with the expansion of remote work. Anti-virus vendors are trying to respond to ransomware, but traditional file signature-based static analysis can be neutralized by diversification, obfuscation, variants, or the emergence of new ransomware. Various studies are being conducted on ransomware detection, and detection studies using signature-based static analysis and behavior-based dynamic analysis are the main research types at present. In this paper, the frequencies of ".text" section opcodes and of the Native APIs actually used were extracted, and the associations among the feature information selected using the K-means clustering algorithm, cosine similarity, and the Pearson correlation coefficient were analyzed. In addition, through experiments that classified and detected Cerber-type ransomware against other malware types such as worms, it was verified that the selected feature information is specialized for detecting the specific ransomware (Cerber). When the finally selected feature information was combined, applied to machine learning, and tuned with hyperparameter optimization, the detection rate reached up to 93.3%.
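
The feature-selection idea in the abstract, frequency vectors of opcodes and native API calls screened by Pearson correlation and then grouped by K-means, can be sketched roughly as below. The frequency data, labels, and threshold are synthetic placeholders, not the paper's dataset or parameters.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: rows are samples (ransomware / other malware),
# columns are ".text" opcode and Native API frequency counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=5, size=(40, 12)).astype(float)
y = np.array([1] * 20 + [0] * 20)   # 1 = Cerber-like, 0 = other malware (toy labels)

# Feature screening: keep columns whose Pearson correlation with the label
# exceeds a threshold (a stand-in for the paper's correlation analysis).
keep = [j for j in range(X.shape[1]) if abs(pearsonr(X[:, j], y)[0]) > 0.2]
X_sel = X[:, keep] if keep else X

# Unsupervised grouping of the selected feature vectors, as a rough
# counterpart of the K-means clustering step described in the abstract.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sel)
print("selected feature columns:", keep)
print("cluster sizes:", np.bincount(labels))
```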

A Model for Evaluating Technology Importance of Patents under Incomplete Citation (불완전 인용정보 하에서의 특허의 기술적 중요도 평가 모형)

  • Kim, Heon;Baek, Dong-Hyun;Shin, Min-Ju;Han, Dong-Seok
    • Journal of Intelligence and Information Systems
    • /
    • v.14 no.2
    • /
    • pp.121-136
    • /
    • 2008
  • Although domestic research funding organizations require patented technologies as an outcome of financial aid, they have much difficulty evaluating the qualitative value of the patented technology due to the lack of systematic methods. In particular, because citation data is not essential to a patent application in Korea, it is very difficult to evaluate a patent using such incomplete citation data. This study proposes a method for evaluating the technology importance of a patent when there is no or insufficient citation data. The technology importance of a patent can be evaluated objectively and quantitatively by the proposed method, which consists of five steps: selection of a target patent, collection of related patents, preparation of keyword vectors, clustering of patents, and technology importance assessment. The method was applied to a patent on a 'user identification method for payment using mobile terminal' in order to evaluate its technology importance and demonstrate how the method works.
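
The five-step procedure can be illustrated with a small Python sketch: TF-IDF keyword vectors for a target patent and its related patents, K-means clustering, and a simple centrality-based proxy for technological importance. The patent texts and the importance proxy are assumptions made for illustration, not the scoring rule defined in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical patent abstracts; the target patent is the first entry.
patents = [
    "user identification method for payment using mobile terminal",
    "mobile payment authentication with one time password",
    "battery charging circuit for electric vehicles",
    "secure user authentication on a mobile device for payment",
]

# Step 3: keyword (TF-IDF) vectors; Step 4: cluster the related patents.
vec = TfidfVectorizer()
X = vec.fit_transform(patents)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 5 (one possible proxy): how central the target patent is within
# its own cluster, measured by average cosine similarity.
target = 0
same = [i for i, c in enumerate(clusters) if c == clusters[target] and i != target]
score = cosine_similarity(X[target], X[same]).mean() if same else 0.0
print("cluster labels:", clusters, "importance proxy:", round(score, 3))
```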

An Efficient Two-Level Hybrid Signature File Method for Large Text Databases (대용량 텍스트 데이터베이스를 위한 효율적인 2단계 합성 요약 화일 방법)

  • Yoo, Jae-Soo;Gang, Hyeong-Il
    • The Transactions of the Korea Information Processing Society
    • /
    • v.4 no.4
    • /
    • pp.923-932
    • /
    • 1997
  • In this paper, we propose a two-level hybrid signature file method (THM) to efficiently deal with large text databases, using a term discrimination concept. In addition, we apply Yoo's clustering scheme to the two-level hybrid signature file method. The clustering scheme groups similar signatures together according to the similarity of the highly discriminatory terms so that better retrieval performance can be achieved. A space-time analytical model of the proposed two-level hybrid method is provided. Based on the analytical model and experiments, we compare it with the existing methods, i.e., the bit-sliced method (BM), the two-level method (TM), and the hybrid method (HM). As a result, we show that THM achieves the best retrieval performance in a large database with 100,000 records when the number of matching records is less than 160.
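
For readers unfamiliar with signature files, the sketch below shows the basic superimposed-coding idea that such methods build on: each term sets a few bits in a fixed-length bit vector, and a query signature filters out non-matching records (with possible false positives). It is a generic illustration, not the paper's two-level hybrid structure or clustering scheme.

```python
import hashlib

SIG_BITS, BITS_PER_TERM = 64, 3

def term_bits(term):
    """Map a term to a few bit positions (superimposed coding)."""
    h = hashlib.md5(term.encode()).digest()
    return {h[i] % SIG_BITS for i in range(BITS_PER_TERM)}

def signature(terms):
    """OR the bit patterns of all terms of a record into one signature."""
    sig = 0
    for t in terms:
        for b in term_bits(t):
            sig |= 1 << b
    return sig

def may_match(record_sig, query_terms):
    """Signature test: all query bits must be set; false positives are possible."""
    q = signature(query_terms)
    return record_sig & q == q

records = [["text", "database", "signature"], ["image", "retrieval"]]
sigs = [signature(r) for r in records]
print([may_match(s, ["signature", "database"]) for s in sigs])
```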

An Efficient Slant Correction for Handwritten Hangul Strings using Structural Properties (한글필기체의 구조적 특징을 이용한 효율적 기울기 보정)

  • 유대근;김경환
    • Journal of KIISE:Software and Applications
    • /
    • v.30 no.1_2
    • /
    • pp.93-102
    • /
    • 2003
  • A slant correction method for handwritten Korean strings based on an analysis of stroke distribution, which effectively reflects the structural properties of Korean characters, is presented in this paper. The method aims to deal with typical problems that have frequently been observed when slant correction of handwritten Korean strings is performed with conventional approaches developed for English/European languages. Strokes extracted from a text-line image are classified into two clusters by applying K-means clustering. Gaussian modeling is applied to each of the clusters, and the slant angle is estimated from the model that represents the vertical strokes. Experimental results support the effectiveness of the proposed method. For the performance comparison, 1,300 handwritten address string images were used, and the results show that the proposed method outperforms other conventional approaches.
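
The core estimation step, clustering stroke angles into two groups, fitting a Gaussian to each, and reading the slant off the vertical-stroke cluster, might look roughly like the Python sketch below. The stroke angles are made-up sample values; stroke extraction from the image and the final shear transform are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stroke angles (degrees from the horizontal) extracted from a
# text-line image: a near-vertical group and a near-horizontal group.
angles = np.array([84, 86, 88, 82, 85, 5, 8, 3, 7, 80, 87, 6]).reshape(-1, 1)

# Split the strokes into two clusters, as in the K-means step of the abstract.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(angles)

# Fit a Gaussian (mean, std) to each cluster and take the cluster that
# represents the vertical strokes, i.e. the mean closest to 90 degrees.
means = [angles[km.labels_ == c].mean() for c in range(2)]
stds = [angles[km.labels_ == c].std() for c in range(2)]
vertical = int(np.argmin([abs(90 - m) for m in means]))
slant = 90 - means[vertical]   # estimated slant to be removed by shearing
print(f"vertical-stroke mean {means[vertical]:.1f} deg, slant {slant:.1f} deg")
```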

Study on Designing and Implementing Online Customer Analysis System based on Relational and Multi-dimensional Model (관계형 다차원모델에 기반한 온라인 고객리뷰 분석시스템의 설계 및 구현)

  • Kim, Keun-Hyung;Song, Wang-Chul
    • The Journal of the Korea Contents Association
    • /
    • v.12 no.4
    • /
    • pp.76-85
    • /
    • 2012
  • Through opinion mining, we can analyze the degree of positive or negative sentiment that customers feel about important entities or attributes in online customer reviews. However, existing opinion mining techniques provide only simple functions for analyzing the reviews. In this paper, we propose a novel technique that can analyze online customer reviews multi-dimensionally by modifying existing OLAP techniques so that they can be applied to text data. The technique, a multi-dimensional analysis model, consists of noun, adjective and document axes, which are converted into four relational tables in a relational database. The multi-dimensional analysis model can serve as a new framework that brings together existing opinion mining, information summarization and clustering algorithms. We implemented the multi-dimensional analysis model and its algorithms, and found that the system enables online customer reviews to be analyzed in more complex ways.
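
One plausible way to realize the noun/adjective/document axes as four relational tables is sketched below with SQLite from Python; the schema, sample rows, and the roll-up query are illustrative assumptions rather than the paper's actual design.

```python
import sqlite3

# Hypothetical relational layout for the noun / adjective / document axes.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE noun(noun_id INTEGER PRIMARY KEY, noun TEXT);
CREATE TABLE adjective(adj_id INTEGER PRIMARY KEY, adjective TEXT, polarity INTEGER);
CREATE TABLE document(doc_id INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE opinion(doc_id INTEGER, noun_id INTEGER, adj_id INTEGER);
""")
con.executemany("INSERT INTO noun VALUES(?,?)", [(1, "battery"), (2, "screen")])
con.executemany("INSERT INTO adjective VALUES(?,?,?)",
                [(1, "long", 1), (2, "dim", -1)])
con.executemany("INSERT INTO document VALUES(?,?)", [(1, "phoneA"), (2, "phoneB")])
con.executemany("INSERT INTO opinion VALUES(?,?,?)",
                [(1, 1, 1), (1, 2, 2), (2, 1, 1)])

# OLAP-style roll-up: average sentiment per product per attribute (noun).
rows = con.execute("""
SELECT d.product, n.noun, AVG(a.polarity)
FROM opinion o JOIN document d ON o.doc_id = d.doc_id
JOIN noun n ON o.noun_id = n.noun_id
JOIN adjective a ON o.adj_id = a.adj_id
GROUP BY d.product, n.noun
""").fetchall()
print(rows)
```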

Effective Streaming of XML Data for Wireless Broadcasting (무선 방송을 위한 효과적인 XML 스트리밍)

  • Park, Jun-Pyo;Park, Chang-Sup;Chung, Yon-Dohn
    • Journal of KIISE:Databases
    • /
    • v.36 no.1
    • /
    • pp.50-62
    • /
    • 2009
  • In wireless and mobile environments, data broadcasting is recognized as an effective way for data dissemination due to its benefits to bandwidth efficiency, energy-efficiency, and scalability. In this paper, we address the problem of delayed query processing raised by tree-based index structures in wireless broadcast environments, which increases the access time of the mobile clients. We propose a novel distributed index structure and a clustering strategy for streaming XML data which enable energy and latency-efficient broadcast of XML data. We first define the DIX node structure to implement a fully distributed index structure which contains tag name, attributes, and text content of an element as well as its corresponding indices. By exploiting the index information in the DIX node stream, a mobile client can access the wireless stream in a shorter latency. We also suggest a method of clustering DIX nodes in the stream, which can further enhance the performance of query processing over the stream in the mobile clients. Through extensive performance experiments, we demonstrate that our approach is effective for wireless broadcasting of XML data and outperforms the previous methods.
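
A toy version of a self-describing stream node with a skip pointer, in the spirit of the distributed index described above, is sketched below. The field names (subtree_tags, next_sibling) and the skipping rule are illustrative guesses, not the paper's DIX node definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class StreamNode:
    """Self-describing node in a broadcast XML stream (illustrative only)."""
    tag: str
    attributes: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    subtree_tags: frozenset = frozenset()   # tags occurring below this node
    next_sibling: Optional[int] = None      # stream offset of the next sibling

def find_by_tag(stream: List[StreamNode], start: int, tag: str) -> Optional[int]:
    """Scan forward; skip a subtree when its summary rules the tag out,
    so a mobile client could doze until the next relevant offset."""
    i = start
    while i < len(stream):
        node = stream[i]
        if node.tag == tag:
            return i
        if tag not in node.subtree_tags and node.next_sibling is not None:
            i = node.next_sibling        # skip the whole irrelevant subtree
        else:
            i += 1                       # descend into the subtree
    return None

stream = [
    StreamNode("library", subtree_tags=frozenset({"book", "title"})),
    StreamNode("book", subtree_tags=frozenset({"title"}), next_sibling=3),
    StreamNode("title", text="Databases"),
    StreamNode("book", subtree_tags=frozenset({"title"})),
]
print(find_by_tag(stream, 0, "title"))   # -> 2
```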

Modified multi-sense skip-gram using weighted context and x-means (가중 문맥벡터와 X-means 방법을 이용한 변형 다의어스킵그램)

  • Jeong, Hyunwoo;Lee, Eun Ryung
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.3
    • /
    • pp.389-399
    • /
    • 2021
  • In recent years, word embedding has been a popular field of natural language processing research, and the skip-gram has become one successful word embedding method. It assigns a word embedding vector to each word using contexts, which provides an effective way to analyze text data. However, due to the limitations of the vector space model, basic word embedding methods assume that every word has only a single meaning. Since multi-sense words, that is, words with more than one meaning, occur in real data, Neelakantan (2014) proposed the multi-sense skip-gram (MSSG), which finds embedding vectors corresponding to each sense of a multi-sense word using a clustering method. In this paper, we propose a modification of the MSSG to improve statistical accuracy. Moreover, we propose a data-adaptive choice of the number of clusters, that is, the number of meanings of a multi-sense word. Some numerical evidence is given by simulations based on real data.
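
The data-adaptive choice of the number of senses can be imitated with a simple BIC-style model selection over K-means, as in the Python sketch below; this is a stand-in for the X-means step, and the context vectors are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_by_bic(contexts, k_max=5):
    """Pick the number of sense clusters with a crude BIC-style score,
    a simple stand-in for the X-means step described in the abstract."""
    n, d = contexts.shape
    best_k, best_bic, best_labels = 1, np.inf, np.zeros(n, dtype=int)
    for k in range(1, min(k_max, n) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(contexts)
        rss = km.inertia_ + 1e-12
        bic = n * np.log(rss / n) + k * d * np.log(n)   # fit term + complexity penalty
        if bic < best_bic:
            best_k, best_bic, best_labels = k, bic, km.labels_
    return best_k, best_labels

# Hypothetical weighted context vectors for one ambiguous word: two senses.
rng = np.random.default_rng(1)
contexts = np.vstack([rng.normal(0, 0.3, (30, 8)), rng.normal(3, 0.3, (30, 8))])
k, labels = choose_k_by_bic(contexts)
print("estimated number of senses:", k)
```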

Multi-Document Summarization Method of Reviews Using Word Embedding Clustering (워드 임베딩 클러스터링을 활용한 리뷰 다중문서 요약기법)

  • Lee, Pil Won;Hwang, Yun Young;Choi, Jong Seok;Shin, Young Tae
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.535-540
    • /
    • 2021
  • A multi-document is a document consisting of various topics rather than a single topic, and a typical example is online reviews. There have been several attempts to summarize online reviews because of the vast amount of information they contain. However, summarizing reviews collectively with existing summarization models loses the various topics that make up the reviews. Therefore, in this paper, we present a method to summarize reviews with minimal loss of topics. The proposed method classifies review sentences through preprocessing, importance evaluation, embedding substitution using BERT, and embedding clustering. The classified sentences are then used to generate the final summary with a trained Transformer summarization model. The proposed model was evaluated against an existing summarization model, a seq2seq model, using cosine similarity and the ROUGE score, and it produced higher-quality summaries than the existing model.
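
The embedding-clustering stage can be sketched as below: sentence vectors are clustered by topic and one representative per topic is kept for the downstream summarizer. The embed() function here is only a placeholder for a BERT sentence encoder, and the reviews are toy examples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def embed(sentences):
    """Placeholder for a BERT sentence encoder; a random projection of
    character trigram hashes stands in, purely for illustration."""
    rng = np.random.default_rng(0)
    table = rng.normal(size=(10_000, 32))
    vecs = []
    for s in sentences:
        idx = [hash(s[i:i + 3]) % 10_000 for i in range(max(1, len(s) - 2))]
        vecs.append(table[idx].mean(axis=0))
    return np.array(vecs)

def cluster_reviews(sentences, n_topics=2):
    """Group review sentences by topic and keep one representative per topic,
    which would then be fed to a Transformer summarizer in the pipeline."""
    X = embed(sentences)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(n_topics):
        members = np.where(km.labels_ == c)[0]
        d = cosine_distances(X[members], km.cluster_centers_[c:c + 1]).ravel()
        reps.append(sentences[members[int(np.argmin(d))]])
    return reps

reviews = ["battery lasts long", "battery drains fast",
           "screen is bright", "screen too dim"]
print(cluster_reviews(reviews))
```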