• Title/Summary/Keyword: Big Data Clustering

Search Result 146, Processing Time 0.034 seconds

Technology Clustering Using Textual Information of Reference Titles in Scientific Paper (과학기술 논문의 참고문헌 텍스트 정보를 활용한 기술의 군집화)

  • Park, Inchae;Kim, Songhee;Yoon, Byungun
    • Journal of Korean Society of Industrial and Systems Engineering / v.43 no.2 / pp.25-32 / 2020
  • Patent and scientific-paper data are considered a useful information source for analyzing technological information and have been widely utilized. Technology big data is analyzed in various ways to identify the latest technological trends and predict promising future technologies. Clustering is one way to discover new features by forming groups from technology big data. Patents include refined bibliographic information such as patent classification codes, whereas scientific papers lack bibliographic information suitable for clustering. This research proposes a new approach for clustering scientific papers by utilizing the reference titles in each paper. In this approach, the reference titles are treated as textual information because each reference includes the title of the cited paper, which represents that paper's core content. We collected scientific paper data, extracted the reference titles, and conducted clustering by measuring text-based similarity. The results of the proposed approach are compared with those of two existing methodologies: one that uses textual information from titles and abstracts, and one that is citation-based. The proposed approach shows a statistically significant difference from the existing approaches and better clustering performance. The proposed approach is expected to be a useful method for clustering scientific papers.
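
The text-similarity step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the similarity measure, so cosine similarity over simple word counts is an assumption, and the paper IDs and reference titles are hypothetical.

```python
import math
from collections import Counter

def title_vector(reference_titles):
    """Bag-of-words term counts over all reference titles of one paper."""
    words = " ".join(reference_titles).lower().split()
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical papers, each represented only by the titles it cites.
papers = {
    "P1": ["patent clustering with text mining", "text mining for technology trends"],
    "P2": ["technology trends from patent text", "clustering patent documents"],
    "P3": ["protein folding simulation", "molecular dynamics of protein"],
}
vectors = {pid: title_vector(refs) for pid, refs in papers.items()}

def most_similar(pid):
    """Return the other paper whose reference titles are closest to pid's."""
    others = (q for q in vectors if q != pid)
    return max(others, key=lambda q: cosine(vectors[pid], vectors[q]))
```

The resulting pairwise similarities could then be passed to any clustering algorithm; which one the paper uses is not stated in the abstract.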

Evaluating Conversion Rate from Advertising in Social Media using Big Data Clustering

  • Alyoubi, Khaled H.;Alotaibi, Fahd S.
    • International Journal of Computer Science & Network Security / v.21 no.7 / pp.305-316 / 2021
  • The objective is to identify the best opportunities in targeted display advertising: showing a banner ad to the online consumer who is most likely to take a desired action, such as signing up for a newsletter or buying a product. Finding the best ad impression, that is, the opportunity to show an advertisement to a consumer, requires the ability to estimate the probability that a consumer who sees the advertisement in the browser will take that action, that is, that the consumer will convert. However, estimating conversion probability is a demanding task because data are growing rapidly across many information dimensions and conversion events occur infrequently. Retailers and manufacturers make extensive use of internet retail services as part of a multichannel distribution and promotion strategy. The rate at which website visitors convert to customers is low for online retail, resulting in high customer acquisition costs. Approximately 96 percent of website visits end without a purchase [1]. This kind of conversion rate is collected from advertising on social media sites and pages, and the dataset must be estimated and assessed using big data clustering, which is used to group people of a particular age group along with their behavior. This helps identify the right consumers for a product, which improves the profitability of the business.

Customer Load Pattern Analysis using Clustering Techniques (클러스터링 기법을 이용한 수용가별 전력 데이터 패턴 분석)

  • Ryu, Seunghyoung;Kim, Hongseok;Oh, Doeun;No, Jaekoo
    • KEPCO Journal on Electric Power and Energy / v.2 no.1 / pp.61-69 / 2016
  • Understanding load patterns and classifying customers is a basic step in analyzing the behavior of electricity consumers. To that end, there have been many studies on clustering customers' daily load data. The deployment of advanced metering infrastructure (AMI) and big-data technologies now makes it easier to study customers' load data. In this paper, we study load clustering from the viewpoint of yearly and daily load patterns. We compare four clustering methods: K-means clustering, hierarchical clustering (average linkage and Ward's method), and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). We also discuss the relationship between the clustering results and the Korean Standard Industrial Classification (KSIC), which is one possible label for customers' load data. We find that hierarchical clustering with Ward's method is suitable for clustering load data, and that KSIC is characterized well by the daily load pattern but not by the yearly load pattern.
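
As a rough illustration of the method this abstract finds most suitable, the sketch below applies agglomerative clustering with Ward's merge criterion to a few daily load profiles. The profiles are hypothetical, and scaling each profile by its peak is an assumption made here so that shape rather than magnitude is compared; the abstract does not describe the paper's preprocessing.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    centroid_a = [sum(col) / len(a) for col in zip(*a)]
    centroid_b = [sum(col) / len(b) for col in zip(*b)]
    dist2 = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clustering(profiles, k):
    """Greedily merge the cheapest pair of clusters until k clusters remain."""
    clusters = [[p] for p in profiles]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid after the pop
    return clusters

def normalize(profile):
    """Scale a load profile by its peak value (an assumption, see lead-in)."""
    peak = max(profile)
    return [v / peak for v in profile]

# Hypothetical 4-point daily load profiles (night, morning, afternoon, evening).
raw = [
    [10, 80, 90, 20],  # daytime-heavy
    [5, 45, 50, 10],   # daytime-heavy, smaller customer
    [60, 20, 25, 70],  # night/evening-heavy
    [30, 8, 10, 33],   # night/evening-heavy, smaller customer
]
normalized = [normalize(p) for p in raw]
clusters = ward_clustering(normalized, k=2)
```

With these inputs the two daytime-heavy customers end up in one cluster and the two night/evening-heavy customers in the other, despite their different magnitudes.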

Performance Comparison of Clustering Validity Indices with Business Applications (경영사례를 이용한 군집화 유효성 지수의 성능비교)

  • Lee, Soo-Hyun;Jeong, Youngseon;Kim, Jae-Yun
    • Journal of the Korean Operations Research and Management Science Society / v.41 no.2 / pp.17-33 / 2016
  • Clustering is one of the leading methods for analyzing big data and is used in many different fields. This study deals with the Clustering Validity Index (CVI), which is used to verify the effectiveness of clustering results. We compare the performance of CVIs on business applications from various fields. The CVIs compared are DU, CH, DB, SVDU, SVCH, and SVDB. The first three are well known in the existing research, and the last three are based on support vector data description. The results verify the outstanding performance and applicability of the CVIs based on support vector data description.
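
To make the idea of a validity index concrete, here is a minimal sketch of one classical CVI. It assumes that "DU" refers to the Dunn index (the abstract does not expand the abbreviation) and uses hypothetical 2-D points; the support-vector-based variants are not reproduced.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest within-cluster diameter.

    Higher is better. Assumes at least two clusters and at least one cluster
    with two or more points.
    """
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter

# The same four hypothetical points, partitioned two different ways.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]  # compact, well separated
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # stretched, overlapping
```

Here `dunn_index(good)` is larger than `dunn_index(bad)`, which is the kind of ranking a CVI is meant to provide when the true grouping is unknown.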

Improved TI-FCM Clustering Algorithm in Big Data (빅데이터에서 개선된 TI-FCM 클러스터링 알고리즘)

  • Lee, Kwang-Kyug
    • Journal of IKEEE / v.23 no.2 / pp.419-424 / 2019
  • The FCM algorithm finds the optimal solution through an iterative optimization technique. Its execution time varies with the initial cluster centers, the location of noise, and the location and number of dense regions. However, because the method updates the center points gradually, the initial cluster centers can be shifted to one side. In this paper, we propose a TI-FCM (Triangular Inequality-Fuzzy C-Means) clustering algorithm that determines the cluster center density by maximizing the distance between clusters using the triangular inequality. The proposed method converges to the real clusters more effectively than FCM, even on large data sets. Experiments show that execution time is reduced compared with the existing FCM.
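
For context, the iterative optimization of the baseline algorithm can be sketched as below. This is standard fuzzy c-means on 1-D data, not the proposed TI-FCM (whose center-selection rule is not detailed in the abstract); the data, the starting centers, and the fixed iteration count are assumptions for illustration.

```python
def memberships(points, centers, m):
    """u[i][j]: degree to which point i belongs to cluster j (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in points:
        d = [max(abs(x - c), 1e-12) for c in centers]  # guard against zero distance
        table.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return table

def fcm(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        centers = [
            sum(row[j] ** m * x for row, x in zip(u, points))
            / sum(row[j] ** m for row in u)
            for j in range(len(centers))
        ]
    return centers, memberships(points, centers, m)

# Hypothetical 1-D data with two obvious groups and a deliberately poor start.
points = [0.8, 1.0, 1.2, 8.8, 9.0, 9.2]
centers, u = fcm(points, centers=[0.0, 5.0])
```

Because every iteration recomputes all point-to-center distances, the choice of initial centers directly affects how many iterations are needed, which is the cost the abstract says TI-FCM aims to reduce.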

Image Deduplication Based on Hashing and Clustering in Cloud Storage

  • Chen, Lu;Xiang, Feng;Sun, Zhixin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.4 / pp.1448-1463 / 2021
  • With the continuous development of cloud storage, plenty of redundant data exists in cloud storage, especially multimedia data such as images and videos. Data deduplication is a data reduction technology that significantly reduces storage requirements and increases bandwidth efficiency. To ensure data security, users typically encrypt data before uploading it. However, there is a contradiction between data encryption and deduplication. Existing deduplication methods for regular files cannot be applied to image deduplication because images need to be detected based on visual content. In this paper, we propose a secure image deduplication scheme based on hashing and clustering, which combines a novel perceptual hash algorithm based on Local Binary Pattern. In this scheme, the hash value of the image is used as the fingerprint to perform deduplication, and the image is transmitted in an encrypted form. Images are clustered to reduce the time complexity of deduplication. The proposed scheme can ensure the security of images and improve deduplication accuracy. The comparison with other image deduplication schemes demonstrates that our scheme has somewhat better performance.
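
The idea of fingerprinting an image by its visual content can be illustrated with a basic Local Binary Pattern. The sketch below is not the paper's perceptual hash, which the abstract does not specify: it computes plain 8-neighbor LBP codes on a tiny hypothetical grayscale grid and compares fingerprints by Hamming distance.

```python
def lbp_code(img, r, c):
    """8-bit Local Binary Pattern of pixel (r, c): one bit per neighbor >= center."""
    center = img[r][c]
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in neighbors:
        code = (code << 1) | int(img[r + dr][c + dc] >= center)
    return code

def fingerprint(img):
    """LBP codes of all interior pixels, in row-major order."""
    return tuple(
        lbp_code(img, r, c)
        for r in range(1, len(img) - 1)
        for c in range(1, len(img[0]) - 1)
    )

def hamming(fp_a, fp_b):
    """Number of differing bits between two equal-length fingerprints."""
    return sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))

# Hypothetical 4x4 grayscale image, a uniformly brighter copy, and a mirrored copy.
img = [
    [10, 20, 30, 40],
    [20, 50, 60, 30],
    [30, 70, 80, 20],
    [40, 30, 20, 10],
]
brighter = [[v + 10 for v in row] for row in img]
mirrored = [row[::-1] for row in img]
```

The brighter copy yields the same fingerprint as the original because LBP depends only on the ordering of neighboring intensities, while the mirrored copy does not; a deduplication system would treat fingerprints within some small Hamming distance as duplicates.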

Clustering Validity of Social Network Subgroup Using Attribute Similarity (속성유사도에 따른 사회연결망 서브그룹의 군집유효성)

  • Yoon, Han-Seong
    • Journal of Korea Society of Digital Industry and Information Management / v.17 no.1 / pp.75-84 / 2021
  • In big data analysis, social networks are increasingly used through relational data, that is, data describing the connections between entities such as people and objects. When relational data does not exist directly, a social network can be constructed by calculating relational data such as attribute similarity from the entities' attribute data and using it as links. This paper suggests a method for constructing a social network that uses the attribute similarity between entities as the connection relationship, together with a clustering method that uses subgroups of the constructed network, and evaluates the clustering validity of the results. The analysis results can vary depending on the type and characteristics of the data analyzed, the type of attribute similarity selected, and the criterion value. In addition, the clustering validity may not be consistent across evaluation methods. Therefore, careful selection and experimentation are necessary for better analysis results, and this dependence on the analysis target and clustering options is a limitation of the study. Further study is also needed to compare the performance of the proposed method with conventional methods such as k-means.
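
The construction described in this abstract can be sketched as follows. The abstract does not name the similarity measure or the subgroup definition, so Jaccard similarity and connected components are assumptions here, and the entities and attributes are hypothetical.

```python
def jaccard(a, b):
    """Attribute similarity: shared attributes divided by all attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_network(attributes, threshold):
    """Link two entities when their similarity meets the criterion value."""
    names = list(attributes)
    links = {n: set() for n in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if jaccard(attributes[u], attributes[v]) >= threshold:
                links[u].add(v)
                links[v].add(u)
    return links

def components(links):
    """Connected components of the network, used here as the subgroups."""
    seen, groups = set(), []
    for start in links:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(links[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical entities described only by attribute sets.
entities = {
    "A": {"sports", "music", "travel"},
    "B": {"sports", "music", "film"},
    "C": {"cooking", "gardening"},
    "D": {"cooking", "gardening", "film"},
}
strict = components(build_network(entities, threshold=0.5))
loose = components(build_network(entities, threshold=0.2))
```

With a criterion value of 0.5 the network splits into two subgroups, while at 0.2 the weak B-D link merges everything into one, which illustrates the sensitivity to the criterion value that the abstract notes.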

Incidence of Online Public Opinion on Guangzhou Simultaneous Renting and Purchasing Policy - A data mining application

  • Wang, Yancheng;Li, Haixian
    • Asian Journal for Public Opinion Research / v.5 no.4 / pp.266-284 / 2018
  • This paper adopts a big data research method and draws 491 data items from the Tianya Forum about the Simultaneous Renting and Purchasing policy of Guangzhou. The qualitative analysis software Nvivo11 is used to cluster the main questions raised in the forum about the policy. Through text clustering, 36 high-frequency words are obtained. Through grounded theory analysis, the driving factors behind people's doubts are summarized into 9 main categories and 3 core categories, and a model of driving factors for online forums is established. The study finds that resource factors are the key factor, economic factors are important drivers, and policy guidance factors are secondary drivers.

KOREAN TOPIC MODELING USING MATRIX DECOMPOSITION

  • June-Ho Lee;Hyun-Min Kim
    • East Asian mathematical journal / v.40 no.3 / pp.307-318 / 2024
  • This paper explores the application of matrix factorization, specifically CUR decomposition, in the clustering of Korean language documents by topic. It addresses the unique challenges of Natural Language Processing (NLP) in dealing with the Korean language's distinctive features, such as agglutinative words and morphological ambiguity. The study compares the effectiveness of Latent Semantic Analysis (LSA) using CUR decomposition with the classical Singular Value Decomposition (SVD) method in the context of Korean text. Experiments are conducted using Korean Wikipedia documents and newspaper data, providing insight into the accuracy and efficiency of these techniques. The findings demonstrate the potential of CUR decomposition to improve the accuracy of document clustering in Korean, offering a valuable approach to text mining and information retrieval in agglutinative languages.

A Study on the Visiting Areas Classification of Cargo Vehicles Using Dynamic Clustering Method (화물차량의 방문시설 공간설정 방법론 연구)

  • Bum Chul Cho;Eun A Cho
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.22 no.6 / pp.141-156 / 2023
  • This study aims to improve understanding of freight movement, crucial for logistics facility investment and policy making. It addresses the limitations of traditional freight truck traffic data, aggregated only at city and county levels, by developing a new methodology. This method uses trip chain data for more detailed, facility-level analysis of freight truck movements. It employs DTG (Digital Tachograph) data to identify individual truck visit locations and creates H3 system-based polygons to represent these visits spatially. The study also involves an algorithm to dynamically determine the optimal spatial resolution of these polygons. Tested nationally, the approach resulted in polygons with 81.26% spatial fit and 14.8% error rate, offering insights into freight characteristics and enabling clustering based on traffic chain characteristics of freight trucks and visited facility types.