• Title/Summary/Keyword: Data Clustering


A Parameter-Free Approach for Clustering and Outlier Detection in Image Databases (이미지 데이터베이스에서 매개변수를 필요로 하지 않는 클러스터링 및 아웃라이어 검출 방법)

  • Oh, Hyun-Kyo;Yoon, Seok-Ho;Kim, Sang-Wook
    • Journal of the Institute of Electronics Engineers of Korea CI / v.47 no.1 / pp.80-91 / 2010
  • As the volume of image data increases dramatically, good organization of image data becomes crucial for efficient image retrieval. Clustering is a typical way of organizing image data. However, traditional clustering methods require the user to provide the number of clusters as a parameter before clustering. In this paper, we discuss an approach for clustering image data that does not require this parameter. The proposed approach is based on Cross-Association, which finds structure or patterns hidden in data using the relationships between individual objects. To apply Cross-Association to the clustering of image data, we first convert the image data into a graph. We then perform Cross-Association on the resulting graph and interpret the results from a clustering perspective. We also propose a hierarchical clustering method and an outlier detection method based on Cross-Association. Through a series of experiments, we verify the effectiveness of the proposed approach. Finally, we discuss how to find a good value of k for the k-nearest neighbor search and compare the clustering results obtained with symmetric and asymmetric ways of building the graph.
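
The difference between the symmetric and asymmetric ways of building a k-nearest-neighbor graph can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_graph`, the Euclidean distance, and the toy points are assumptions, and the Cross-Association step itself is not shown.

```python
import math

def knn_graph(points, k, symmetric=False):
    """Return a set of directed edges (i, j) from each point to its k nearest neighbours.

    With symmetric=True the reverse edge (j, i) is added as well, so the
    adjacency matrix of the graph becomes symmetric.
    """
    edges = set()
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p (assumed metric).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((i, j))
            if symmetric:
                edges.add((j, i))
    return edges

# Hypothetical feature vectors standing in for image descriptors.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
asym = knn_graph(points, k=1)
sym = knn_graph(points, k=1, symmetric=True)
print(sorted(asym))  # [(0, 1), (1, 0), (2, 1), (3, 2)]
print(sorted(sym))   # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```

In the asymmetric graph the isolated point 3 links to point 2, but nothing links back to it; this kind of one-way relationship is what distinguishes the two constructions.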

RHadoop platform for K-Means clustering of big data (빅데이터 K-평균 클러스터링을 위한 RHadoop 플랫폼)

  • Shin, Ji Eun;Oh, Yoon Sik;Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society / v.27 no.3 / pp.609-619 / 2016
  • RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. In this paper, we implement the K-Means algorithm on the MapReduce framework with RHadoop to make the clustering method applicable to large-scale data. The main idea is to introduce a combiner that operates on the map output to decrease the amount of data that the reducers need to process. We show that our K-Means algorithm using RHadoop with a combiner becomes faster than the regular algorithm without a combiner as the size of the data set increases. We also implement the Elbow method with MapReduce to find the optimum number of clusters for K-Means clustering on large data sets. A comparison of our MapReduce implementation of the Elbow method with the classical kmeans() in R on small data showed similar results.
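
The role of the combiner can be sketched in plain Python rather than RHadoop. This is an illustration only: the function names, the in-memory "chunks" standing in for input splits, and the toy data are assumptions, and no Hadoop API is used.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )

def map_phase(chunk, centroids):
    """Map: emit (cluster_id, (point, 1)) for every point in one input split."""
    return [(nearest(p, centroids), (p, 1)) for p in chunk]

def combine(pairs):
    """Combiner: collapse records into one (partial_sum, count) per cluster."""
    acc = {}
    for cid, (vec, n) in pairs:
        if cid in acc:
            s, m = acc[cid]
            acc[cid] = ([a + b for a, b in zip(s, vec)], m + n)
        else:
            acc[cid] = (list(vec), n)
    return list(acc.items())

def reduce_phase(all_pairs, centroids):
    """Reduce: merge partial sums and return the updated centroids."""
    merged = dict(combine(all_pairs))
    return [
        tuple(x / merged[c][1] for x in merged[c][0]) if c in merged else centroids[c]
        for c in range(len(centroids))
    ]

def kmeans_iteration(chunks, centroids, use_combiner=True):
    """One K-Means iteration; also reports how many records reach the reducer."""
    shuffled = []
    for chunk in chunks:
        out = map_phase(chunk, centroids)
        shuffled.extend(combine(out) if use_combiner else out)
    return reduce_phase(shuffled, centroids), len(shuffled)

# Two hypothetical input splits and two initial centroids.
chunks = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
centroids = [(0, 0), (10, 10)]
with_c, n_with = kmeans_iteration(chunks, centroids, use_combiner=True)
without_c, n_without = kmeans_iteration(chunks, centroids, use_combiner=False)
print(with_c, n_with, n_without)  # [(0.0, 1.0), (10.0, 11.0)] 2 4
```

Both variants produce the same centroids; the combiner only reduces the number of records passed from the mappers to the reducers, which is where the reported speed-up on large data would come from.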

Cluster analysis by month for meteorological stations using a gridded data of numerical model with temperatures and precipitation (기온과 강수량의 수치모델 격자자료를 이용한 기상관측지점의 월별 군집화)

  • Kim, Hee-Kyung;Kim, Kwang-Sub;Lee, Jae-Won;Lee, Yung-Seop
    • Journal of the Korean Data and Information Science Society / v.28 no.5 / pp.1133-1144 / 2017
  • Cluster analysis of meteorological data makes it possible to segment meteorological regions based on their meteorological characteristics. However, observed meteorological data are not adequate for cluster analysis because the meteorological stations that observe the data are not uniformly located. Therefore, clustering of observed meteorological data cannot properly reflect the climate characteristics of South Korea. Clustering of $5\,\mathrm{km} \times 5\,\mathrm{km}$ gridded data derived from a numerical model, on the other hand, reflects them evenly. In this study, we analyzed long-term gridded data for temperature and precipitation using cluster analysis. Because climate characteristics differ by month, clustering was performed by month. Since the result of K-Means cluster analysis is sensitive to initial values, we obtained the initial values with the Ward method, a hierarchical cluster analysis method. Based on the clustering of the gridded data, the clusters of meteorological stations were determined. As a result, the clustering of meteorological stations in South Korea yields a spatio-temporal segmentation.

A Performance Improvement Study On Hierarchical Clustering (Centroid Linkage) Using A Priority Queue (Priority Queue 를 이용한 Hierarchical Clustering (Centroid Linkage) 성능 개선)

  • Jeon, Yongkweon;Yoon, Sungroh
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.1837-1838 / 2010
  • The time and space complexity of conventional hierarchical clustering make it unsuitable for clustering large data sets, and such data are difficult to handle within the memory of an ordinary PC. To overcome these difficulties, this study proposes a new algorithm for centroid linkage, one of the existing hierarchical clustering methods, that uses less memory and runs faster.
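
The abstract does not describe the proposed algorithm, so the following is only a generic sketch of how a priority queue can drive centroid-linkage clustering. It assumes a binary heap with lazy deletion of stale entries and does not reproduce the memory savings claimed in the paper; all names are hypothetical.

```python
import heapq
import itertools

def centroid_linkage(points):
    """Agglomerative clustering with centroid linkage; returns the list of merges."""
    # Active clusters: id -> (centroid, size). Ids are never reused.
    clusters = {i: (tuple(p), 1) for i, p in enumerate(points)}

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Min-heap of candidate merges keyed by squared centroid distance.
    heap = [
        (sqdist(clusters[i][0], clusters[j][0]), i, j)
        for i, j in itertools.combinations(clusters, 2)
    ]
    heapq.heapify(heap)

    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        _, i, j = heapq.heappop(heap)
        # Lazy deletion: skip entries that mention an already merged cluster.
        if i not in clusters or j not in clusters:
            continue
        (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
        # The new centroid is the size-weighted mean of the two old centroids.
        centroid = tuple((ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj))
        for k, (ck, _) in clusters.items():
            heapq.heappush(heap, (sqdist(centroid, ck), k, next_id))
        clusters[next_id] = (centroid, ni + nj)
        merges.append((i, j, next_id))
        next_id += 1
    return merges

merges = centroid_linkage([(0.0,), (1.0,), (10.0,), (12.0,)])
print(merges)  # [(0, 1, 4), (2, 3, 5), (4, 5, 6)]
```

Each tuple reads "clusters i and j were merged into a new cluster with the third id". The heap replaces a full rescan of the distance matrix when looking for the closest pair at each step.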

A New Approach to Spatial Pattern Clustering based on Longest Common Subsequence with application to a Grocery (공간적 패턴클러스터링을 위한 새로운 접근방법의 제안 : 슈퍼마켓고객의 동선분석)

  • Jung, In-Chul;Kwon, Young-S.
    • IE interfaces / v.24 no.4 / pp.447-456 / 2011
  • Identifying the major movement patterns of shoppers on the selling floor has been a longstanding issue in the retailing industry. With the advent of RFID technology, it has become easier to collect movement data for an individual shopper. Most previous studies used traditional clustering techniques to identify the major movement patterns of customers. However, because of spatial constraints (aisle layout or other physical obstructions in the store), standard clustering methods are not directly feasible for movement data such as shopping paths: the data must be adjusted in advance of the analysis, which is time-consuming and causes data distortion. To alleviate these problems, we propose a new approach to spatial pattern clustering based on the longest common subsequence (LCSS). Experimental results using real data obtained from a grocery store in Seoul show that the proposed method performs well in finding hot spots and dead spots as well as in finding the major path patterns of customer movements.
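
As an illustration of the similarity measure underlying this approach, the sketch below computes the longest common subsequence of two shopping paths, each represented as a sequence of visited zones. The zone names and the normalisation by the shorter path are assumptions made for the example, not details taken from the paper.

```python
def lcss_length(a, b):
    """Length of the longest common subsequence of two zone sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal zones, otherwise keep the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcss_similarity(a, b):
    """Normalise by the shorter path so the score lies in [0, 1] (assumed choice)."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b) / min(len(a), len(b))

# Hypothetical zone sequences for two shoppers.
path_1 = ["entrance", "produce", "dairy", "bakery", "checkout"]
path_2 = ["entrance", "dairy", "snacks", "bakery", "checkout"]
print(lcss_length(path_1, path_2))      # 4
print(lcss_similarity(path_1, path_2))  # 0.8
```

Because LCSS compares ordered zone visits rather than raw coordinates, detours around aisles do not have to be corrected before two paths are compared.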

An eigenspace projection clustering method for structural damage detection

  • Zhu, Jun-Hua;Yu, Ling;Yu, Li-Li
    • Structural Engineering and Mechanics / v.44 no.2 / pp.179-196 / 2012
  • An eigenspace projection clustering method is proposed for structural damage detection by combining a projection algorithm with a fuzzy clustering technique. The integrated procedure for structural damage assessment includes data selection, data normalization, projection, damage feature extraction, and clustering. The frequency response functions (FRFs) of the healthy and the damaged structure are used as initial data, median values of the projections are taken as damage features, and the fuzzy c-means (FCM) algorithm is used to categorize these features. The performance of the proposed method has been validated using a three-story frame structure built and tested by Los Alamos National Laboratory, USA. Two projection algorithms, namely principal component analysis (PCA) and kernel principal component analysis (KPCA), are compared for better extraction of damage features; furthermore, six kinds of distances adopted in the FCM process are studied and discussed. The illustrated results reveal that the distance selection depends on the distribution of the features. For the optimal choice of projections, it is recommended that the Cosine distance be used for PCA, while the Seuclidean distance and the Cityblock distance are suitable for KPCA. The PCA method is recommended when a large amount of data needs to be processed, because it yields more correct decisions at lower computational cost.
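
The FCM step can be sketched with the standard fuzzy c-means updates and a pluggable distance function. This is a minimal illustration, not the authors' procedure: the function names, the fuzzifier m = 2, the Euclidean default, and the one-dimensional toy features are assumptions, and the weighted-mean center update is strictly optimal only for the Euclidean distance.

```python
import math

def fcm_memberships(points, centers, m=2.0, dist=math.dist):
    """Membership of every point in every cluster for fixed centers."""
    power = 2.0 / (m - 1.0)
    table = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if 0.0 in d:
            # A point lying exactly on a center belongs fully to it.
            row = [1.0 if x == 0.0 else 0.0 for x in d]
            total = sum(row)
            row = [x / total for x in row]
        else:
            row = [1.0 / sum((di / dj) ** power for dj in d) for di in d]
        table.append(row)
    return table

def fcm_centers(points, memberships, m=2.0):
    """Centers as means of the points weighted by membership ** m."""
    centers = []
    for i in range(len(memberships[0])):
        w = [row[i] ** m for row in memberships]
        total = sum(w)
        centers.append(tuple(
            sum(wk * p[dim] for wk, p in zip(w, points)) / total
            for dim in range(len(points[0]))
        ))
    return centers

def fuzzy_c_means(points, centers, m=2.0, dist=math.dist, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = fcm_memberships(points, centers, m, dist)
        centers = fcm_centers(points, u, m)
    return centers, fcm_memberships(points, centers, m, dist)

# Hypothetical one-dimensional damage features forming two groups.
features = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers, u = fuzzy_c_means(features, centers=[(0.0,), (10.0,)])
```

A different metric, such as a city-block distance, can be tried by passing another callable as `dist`, which mirrors the abstract's comparison of distance choices.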

Emergent damage pattern recognition using immune network theory

  • Chen, Bo;Zang, Chuanzhi
    • Smart Structures and Systems / v.8 no.1 / pp.69-92 / 2011
  • This paper presents an emergent pattern recognition approach based on immune network theory and hierarchical clustering algorithms. The immune network allows its components to change and learn patterns by changing the strength of the connections between individual components. The presented immune-network-based approach achieves emergent pattern recognition by dynamically generating an internal image for the input data patterns. The members (feature vectors for each data pattern) of the internal image are produced by an immune network model to form a network of antibody memory cells. To classify antibody memory cells into different data patterns, hierarchical clustering algorithms are used to create an antibody memory cell clustering. In addition, evaluation graphs and the L method are used to determine the best number of clusters for the antibody memory cell clustering. The presented immune-network-based emergent pattern recognition (INEPR) algorithm can automatically generate an internal image mapping to the input data patterns without the need to specify the number of patterns in advance. The INEPR algorithm has been tested using a benchmark civil structure. The test results show that the INEPR algorithm is able to recognize new structural damage patterns.

High-Dimensional Clustering Technique using Incremental Projection (점진적 프로젝션을 이용한 고차원 글러스터링 기법)

  • Lee, Hye-Myung;Park, Young-Bae
    • Journal of KIISE:Databases / v.28 no.4 / pp.568-576 / 2001
  • Most clustering algorithms tend to degenerate rapidly in high-dimensional spaces. Moreover, high-dimensional data often contain a significant amount of noise, which makes the algorithms even less effective. It is therefore necessary to develop algorithms adapted to the structure and characteristics of high-dimensional data. In this paper, we propose a clustering algorithm, CLIP, that uses projection. CLIP is designed to overcome the efficiency and/or effectiveness problems of high-dimensional clustering. It is based on clustering in each one-dimensional subspace, but uses incremental projection to recover high-dimensional clusters and, at the same time, to reduce the computational cost significantly. To evaluate the performance of CLIP, we demonstrate its efficiency and effectiveness through a series of experiments on synthetic data sets.


Dynamic Subspace Clustering for Online Data Streams (온라인 데이터 스트림에서의 동적 부분 공간 클러스터링 기법)

  • Park, Nam Hun
    • Journal of Digital Convergence / v.20 no.2 / pp.217-223 / 2022
  • Subspace clustering for online data streams requires a large amount of memory because all subsets of the data dimensions must be examined. To track the continuous change of clusters in a data stream within a finite memory space, we propose a grid-based subspace clustering algorithm that uses memory resources effectively. Given an n-dimensional data stream, the distribution of data items in the data space is monitored by a grid-cell list. When the frequency of data items in a grid cell of the first-level list becomes high and the cell becomes a unit grid-cell, the grid-cell list of the next level is created as a child node in order to find clusters in all possible subspaces of that grid cell. In this way, a grid-cell subspace tree with at most n levels is constructed, and a k-dimensional subspace cluster can be found at the kth level of the tree. Experiments confirmed that the proposed method uses computing resources more efficiently by expanding only the dense space while maintaining the same accuracy as the existing method.
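
The idea of expanding only dense grid cells can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: it uses fixed-width cells, a plain count threshold in place of the paper's frequency criterion, no decay of old items, and hypothetical names throughout.

```python
class GridCellNode:
    """One grid cell on one dimension, with child cells on higher-numbered dimensions."""

    def __init__(self):
        self.count = 0
        self.children = {}  # (dimension, cell_index) -> GridCellNode

def update(cells, item, cell_width, threshold, start_dim=0):
    """Insert one stream item, creating the next level only under dense cells."""
    for dim in range(start_dim, len(item)):
        key = (dim, int(item[dim] // cell_width))
        node = cells.setdefault(key, GridCellNode())
        node.count += 1
        if node.count >= threshold:
            # Dense cell: monitor the remaining dimensions inside it.
            update(node.children, item, cell_width, threshold, dim + 1)

def dense_cells(cells, threshold, prefix=()):
    """List dense cells as paths; a path of length k is a k-dimensional subspace cell."""
    found = []
    for key, node in sorted(cells.items()):
        if node.count >= threshold:
            path = prefix + (key,)
            found.append(path)
            found.extend(dense_cells(node.children, threshold, path))
    return found

# Hypothetical two-dimensional stream.
stream = [(0.2, 0.3), (0.5, 0.1), (0.7, 0.9), (0.1, 0.4),
          (0.3, 0.6), (0.6, 0.2), (5.5, 0.5)]
root = {}
for item in stream:
    update(root, item, cell_width=1.0, threshold=3)
result = dense_cells(root, threshold=3)
print(result)  # [((0, 0),), ((0, 0), (1, 0)), ((1, 0),)]
```

The sparse cell containing 5.5 on dimension 0 never gets children, so memory is spent only on regions that can still contain a cluster.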

Maximizing Information Transmission for Energy Harvesting Sensor Networks by an Uneven Clustering Protocol and Energy Management

  • Ge, Yujia;Nan, Yurong;Chen, Yi
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.4 / pp.1419-1436 / 2020
  • For an energy harvesting sensor network, when network lifetime is not the only primary goal, maximizing network performance under environmental energy harvesting becomes a more critical issue. However, clustering protocols that aim at providing maximum information throughput have not been thoroughly explored in Energy Harvesting Wireless Sensor Networks (EH-WSNs). In this paper, clustering protocols are studied for maximizing data transmission in the whole network. Based on a long short-term memory (LSTM) energy predictor and models of node energy consumption and supplement, an uneven clustering protocol is proposed in which cluster head selection and cluster size control are designed for this purpose. Simulation results verify that the proposed scheme can outperform some classic schemes, with more data packets received by the cluster heads (CHs) and the base station (BS) under these energy constraints. The outcomes of this paper also provide some insights for choosing clustering routing protocols in EH-WSNs by exploiting factors such as uneven cluster size, number of clusters, multiple CHs, multihop routing strategy, and energy supplementing period.