• Title/Summary/Keyword: Data Clustering

ASVMRT: Materialized View Selection Algorithm in Data Warehouse

  • Yang, Jin-Hyuk;Chung, In-Jeong
    • Journal of Information Processing Systems / v.2 no.2 / pp.67-75 / 2006
  • To obtain a precise and quick response to an analytical query, proper selection of the views to materialize in the data warehouse is crucial. In traditional view selection algorithms, all relations are considered for selection as materialized views. However, materializing all relations rather than a subset results in much worse performance in terms of time and space costs. We therefore present an improved algorithm that uses clustering to select the views to materialize and overcomes this problem of conventional view selection algorithms. In the presented algorithm, ASVMRT (Algorithm for Selection of Views to Materialize using Reduced Table), we first generate reduced tables in the data warehouse using clustering based on attribute-value density, and then consider combinations of reduced tables as materialized views instead of combinations of the original base relations. To justify the proposed algorithm, we present experimental results in which both time and space costs are approximately 1.8 times better than those of conventional algorithms.

A Pattern Consistency Index for Detecting Heterogeneous Time Series in Clustering Time Course Gene Expression Data (시간경로 유전자 발현자료의 군집분석에서 이질적인 시계열의 탐지를 위한 패턴일치지수)

  • Son, Young-Sook;Baek, Jang-Sun
    • The Korean Journal of Applied Statistics / v.18 no.2 / pp.371-379 / 2005
  • In this paper, we propose a pattern consistency index for detecting heterogeneous time series that deviate from the representative pattern of each cluster in clustering time course gene expression data using the Pearson correlation coefficient. We examine its usefulness by applying this index to serum time course gene expression data from microarrays.
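
One way to picture the idea above is to correlate each time series with the mean profile of its cluster and flag the series that correlate poorly. The abstract does not state the formula of the pattern consistency index, so the following is only a sketch under that assumption; the function names `pearson` and `flag_heterogeneous`, the 0.5 threshold, and the toy data are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant series: correlation is undefined, treated as 0 here
    return cov / (sx * sy)

def flag_heterogeneous(cluster, threshold=0.5):
    """Indices of series whose correlation with the cluster mean profile is below threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    length = len(cluster[0])
    mean_profile = [sum(s[t] for s in cluster) / len(cluster) for t in range(length)]
    return [i for i, s in enumerate(cluster) if pearson(s, mean_profile) < threshold]

# Toy cluster: three rising profiles and one falling profile
cluster = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 1.8, 3.2, 3.9],
    [4.0, 3.0, 2.0, 1.0],  # runs against the cluster trend
]
outliers = flag_heterogeneous(cluster)
```

Here `outliers` contains only the index of the falling profile, since the mean profile still rises despite it.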

Parallel Structure Modeling of Nonlinear Process Using Clustering Method (클러스터링 기법을 이용한 비선형 공정의 병렬구조 모델링)

  • 박춘성;최재호;오성권;안태천
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 1997.10a / pp.383-386 / 1997
  • In this paper, we propose a parallel structure of neural network models for nonlinear complex systems. A neural network, which has learning ability and a high tolerance level, is used as the basic model; specifically, we use a neural network trained with the BP (error back-propagation) algorithm. However, such a model sometimes has difficulty reflecting the characteristics of the input data of a nonlinear system. We therefore use the HCM (hard c-means) clustering method to incorporate the properties of the input data. Clustering algorithms are used extensively not only to organize and categorize data, but also for data compression and model construction. A gas furnace and a sewage treatment process are used to evaluate the performance of the proposed model, which achieves higher accuracy than previous models.

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm (K-means 알고리즘 기반 클러스터링 인덱스 비교 연구)

  • Shim, Yo-Sung;Chung, Ji-Won;Choi, In-Chan
    • Asia pacific journal of information systems / v.16 no.1 / pp.127-144 / 2006
  • The K-means algorithm is widely used at the initial stage of data analysis in the data mining process, partly because of its low time complexity and the simplicity of its practical implementation. Cluster validity indices are used along with the algorithm to determine the number of clusters and to assess the clustering results of datasets. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature, while considering their applicability to nonhierarchical clustering algorithms. The data sets used in the experiment are generated from multivariate normal distributions. In particular, four error types are considered in the comparison: standardization, outlier generation, error perturbation, and noise dimension addition. Through the experiment, the effects of varying the number of points, attributes, and clusters on performance are analyzed. The results of the simulation experiment show that the Calinski and Harabasz index performs best across all datasets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the dataset increases.
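
As a rough illustration of using a validity index to pick the number of clusters, the sketch below scores K-means partitions with the Calinski and Harabasz index, the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. It is not the experimental setup of the paper: the farthest-point initialization, the helper names, and the toy data are assumptions chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=100):
    """Lloyd's K-means with deterministic farthest-point initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return labels

def calinski_harabasz(points, labels):
    """CH = (B / (k - 1)) / (W / (n - k)); assumes W > 0 and 1 < k < n."""
    n, ids = len(points), sorted(set(labels))
    k = len(ids)
    overall = centroid(points)
    between = within = 0.0
    for j in ids:
        members = [p for p, l in zip(points, labels) if l == j]
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two well-separated square blobs
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
scores = {k: calinski_harabasz(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
```

On this toy data the index is highest for two clusters, which matches how the points were constructed. Real data would need several random restarts and a wider range of candidate cluster counts.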

Data Fusion, Ensemble and Clustering for the Severity Classification of Road Traffic Accident in Korea (데이터융합, 앙상블과 클러스터링을 이용한 교통사고 심각도 분류분석)

  • Sohn, So-Young;Lee, Sung-Ho
    • Journal of Korean Institute of Industrial Engineers / v.26 no.4 / pp.354-362 / 2000
  • The increasing amount of road traffic in the 1990s has drawn much attention in Korea because of its influence on safety. Various types of data analysis are performed to examine the relationship between the severity of road traffic accidents and driving conditions based on traffic accident records. Accurate results from such accident data analysis can provide crucial information for road accident prevention policy. In this paper, we apply several data fusion, ensemble, and clustering algorithms in an effort to increase the accuracy of individual classifiers of accident severity. The results of an empirical study indicate that clustering works best for road traffic accident classification in Korea.

A Study on the Integration Between Smart Mobility Technology and Information Communication Technology (ICT) Using Patent Analysis

  • Alkaabi, Khaled Sulaiman Khalfan Sulaiman;Yu, Jiwon
    • Journal of the Korea Society of Computer and Information / v.24 no.6 / pp.89-97 / 2019
  • This study proposes a method for investigating current patents related to information communication technology and smart mobility to provide insights into future technology trends. The method is based on text mining clustering analysis. The method consists of two stages, which are data preparation and clustering analysis, respectively. In the first stage, tokenizing, filtering, stemming, and feature selection are implemented to transform the data into a usable format (structured data) and to extract useful information for the next stage. In the second stage, the structured data is partitioned into groups. The K-medoids algorithm is selected over the K-means algorithm for this analysis owing to its advantages in dealing with noise and outliers. The results of the analysis indicate that most current patents focus mainly on smart connectivity and smart guide systems, which play a major role in the development of smart mobility.
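
The robustness argument for K-medoids can be seen on a single cluster: a mean is pulled toward an outlier, while a medoid must be one of the observed points. The snippet below is a minimal sketch of that contrast only, not the patent-clustering pipeline described above; the helper names and the toy points are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def mean_point(pts):
    """Coordinate-wise mean, the cluster representative used by K-means."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def medoid(pts):
    """The observed point with the smallest total distance to all others,
    the cluster representative used by K-medoids."""
    return min(pts, key=lambda c: sum(dist(c, p) for p in pts))

# Four nearby points plus one far outlier
pts = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]
center_mean = mean_point(pts)
center_medoid = medoid(pts)
```

With the outlier included, `center_mean` lands far from the four nearby points, whereas `center_medoid` remains one of them.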

Analysis of Massive Scholarly Keywords using Inverted-Index based Bottom-up Clustering (역인덱스 기반 상향식 군집화 기법을 이용한 대규모 학술 핵심어 분석)

  • Oh, Heung-Seon;Jung, Yuchul
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.11 / pp.758-764 / 2018
  • Digital documents such as patents, scholarly papers, and research reports have author keywords that summarize the topics of the documents. Different documents are likely to describe the same topic if they share the same keywords. Document clustering aims to group documents on similar topics using an unsupervised learning method. However, although document clustering is used in various kinds of data analysis, it is difficult to apply to a large number of documents because of its computational complexity. In this case, massive document collections can be clustered and connected efficiently using keywords. Existing bottom-up hierarchical clustering has high computational and time complexity when clustering a large number of keywords. This paper proposes an inverted-index-based bottom-up clustering method for keywords and analyzes the results of clustering massive keywords extracted from scholarly papers and research reports.
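
The abstract does not describe the algorithm in detail, so the following is only one plausible reading of how an inverted index can cut the cost of bottom-up keyword clustering: the index maps each keyword to the documents that contain it, and only keyword pairs that co-occur in some document are ever compared. The Jaccard similarity, the 0.5 threshold, the single merge pass with union-find, and the function name `cluster_keywords` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cluster_keywords(docs, threshold=0.5):
    """Group keywords whose document sets overlap strongly (hypothetical sketch)."""
    # Inverted index: keyword -> set of ids of documents containing it
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index[kw].add(doc_id)

    # Union-find over keywords; each keyword starts as its own cluster
    parent = {kw: kw for kw in index}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Candidate pairs come only from keywords sharing a document,
    # so unrelated keyword pairs are never compared
    seen = set()
    for keywords in docs.values():
        for a, b in combinations(sorted(set(keywords)), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            jaccard = len(index[a] & index[b]) / len(index[a] | index[b])
            if jaccard >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for kw in index:
        clusters[find(kw)].add(kw)
    return sorted(sorted(c) for c in clusters.values())

docs = {
    "d1": ["clustering", "k-means"],
    "d2": ["clustering", "k-means", "entropy"],
    "d3": ["sensor network", "LEACH"],
    "d4": ["sensor network", "LEACH", "entropy"],
}
result = cluster_keywords(docs)
```

In this toy input, "entropy" appears alongside both groups but overlaps too weakly with either to be merged, so it stays in its own cluster. A full agglomerative method would repeat the merge step over several levels rather than in one pass.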

Efficient Clustering Algorithm based on Data Entropy for Changing Environment (상황변화에 따른 엔트로피 기반의 클러스터 구성 알고리즘)

  • Choi, Yun-Jeong
    • Journal of the Korea Academia-Industrial cooperation Society / v.10 no.12 / pp.3675-3681 / 2009
  • Among the most important factors in the lifetime of a WSN (wireless sensor network) are the limited resources and static control of the sensor nodes. To achieve energy efficiency and network utility, sensor nodes can be organized into clusters, with head nodes and normal nodes selected according to dynamic conditions. Various clustering algorithms based on the LEACH algorithm have been proposed as an efficient way to organize nodes. In this paper, we propose an efficient clustering algorithm, based on LEACH and using information entropy theory, that can recognize environmental differences from changes in sensor node data. To measure and analyze changes in the clusters, we compute the entropy of the sensor data and apply it to a probability-based clustering algorithm. In experiments, we simulate the proposed method and the LEACH algorithm, and show that our data-balanced, energy-efficient scheme achieves high energy efficiency and network lifetime under two conditions.
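
The two ingredients named above can each be sketched on their own: the Shannon entropy of a node's readings, and the cluster-head election threshold commonly given for LEACH, T = p / (1 - p (r mod 1/p)), for nodes that have not yet served as head in the current epoch. The abstract does not say how the entropy enters the election probability, so the sketch below deliberately leaves the two pieces uncombined. Treating readings as already discretized values and rounding 1/p to an integer epoch length are assumptions.

```python
from collections import Counter
from math import log2

def shannon_entropy(readings):
    """Shannon entropy in bits of a sequence of discretized sensor readings."""
    n = len(readings)
    counts = Counter(readings)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def leach_threshold(p, r):
    """LEACH election threshold for round r with desired cluster-head fraction p.

    Applies to nodes that have not been cluster head in the current epoch;
    a node becomes head if a uniform random draw in [0, 1) falls below it.
    """
    epoch = round(1 / p)  # rounds per epoch, assumed to be an integer
    return p / (1 - p * (r % epoch))

steady = shannon_entropy([20, 20, 20, 20])    # no variation in the readings
changing = shannon_entropy([20, 21, 22, 23])  # four equally likely values
first_round = leach_threshold(0.1, 0)
last_round = leach_threshold(0.1, 9)
```

The threshold rises through an epoch so that nodes that have not yet served become increasingly likely to be elected, and a node with changing readings has higher entropy than one with steady readings.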

A Comparison and Analysis on High-Dimensional Clustering Techniques for Data Mining (데이터 마이닝을 위한 고차원 클러스터링 기법에 관한 비교 분석 연구)

  • 김홍일;이혜명
    • Journal of the Korea Computer Industry Society / v.4 no.12 / pp.887-900 / 2003
  • Many applications require the clustering of large amounts of high-dimensional data. Many automated clustering techniques have been developed, but most do not work effectively or efficiently on high-dimensional (numerical) data because of the so-called “curse of dimensionality”. Moreover, high-dimensional data often contain a significant amount of noise, which further reduces the effectiveness of the algorithms. It is therefore necessary to examine the structure and characteristics of high-dimensional data and to develop algorithms that support clustering adapted to applications of high-dimensional databases. In this paper, we investigate and classify the existing high-dimensional clustering methods by analyzing and comparing the strengths and weaknesses of each method for specific applications. In particular, we compare the traditional algorithms with CLIP, which we developed, in terms of efficiency and effectiveness. This study will contribute to the development of more advanced algorithms than the current ones.

Design of Meteorological Radar Pattern Classifier Using Clustering-based RBFNNs : Comparative Studies and Analysis (클러스터링 기반 RBFNNs를 이용한 기상레이더 패턴분류기 설계 : 비교 연구 및 해석)

  • Choi, Woo-Yong;Oh, Sung-Kwun
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.5 / pp.536-541 / 2014
  • Data obtained from meteorological radar include ground echo, sea-clutter echo, anomalous propagation echo, clear echo, and so on. Each of these is a kind of non-precipitation echo, and the characteristics of the individual echoes are analyzed in order to identify non-precipitation echoes. The meteorological radar data are analyzed after a pre-processing procedure because they are provided as big data. In this study, an echo pattern classifier is designed to distinguish non-precipitation echoes from precipitation echoes in meteorological radar data using RBFNNs and an echo judgement module. The output performance of HCM clustering-based RBFNNs and FCM clustering-based RBFNNs is compared and analyzed.