• Title/Summary/Keyword: NN clustering

Search Result 34, Processing Time 0.047 seconds

Robust Similarity Measure for Spectral Clustering Based on Shared Neighbors

  • Ye, Xiucai;Sakurai, Tetsuya
    • ETRI Journal
    • /
    • v.38 no.3
    • /
    • pp.540-550
    • /
    • 2016
  • Spectral clustering is a powerful tool for exploratory data analysis. Many existing spectral clustering algorithms typically measure the similarity by using a Gaussian kernel function or an undirected k-nearest neighbor (kNN) graph, which cannot reveal the real clusters when the data are not well separated. In this paper, to improve the spectral clustering, we consider a robust similarity measure based on the shared nearest neighbors in a directed kNN graph. We propose two novel algorithms for spectral clustering: one based on the number of shared nearest neighbors, and one based on their closeness. The proposed algorithms are able to explore the underlying similarity relationships between data points, and are robust to datasets that are not well separated. Moreover, the proposed algorithms have only one parameter, k. We evaluated the proposed algorithms using synthetic and real-world datasets. The experimental results demonstrate that the proposed algorithms not only achieve a good level of performance, they also outperform the traditional spectral clustering algorithms.

Plurality Rule-based Density and Correlation Coefficient-based Clustering for K-NN

  • Aung, Swe Swe;Nagayama, Itaru;Tamaki, Shiro
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.6 no.3
    • /
    • pp.183-192
    • /
    • 2017
  • k-nearest neighbor (K-NN) is a well-known classification algorithm, being feature space-based on nearest-neighbor training examples in machine learning. However, K-NN, as we know, is a lazy learning method. Therefore, if a K-NN-based system very much depends on a huge amount of history data to achieve an accurate prediction result for a particular task, it gradually faces a processing-time performance-degradation problem. We have noticed that many researchers usually contemplate only classification accuracy. But estimation speed also plays an essential role in real-time prediction systems. To compensate for this weakness, this paper proposes correlation coefficient-based clustering (CCC) aimed at upgrading the performance of K-NN by leveraging processing-time speed and plurality rule-based density (PRD) to improve estimation accuracy. For experiments, we used real datasets (on breast cancer, breast tissue, heart, and the iris) from the University of California, Irvine (UCI) machine learning repository. Moreover, real traffic data collected from Ojana Junction, Route 58, Okinawa, Japan, was also utilized to lay bare the efficiency of this method. By using these datasets, we proved better processing-time performance with the new approach by comparing it with classical K-NN. Besides, via experiments on real-world datasets, we compared the prediction accuracy of our approach with density peaks clustering based on K-NN and principal component analysis (DPC-KNN-PCA).

Regional Extension of the Neural Network Model for Storm Surge Prediction Using Cluster Analysis (군집분석을 이용한 국지해일모델 지역확장)

  • Lee, Da-Un;Seo, Jang-Won;Youn, Yong-Hoon
    • Atmosphere
    • /
    • v.16 no.4
    • /
    • pp.259-267
    • /
    • 2006
  • In the present study, the neural network (NN) model with cluster analysis method was developed to predict storm surge in the whole Korean coastal regions with special focuses on the regional extension. The model used in this study is NN model for each cluster (CL-NN) with the cluster analysis. In order to find the optimal clustering of the stations, agglomerative method among hierarchical clustering methods was used. Various stations were clustered each other according to the centroid-linkage criterion and the cluster analysis should stop when the distances between merged groups exceed any criterion. Finally the CL-NN can be constructed for predicting storm surge in the cluster regions. To validate model results, predicted sea level value from CL-NN model was compared with that of conventional harmonic analysis (HA) and of the NN model in each region. The forecast values from NN and CL-NN models show more accuracy with observed data than that of HA. Especially the statistics analysis such as RMSE and correlation coefficient shows little differences between CL-NN and NN model results. These results show that cluster analysis and CL-NN model can be applied in the regional storm surge prediction and developed forecast system.

A Generic Algorithm for k-Nearest Neighbor Graph Construction Based on Balanced Canopy Clustering (Balanced Canopy Clustering에 기반한 일반적 k-인접 이웃 그래프 생성 알고리즘)

  • Park, Youngki;Hwang, Heasoo;Lee, Sang-Goo
    • KIISE Transactions on Computing Practices
    • /
    • v.21 no.4
    • /
    • pp.327-332
    • /
    • 2015
  • Constructing a k-nearest neighbor (k-NN) graph is a primitive operation in the field of recommender systems, information retrieval, data mining and machine learning. Although there have been many algorithms proposed for constructing a k-NN graph, either the existing approaches cannot be used for various types of similarity measures, or the performance of the approaches is decreased as the number of nodes or dimensions increases. In this paper, we present a novel algorithm for k-NN graph construction based on "balanced" canopy clustering. The experimental results show that irrespective of the number of nodes or dimensions, our algorithm is at least five times faster than the brute-force approach while retaining an accuracy of approximately 92%.

Design and Implementation of the Ensemble-based Classification Model by Using k-means Clustering

  • Song, Sung-Yeol;Khil, A-Ra
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.10
    • /
    • pp.31-38
    • /
    • 2015
  • In this paper, we propose the ensemble-based classification model which extracts just new data patterns from the streaming-data by using clustering and generates new classification models to be added to the ensemble in order to reduce the number of data labeling while it keeps the accuracy of the existing system. The proposed technique performs clustering of similar patterned data from streaming data. It performs the data labeling to each cluster at the point when a certain amount of data has been gathered. The proposed technique applies the K-NN technique to the classification model unit in order to keep the accuracy of the existing system while it uses a small amount of data. The proposed technique is efficient as using about 3% less data comparing with the existing technique as shown the simulation results for benchmarks, thereby using clustering.

Fast k-NN based Malware Analysis in a Massive Malware Environment

  • Hwang, Jun-ho;Kwak, Jin;Lee, Tae-jin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.12
    • /
    • pp.6145-6158
    • /
    • 2019
  • It is a challenge for the current security industry to respond to a large number of malicious codes distributed indiscriminately as well as intelligent APT attacks. As a result, studies using machine learning algorithms are being conducted as proactive prevention rather than post processing. The k-NN algorithm is widely used because it is intuitive and suitable for handling malicious code as unstructured data. In addition, in the malicious code analysis domain, the k-NN algorithm is easy to classify malicious codes based on previously analyzed malicious codes. For example, it is possible to classify malicious code families or analyze malicious code variants through similarity analysis with existing malicious codes. However, the main disadvantage of the k-NN algorithm is that the search time increases as the learning data increases. We propose a fast k-NN algorithm which improves the computation speed problem while taking the value of the k-NN algorithm. In the test environment, the k-NN algorithm was able to perform with only the comparison of the average of similarity of 19.71 times for 6.25 million malicious codes. Considering the way the algorithm works, Fast k-NN algorithm can also be used to search all data that can be vectorized as well as malware and SSDEEP. In the future, it is expected that if the k-NN approach is needed, and the central node can be effectively selected for clustering of large amount of data in various environments, it will be possible to design a sophisticated machine learning based system.

Spectral Clustering with Sparse Graph Construction Based on Markov Random Walk

  • Cao, Jiangzhong;Chen, Pei;Ling, Bingo Wing-Kuen;Yang, Zhijing;Dai, Qingyun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.9 no.7
    • /
    • pp.2568-2584
    • /
    • 2015
  • Spectral clustering has become one of the most popular clustering approaches in recent years. Similarity graph constructed on the data is one of the key factors that influence the performance of spectral clustering. However, the similarity graphs constructed by existing methods usually contain some unreliable edges. To construct reliable similarity graph for spectral clustering, an efficient method based on Markov random walk (MRW) is proposed in this paper. In the proposed method, theMRW model is defined on the raw k-NN graph and the neighbors of each sample are determined by the probability of the MRW. Since the high order transition probabilities carry complex relationships among data, the neighbors in the graph determined by our proposed method are more reliable than those of the existing methods. Experiments are performed on the synthetic and real-world datasets for performance evaluation and comparison. The results show that the graph obtained by our proposed method reflects the structure of the data better than those of the state-of-the-art methods and can effectively improve the performance of spectral clustering.

Optimization of granular-based RBF NN with the aid of reconstructability criterion (Reconstructability criterion을 통한 granular-based RBF NN의 최적화)

  • Park, Ho-Sung;Oh, Sung-Kwun
    • Proceedings of the KIEE Conference
    • /
    • 2009.07a
    • /
    • pp.1899_1900
    • /
    • 2009
  • 본 논문에서는 주어진 데이터의 입자화 특성을 효과적으로 모델 구축에 반영하고자 재구성 평가 기준을 통한 새로운 형태의 입자화 기반 RBF 뉴럴 네트워크를 개발한다. 주어진 데이터들의 입자화 특성을 파악하기 위해서 새로운 형태의 FCM 클러스터링(-Context-based fuzzy clustering)을 이용한다. 즉, 출력 공간의 입자화 특성은 K-means clustering 방법을 사용한 것에 반해, 입력 공간에서의 정보들은 Context-based fuzzy clustering 방법을 이용하여 효율적으로 데이터의 특성을 파악하여 모델의 구축에 반영하였으며, 또한 모델의 최적화를 위하여 RBF 뉴럴 네트워크의 은닉층의 수를 재구성 평가 기준을 통하여 모델의 최적화를 꾀하였다. 제안된 모델의 효율적인 특성을 보여주기 위해 저차원 합성 데이터를 이용하여 모델을 평가한다.

  • PDF

A Study on Data Clustering of Light Buoy Using DBSCAN(I) (DBSCAN을 이용한 등부표 위치 데이터 Clustering 연구(I))

  • Gwang-Young Choi;So-Ra Kim;Sang-Won Park;Chae-Uk Song
    • Journal of Navigation and Port Research
    • /
    • v.47 no.4
    • /
    • pp.231-238
    • /
    • 2023
  • The position of a light buoy is always flexible due to the influence of external forces such as tides and wind. The position can be checked through AIS (Automatic Identification System) or RTU (Remote Terminal Unit) for AtoN. As a result of analyzing the position data for the last five years (2017-2021) of a light buoy, the average position error was 15.4%. It is necessary to detect position error data and obtain refined position data to prevent navigation safety accidents and management. This study aimed to detect position error data and obtain refined position data by DBSCAN Clustering position data obtained through AIS or RTU for AtoN. For this purpose, 21 position data of Gunsan Port No. 1 light buoy where RTU was installed among western waters with the most position errors were DBSCAN clustered using Python library. The minPts required for DBSCAN Clustering applied the value commonly used for two-dimensional data. Epsilon was calculated and its value was applied using the k-NN (nearest neighbor) algorithm. As a result of DBSCAN Clustering, position error data that did not satisfy minPts and epsilon were detected and refined position data were acquired. This study can be used as asic data for obtaining reliable position data of a light buoy installed with AIS or RTU for AtoN. It is expected to be of great help in preventing navigation safety accidents.

The Comparison of Neural Network and k-NN Algorithm for News Article Classification (신경망 또는 k-NN에 의한 신문 기사 분류와 그의 성능 비교)

  • 조태호
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 1998.10c
    • /
    • pp.363-365
    • /
    • 1998
  • 텍스트 마이닝(Text Mining)이란 텍스트형태의 문서들의 패턴 또는 관계를 추출하여 사용자가 원하는 새로운 정보를 가공하거나 기존의 정보를 변형하는 과정을 말한다. 텍스트 마이닝의 기능에는 문서 범주화(Document Categorization), 문서 군집화(Document Clustering), 그리고 문서 요약(Document Summarization)이 이에 해당된다. 문서 범주화란 문서에게 사전에 정의한 범주를 부여하는 과정을 말하고, 문서 군집화란 문서들을 계층적 구조로 형성하는 과정을 말하고, 문서 요약이란 문서의 전체 내용을 대표할 수 있는 내용의 일부만을 추출하는 과정을 말한다. 이 논문에서는 문서 범주화만을 다룰 것이며 그 대상으로는 신문기사로 설정하였다. 그의 범주는 4가지로 정치, 경제, 스포츠, 그리고 정보통신으로 설정하였다. 문서 범주화는 문서 분류(Document Classification)라고도 하며 문서에 범주를 자동으로 부여하여 기존에 인위적으로 부여함으로써 소요되는 시간과 비용을 절감하는 것이 목적이다. 문서 범주화에 대하여 k-NN(k-Nearest Neighbor)와 신경망을 이용하였으며, 신경망을 이용한 경우가 k-NN을 이용한 경우보다 성능이 우수하였다.

  • PDF