• Title/Summary/Keyword: K-Nearest Neighbor(KNN)

Search Result 84, Processing Time 0.021 seconds

K Nearest Neighbor Joins for Big Data Processing based on Spark (Spark 기반 빅데이터 처리를 위한 K-최근접 이웃 연결)

  • JIAQI, JI;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.21 no.9
    • /
    • pp.1731-1737
    • /
    • 2017
  • K Nearest Neighbor Join (KNN Join) is a simple yet effective method in machine learning. It is widely used in small dataset of the past time. As the number of data increases, it is infeasible to run this model on an actual application by a single machine due to memory and time restrictions. Nowadays a popular batch process model called MapReduce which can run on a cluster with a large number of computers is widely used for large-scale data processing. Hadoop is a framework to implement MapReduce, but its performance can be further improved by a new framework named Spark. In the present study, we will provide a KNN Join implement based on Spark. With the advantage of its in-memory calculation capability, it will be faster and more effective than Hadoop. In our experiments, we study the influence of different factors on running time and demonstrate robustness and efficiency of our approach.

Plurality Rule-based Density and Correlation Coefficient-based Clustering for K-NN

  • Aung, Swe Swe;Nagayama, Itaru;Tamaki, Shiro
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.6 no.3
    • /
    • pp.183-192
    • /
    • 2017
  • k-nearest neighbor (K-NN) is a well-known classification algorithm, being feature space-based on nearest-neighbor training examples in machine learning. However, K-NN, as we know, is a lazy learning method. Therefore, if a K-NN-based system very much depends on a huge amount of history data to achieve an accurate prediction result for a particular task, it gradually faces a processing-time performance-degradation problem. We have noticed that many researchers usually contemplate only classification accuracy. But estimation speed also plays an essential role in real-time prediction systems. To compensate for this weakness, this paper proposes correlation coefficient-based clustering (CCC) aimed at upgrading the performance of K-NN by leveraging processing-time speed and plurality rule-based density (PRD) to improve estimation accuracy. For experiments, we used real datasets (on breast cancer, breast tissue, heart, and the iris) from the University of California, Irvine (UCI) machine learning repository. Moreover, real traffic data collected from Ojana Junction, Route 58, Okinawa, Japan, was also utilized to lay bare the efficiency of this method. By using these datasets, we proved better processing-time performance with the new approach by comparing it with classical K-NN. Besides, via experiments on real-world datasets, we compared the prediction accuracy of our approach with density peaks clustering based on K-NN and principal component analysis (DPC-KNN-PCA).

Performance Improvement of Automatic Basal Cell Carcinoma Detection Using Half Hanning Window (Half Hanning 윈도우 전처리를 통한 기저 세포암 자동 검출 성능 개선)

  • Park, Aa-Ron;Baek, Seong-Joong;Min, So-Hee;You, Hong-Yoen;Kim, Jin-Young;Hong, Sung-Hoon
    • The Journal of the Korea Contents Association
    • /
    • v.6 no.12
    • /
    • pp.105-112
    • /
    • 2006
  • In this study, we propose a simple preprocessing method for classification of basal cell carcinoma (BCC), which is one of the most common skin cancer. The preprocessing step consists of data clipping with a half Hanning window and dimension reduction with principal components analysis (PCA). The application of the half Hanning window deemphasizes the peak near $1650cm^{-1}$ and improves classification performance by lowering the false negative ratio. Classification results with various classifiers are presented to show the effectiveness of the proposed method. The classifiers include maximum a posteriori probability (MAP), k-nearest neighbor (KNN), probabilistic neural network (PNN), multilayer perceptron(MLP), support vector machine (SVM) and minimum squared error (MSE) classification. Classification results with KNN involving 216 spectra preprocessed with the proposed method gave 97.3% sensitivity, which is very promising results for automatic BCC detection.

  • PDF

A comparative study of the performance of machine learning algorithms to detect malicious traffic in IoT networks (IoT 네트워크에서 악성 트래픽을 탐지하기 위한 머신러닝 알고리즘의 성능 비교연구)

  • Hyun, Mi-Jin
    • Journal of Digital Convergence
    • /
    • v.19 no.9
    • /
    • pp.463-468
    • /
    • 2021
  • Although the IoT is showing explosive growth due to the development of technology and the spread of IoT devices and activation of services, serious security risks and financial damage are occurring due to the activities of various botnets. Therefore, it is important to accurately and quickly detect the activities of these botnets. As security in the IoT environment has characteristics that require operation with minimum processing performance and memory, in this paper, the minimum characteristics for detection are selected, and KNN (K-Nearest Neighbor), Naïve Bayes, Decision Tree, Random A comparative study was conducted on the performance of machine learning algorithms such as Forest to detect botnet activity. Experimental results using the Bot-IoT dataset showed that KNN can detect DDoS, DoS, and Reconnaissance attacks most effectively and efficiently among the applied machine learning algorithms.

Three Dimensional Object Recognition using PCA and KNN (peA 와 KNN를 이용한 3차원 물체인식)

  • Lee, Kee-Jun
    • The Journal of the Korea Contents Association
    • /
    • v.9 no.8
    • /
    • pp.57-63
    • /
    • 2009
  • Object recognition technologies using PCA(principal component analysis) recognize objects by deciding representative features of objects in the model image, extracting feature vectors from objects in a image and measuring the distance between them and object representation. Given frequent recognition problems associated with the use of point-to-point distance approach, this study adopted the k-nearest neighbor technique(class-to-class) in which a group of object models of the same class is used as recognition unit for the images in-putted on a continual input image. However, the robustness of recognition strategies using PCA depends on several factors, including illumination. When scene constancy is not secured due to varying illumination conditions, the learning performance the feature detector can be compromised, undermining the recognition quality. This paper proposes a new PCA recognition in which database of objects can be detected under different illuminations between input images and the model images.

Malware Detection Method using Opcode and windows API Calls (Opcode와 Windows API를 사용한 멀웨어 탐지)

  • Ahn, Tae-Hyun;Oh, Sang-Jin;Kwon, Young-Man
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.17 no.6
    • /
    • pp.11-17
    • /
    • 2017
  • We proposed malware detection method, which use the feature vector that consist of Opcode(operation code) and Windows API Calls extracted from executable files. And, we implemented our feature vector and measured the performance of it by using Bernoulli Naïve Bayes and K-Nearest Neighbor classifier. In experimental result, when using the K-NN classifier with the proposed method, we obtain 95.21% malware detection accuracy. It was better than existing methods using only either Opcode or Windows API Calls.

Development of Freeway Traffic Incident Clearance Time Prediction Model by Accident Level (사고등급별 고속도로 교통사고 처리시간 예측모형 개발)

  • LEE, Soong-bong;HAN, Dong Hee;LEE, Young-Ihn
    • Journal of Korean Society of Transportation
    • /
    • v.33 no.5
    • /
    • pp.497-507
    • /
    • 2015
  • Nonrecurrent congestion of freeway was primarily caused by incident. The main cause of incident was known as a traffic accident. Therefore, accurate prediction of traffic incident clearance time is very important in accident management. Traffic accident data on freeway during year 2008 to year 2014 period were analyzed for this study. KNN(K-Nearest Neighbor) algorithm was hired for developing incident clearance time prediction model with the historical traffic accident data. Analysis result of accident data explains the level of accident significantly affect on the incident clearance time. For this reason, incident clearance time was categorized by accident level. Data were sorted by classification of traffic volume, number of lanes and time periods to consider traffic conditions and roadway geometry. Factors affecting incident clearance time were analyzed from the extracted data for identifying similar types of accident. Lastly, weight of detail factors was calculated in order to measure distance metric. Weight was calculated with applying standard method of normal distribution, then incident clearance time was predicted. Prediction result of model showed a lower prediction error(MAPE) than models of previous studies. The improve model developed in this study is expected to contribute to the efficient highway operation management when incident occurs.

An Improved Preliminary Cut-off Indoor Positioning Scheme in Case of No Neighborhood Reference Point (이웃 참조 위치가 없는 경우를 개선한 실내 위치 추정 사전 컷-오프 방식)

  • Park, Byoungkwan;Kim, Dongjun;Son, Jooyoung;Choi, Jongmin
    • Journal of Korea Multimedia Society
    • /
    • v.20 no.1
    • /
    • pp.74-81
    • /
    • 2017
  • In learning stage of the preliminary Cut-off indoor positioning scheme, RSSI and UUID data received from beacons at each reference point(RP) are stored in fingerprint map. The fingerprint map and real-time beacon information are compared to identify the nearest K reference points through which the user position is estimated. If the number of K is zero, this scheme cannot estimate user position. We have improved the preliminary Cut-off scheme to get the estimated user position even in the case. The improved scheme excludes the beacon of the weakest signal received by user mobile device and identifies neighborhood reference points using the other beacon information. This procedure are performed repetitively until K > 0. The simulation results confirm that the proposed scheme outperforms K-Nearest-Neighbor (KNN), Cluster KNN and the conventional Cut-off scheme in terms of accuracy while the constraints are guaranteed to be satisfied.

Classification of Traffic Flows into QoS Classes by Unsupervised Learning and KNN Clustering

  • Zeng, Yi;Chen, Thomas M.
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.3 no.2
    • /
    • pp.134-146
    • /
    • 2009
  • Traffic classification seeks to assign packet flows to an appropriate quality of service(QoS) class based on flow statistics without the need to examine packet payloads. Classification proceeds in two steps. Classification rules are first built by analyzing traffic traces, and then the classification rules are evaluated using test data. In this paper, we use self-organizing map and K-means clustering as unsupervised machine learning methods to identify the inherent classes in traffic traces. Three clusters were discovered, corresponding to transactional, bulk data transfer, and interactive applications. The K-nearest neighbor classifier was found to be highly accurate for the traffic data and significantly better compared to a minimum mean distance classifier.

A Study on the Weight of W-KNN for WiFi Fingerprint Positioning (WiFi 핑거프린트 위치추정 방식에서 W-KNN의 가중치에 관한 연구)

  • Oh, Jongtaek
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.17 no.6
    • /
    • pp.105-111
    • /
    • 2017
  • In this paper, the analysis results are shown about several weights of Weighted K-Nearest Neighbor method, Recently, it is employed for the indoor positioning technologies using WiFi fingerprint which has been actively studied. In spite of the simplest feature, the W-KNN method shows comparable performance to another methods using WiFi fingerprint technology. So W-KNN method has employed in the existing indoor positioning system. It shows positioning error performance according to data preprocessing and weight factor, and the analysis on the weight is very important. In this paper, based on the real measured WiFi fingerprint data, the estimation error is analyzed and the performances are compared, for the case of data processing methods, of the weight of average, variance, and distance, and of the averaging several position of number K. These results could be practically useful to construct the real indoor positioning system.