• Title/Summary/Keyword: K-NN(K-Nearest Neighbor)

A Study on the Data Fusion for Data Enrichment (데이터 보강을 위한 데이터 통합기법에 관한 연구)

  • 정성석;김순영;김현진
    • The Korean Journal of Applied Statistics / v.17 no.3 / pp.605-617 / 2004
  • One of the most important factors in the data mining process is the quality of the data used. When mining is performed on high-quality data, the potential value of data mining increases. In this paper, we propose a data fusion technique for data enrichment, a phase of the KDD process that can improve data quality. We add the k-NN technique to the regression technique to improve the performance of data fusion by reducing the loss of information. Simulations were performed to compare the proposed data fusion technique with the regression technique. As a result, the proposed data fusion technique shows a low MSE for continuous fusion variables.

Development of Traffic Prediction and Optimal Traffic Control System for Highway based on Cell Transmission Model in Cloud Environment (Cell Transmission Model 시뮬레이션을 기반으로 한 클라우드 환경 아래에서의 고속도로 교통 예측 및 최적 제어 시스템 개발)

  • Tak, Se-hyun;Yeo, Hwasoo
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.15 no.4 / pp.68-80 / 2016
  • This study proposes a traffic prediction and optimal traffic control system based on the cell transmission model and a genetic algorithm in a cloud environment. The proposed prediction and control system consists of four parts. 1) The data preprocessing module detects and imputes corrupted and missing data points. 2) The data-driven traffic prediction module predicts the future traffic state using the Multi-level K-Nearest Neighbor (MK-NN) algorithm with historical data stored in an SQL database. 3) The online traffic simulation module simulates the future traffic state in various situations, including accidents, road work, and extreme weather conditions, using the traffic data predicted by MK-NN. 4) The optimal road control module produces a control strategy for a large road network using the cell transmission model and a genetic algorithm. The results show that the proposed system can effectively reduce Vehicle Hours Traveled by up to 60%.

Designing Hypothesis of 2-Substituted-N-[4-(1-methyl-4,5-diphenyl-1H-imidazole-2-yl)phenyl] Acetamide Analogs as Anticancer Agents: QSAR Approach

  • Bedadurge, Ajay B.;Shaikh, Anwar R.
    • Journal of the Korean Chemical Society / v.57 no.6 / pp.744-754 / 2013
  • Quantitative structure-activity relationship (QSAR) analysis was performed on recently synthesized imidazole-(benz)azole and imidazole-piperazine derivatives for their anticancer activities against breast (MCF-7) cell lines. The statistically significant 2D-QSAR models ($r^2=0.8901$; $q^2=0.8130$; F test = 36.4635; $r^2$ se = 0.1696; $q^2$ se = 0.12212; pred_$r^2=0.4229$; pred_$r^2$ se = 0.4606 and $r^2=0.8763$; $q^2=0.7617$; F test = 31.8737; $r^2$ se = 0.1951; $q^2$ se = 0.2708; pred_$r^2=0.4386$; pred_$r^2$ se = 0.3950) were developed using the molecular design suite VLifeMDS 4.2. The study was performed with 18 compounds (the data set), which were divided into training and test sets using random selection and manual selection methods. Multiple linear regression (MLR) with the stepwise (SW) forward-backward variable selection method was used to build the QSAR models. The results of the 2D-QSAR models were further compared with 3D-QSAR models generated by kNN-MFA (k-Nearest Neighbor Molecular Field Analysis) to investigate the substitutional requirements for favorable anticancer activity. The results may be useful for designing novel imidazole-(benz)azole and imidazole-piperazine derivatives against breast (MCF-7) cell lines prior to synthesis.

Efficient Processing of k-Farthest Neighbor Queries for Road Networks

  • Kim, Taelee;Cho, Hyung-Ju;Hong, Hee Ju;Nam, Hyogeun;Cho, Hyejun;Do, Gyung Yoon;Jeon, Pilkyu
    • Journal of the Korea Society of Computer and Information / v.24 no.10 / pp.79-89 / 2019
  • While most research focuses on the k-nearest neighbors (kNN) queries in the database community, an important type of proximity queries called k-farthest neighbors (kFN) queries has not received much attention. This paper addresses the problem of finding the k-farthest neighbors in road networks. Given a positive integer k, a query object q, and a set of data points P, a kFN query returns k data objects farthest from the query object q. Little attention has been paid to processing kFN queries in road networks. The challenge of processing kFN queries in road networks is reducing the number of network distance computations, which is the most prominent difference between a road network and a Euclidean space. In this study, we propose an efficient algorithm called FANS for k-FArthest Neighbor Search in road networks. We present a shared computation strategy to avoid redundant computation of the distances between a query object and data objects. We also present effective pruning techniques based on the maximum distance from a query object to data segments. Finally, we demonstrate the efficiency and scalability of our proposed solution with extensive experiments using real-world roadmaps.
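
The kFN query defined in the abstract above can be illustrated with a brute-force baseline. The sketch below is not the FANS algorithm: it uses none of the paper's shared computation or pruning techniques, and it simply runs Dijkstra's algorithm once from the query object and keeps the k data objects with the largest network distance. The adjacency-list graph format, the function names, and the assumption that data objects sit on graph nodes are all hypothetical choices made for illustration.

```python
import heapq


def network_distances(graph, source):
    """Dijkstra: shortest network distance from source to each reachable node.

    graph maps a node to a list of (neighbor, edge_length) pairs
    (a hypothetical format chosen for this sketch).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def k_farthest_neighbors(graph, q, points, k):
    """Return the k data objects in points farthest from q by network distance.

    Brute-force baseline: data objects are assumed to lie on graph nodes,
    and objects unreachable from q are ignored.
    """
    dist = network_distances(graph, q)
    reachable = [p for p in points if p in dist]
    return heapq.nlargest(k, reachable, key=lambda p: dist[p])


# Small undirected example network.
roads = {
    "a": [("b", 1), ("d", 10)],
    "b": [("a", 1), ("c", 2)],
    "c": [("b", 2), ("d", 3)],
    "d": [("c", 3), ("a", 10)],
}
result = k_farthest_neighbors(roads, "a", ["b", "c", "d"], 2)
```

In this example the network distance from `a` to `d` is 6 (via `b` and `c`) rather than the direct edge length of 10, which shows why network distances, not straight-line distances, must be computed for road networks.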

Feature Selection to Predict Very Short-term Heavy Rainfall Based on Differential Evolution (미분진화 기반의 초단기 호우예측을 위한 특징 선택)

  • Seo, Jae-Hyun;Lee, Yong Hee;Kim, Yong-Hyuk
    • Journal of the Korean Institute of Intelligent Systems / v.22 no.6 / pp.706-714 / 2012
  • The Korea Meteorological Administration provided four recent years of weather records for our very short-term heavy rainfall prediction. We divided the dataset into three parts: training, validation, and test sets. Through feature selection, we selected only the important features among 72 features to avoid the exponential growth of the solution space with dimensionality. We used a differential evolution algorithm, with two classifiers serving as the fitness function of the evolutionary computation, to select a more accurate feature subset. One classifier is the Support Vector Machine (SVM), which shows high performance, and the other is k-Nearest Neighbor (k-NN), which is generally fast. In our experiments, the test results of SVM were better than those of k-NN. We also processed the weather data using undersampling and normalization techniques. The test results of our differential evolution algorithm were about five times better than those using all features and about 1.36 times better than those using a genetic algorithm, which is the best known. Running times with a genetic algorithm were about twenty times longer than those with the differential evolution algorithm.

Korean Sentence Boundary Detection Using Memory-based Machine Learning (메모리 기반의 기계 학습을 이용한 한국어 문장 경계 인식)

  • Han Kun-Heui;Lim Heui-Seok
    • The Journal of the Korea Contents Association / v.4 no.4 / pp.133-139 / 2004
  • This paper proposes a Korean sentence boundary detection system that employs the k-nearest neighbor algorithm. We propose three scoring functions to classify sentence boundaries and perform a comparative analysis. We use domain-independent linguistic features to make the system general and robust. The proposed system was trained and evaluated on two corpora: the ETRI corpus and the KAIST corpus. In the experiments, the proposed system achieved about $98.82\%$ precision and $99.09\%$ recall even though it was trained on a relatively small corpus.

Identification of Differentially Expressed Genes Using Tests Based on Multiple Imputations

  • Kim, Sang Cheol;Yu, Donghyeon
    • Quantitative Bio-Science / v.36 no.1 / pp.23-31 / 2017
  • Datasets from DNA microarray experiments, which take the form of large matrices of gene expression levels, often have missing values. However, existing statistical methods, including principal component analysis (PCA) and Hotelling's t-test, are not directly applicable to datasets with missing values because they generally assume that the observed dataset is complete. Many methods have been proposed in the literature to impute missing values in the observed data. Troyanskaya et al. [1] study k-nearest neighbor (kNN) imputation, Kim et al. [2] propose the local least squares (LLS) method, and Rubin [3] proposes multiple imputation (MI) for missing values. To identify differentially expressed genes, we propose a new testing procedure for observed data with missing values. The proposed procedure uses Stouffer's z-scores and combines the test results of the individual imputed samples, which are dependent on each other. We show numerically that the proposed test procedure based on MI performs better than existing test procedures based on single imputation (SI) by comparing their ROC curves. We apply the proposed method to a public microarray dataset.
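
As a rough illustration of the combination step mentioned in the abstract above, the sketch below implements the classical Stouffer combination of z-scores, with an optional correlation matrix to account for dependence between the individual tests. The abstract does not say how the paper handles the dependence among imputed samples, so the correlation adjustment here is an assumption and is not the authors' procedure. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def stouffer_z(z_scores, corr=None):
    """Combine per-test z-scores into a single z-score.

    With corr=None the tests are treated as independent and the sum is
    divided by sqrt(m). If a correlation matrix is supplied (an assumption
    of this sketch), the sum is divided by the square root of the sum of
    all its entries, which is the variance of the summed z-scores.
    """
    m = len(z_scores)
    if corr is None:
        variance = float(m)
    else:
        variance = sum(corr[i][j] for i in range(m) for j in range(m))
    return sum(z_scores) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical z-scores for one gene from four imputed datasets.
combined = stouffer_z([1.0, 2.0, 3.0, 2.0])
p_value = two_sided_p(combined)
```

Treating dependent z-scores as independent overstates the evidence. In the extreme case of perfectly correlated tests, the adjusted combination returns the common z-score rather than inflating it.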

Comparison of Classification and Convolution algorithm in Condition assessment of the Failure Modes in Rotational equipments with varying speed (회전수가 변하는 기기의 상태 진단에 있어서 특성 기반 분류 알고리즘과 합성곱 기반 알고리즘의 예측 정확도 비교)

  • Ki-Yeong Moon;Se-Yun Hwang;Jang-Hyun Lee
    • Proceedings of the Korean Institute of Navigation and Port Research Conference / 2022.06a / pp.301-301 / 2022
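  • This study addresses the application of artificial intelligence algorithms to determine whether equipment whose rotational speed varies with operating conditions is running normally and to identify the type of failure. Because the condition-monitoring sensor signals measured from equipment with varying rotational speed are non-stationary, threshold values of the condition signal are difficult to use as a criterion for fault identification, and this study aims to resolve that problem. To determine whether operation is normal, an autoencoder, which is efficient for anomaly detection, and machine learning algorithms were applied. To identify the failure type, machine learning methods and a convolution-based deep learning method were applied. Examples show that feature vectors can be constructed in the time and frequency domains so that the non-stationary time series of frequencies linked to the varying rotational speed can also be represented by appropriate fault features. By applying dimensionality reduction and the chi-square technique to extract optimal feature vectors, we show that machine learning classification algorithms can be used to predict failures of equipment with non-stationary rotational signals. In this process, the machine learning algorithms k-NN (k-Nearest Neighbor), SVM (Support Vector Machine), and Random Forest were applied. In addition, the results of anomaly detection and fault diagnosis using a time-series-based autoencoder and a CNN (Convolutional Neural Network) are compared and presented.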

Simultaneous Motion Recognition Framework using Data Augmentation based on Muscle Activation Model (근육 활성화 모델 기반의 데이터 증강을 활용한 동시 동작 인식 프레임워크)

  • Sejin Kim;Wan Kyun Chung
    • The Journal of Korea Robotics Society / v.19 no.2 / pp.203-212 / 2024
  • Simultaneous motion is essential in the activities of daily living (ADL). For motion intention recognition, surface electromyogram (sEMG) signals and the corresponding motion labels are necessary. However, acquiring them is time-consuming and may increase the burden on the user. Therefore, we propose a simultaneous motion recognition framework using data augmentation based on a muscle activation model. The model consists of multiple point sources to be optimized, while the number of point sources and their initial parameters are determined automatically. The experimental results show that the framework generates data similar to the real data. This similarity is quantified with two metrics: the structural similarity index measure (SSIM) and the mean squared error (MSE). Furthermore, with k-nearest neighbor (k-NN) or support vector machine (SVM) classifiers, the classification accuracy is also improved by the proposed framework. From these results, it can be concluded that the generalization property of the training data is enhanced and the classification accuracy increases accordingly. We expect this framework to relieve the user of the burden of excessive and time-consuming data acquisition.

Development of methodology for daily rainfall simulation considering distribution of rainfall events in each duration (강우사상의 지속기간별 분포 특성을 고려한 일강우 모의 기법 개발)

  • Jung, Jaewon;Kim, Soojun;Kim, Hung Soo
    • Journal of Korea Water Resources Association / v.52 no.2 / pp.141-148 / 2019
  • When simulating daily rainfall amounts with the existing Markov chain model, it is common to simulate rainfall occurrence and then estimate the rainfall amount randomly, using Monte Carlo simulation, from a distribution similar to the daily rainfall distribution. A limitation of this approach is that the characteristics of rainfall intensity and its distribution over time according to rainfall duration are not reflected in the results. In this study, rainfall events are classified into 1-day, 2-day, 3-day, and 4-day events, and the rainfall amount is estimated by rainfall duration. Specifically, the distribution of the total rainfall amount of events of each duration is set using Kernel Density Estimation (KDE), and rainfall amounts are estimated from the distribution for each duration. The total rainfall amount determined for each event is then divided into daily rainfall amounts according to the daily distribution pattern of the observed rainfall event with the most similar rainfall amount, found using the k-Nearest Neighbor (KNN) algorithm. This study aims to overcome the limitations of the existing rainfall estimation method, and the results are expected to be useful for future rainfall estimation and as primary data in water resource design.
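
The disaggregation step described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The simulated event total is taken as given (the KDE sampling step is omitted), observed events of the same duration are supplied as tuples of daily amounts, and the daily pattern is borrowed from the observed event whose total is closest. The function name, the data layout, and the option of choosing randomly among k > 1 nearest events are hypothetical.

```python
import random


def disaggregate_event(total, observed_events, k=1, rng=random):
    """Split a simulated event total into daily rainfall amounts.

    observed_events: observed events of the same duration, each a tuple of
    daily amounts with a positive sum (hypothetical data layout).
    With k=1 the pattern of the single most similar event is used. With
    k > 1 one of the k nearest events is chosen at random, which is an
    assumption of this sketch.
    """
    candidates = [ev for ev in observed_events if sum(ev) > 0]
    ranked = sorted(candidates, key=lambda ev: abs(sum(ev) - total))
    neighbor = rng.choice(ranked[:k])
    neighbor_total = sum(neighbor)
    # Keep the neighbor's day-to-day proportions, rescaled to the new total.
    return [total * day / neighbor_total for day in neighbor]


# Hypothetical observed 2-day events (mm per day).
observed = [(10.0, 30.0), (50.0, 50.0), (5.0, 5.0)]
daily = disaggregate_event(44.0, observed)
```

Here the observed event with total 40 mm is the nearest to the simulated total of 44 mm, so its 25%/75% split across the two days is applied to the new total.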