• Title/Summary/Keyword: K-nearest neighbor classification

Search Result 182, Processing Time 0.024 seconds

A personalized exercise recommendation system using dimension reduction algorithms

  • Lee, Ha-Young;Jeong, Ok-Ran
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.6
    • /
    • pp.19-28
    • /
    • 2021
  • Nowadays, interest in health care is increasing due to Coronavirus (COVID-19), and a lot of people are doing home training as there are more difficulties in using fitness centers and public facilities that are used together. In this paper, we propose a personalized exercise recommendation algorithm using personalized propensity information to provide more accurate and meaningful exercise recommendation to home training users. Thus, we classify the data according to the criteria for obesity with a k-nearest neighbor algorithm using personal information that can represent individuals, such as eating habits information and physical conditions. Furthermore, we differentiate the exercise dataset by the level of exercise activities. Based on the neighborhood information of each dataset, we provide personalized exercise recommendations to users through a dimensionality reduction algorithm (SVD) among model-based collaborative filtering methods. Therefore, we can solve the problem of data sparsity and scalability of memory-based collaborative filtering recommendation techniques and we verify the accuracy and performance of the proposed algorithms.

A Comparative Study of Prediction Models for College Student Dropout Risk Using Machine Learning: Focusing on the case of N university (머신러닝을 활용한 대학생 중도탈락 위험군의 예측모델 비교 연구 : N대학 사례를 중심으로)

  • So-Hyun Kim;Sung-Hyoun Cho
    • Journal of The Korean Society of Integrative Medicine
    • /
    • v.12 no.2
    • /
    • pp.155-166
    • /
    • 2024
  • Purpose : This study aims to identify key factors for predicting dropout risk at the university level and to provide a foundation for policy development aimed at dropout prevention. This study explores the optimal machine learning algorithm by comparing the performance of various algorithms using data on college students' dropout risks. Methods : We collected data on factors influencing dropout risk and propensity were collected from N University. The collected data were applied to several machine learning algorithms, including random forest, decision tree, artificial neural network, logistic regression, support vector machine (SVM), k-nearest neighbor (k-NN) classification, and Naive Bayes. The performance of these models was compared and evaluated, with a focus on predictive validity and the identification of significant dropout factors through the information gain index of machine learning. Results : The binary logistic regression analysis showed that the year of the program, department, grades, and year of entry had a statistically significant effect on the dropout risk. The performance of each machine learning algorithm showed that random forest performed the best. The results showed that the relative importance of the predictor variables was highest for department, age, grade, and residence, in the order of whether or not they matched the school location. Conclusion : Machine learning-based prediction of dropout risk focuses on the early identification of students at risk. The types and causes of dropout crises vary significantly among students. It is important to identify the types and causes of dropout crises so that appropriate actions and support can be taken to remove risk factors and increase protective factors. The relative importance of the factors affecting dropout risk found in this study will help guide educational prescriptions for preventing college student dropout.

A Study on Adaptive Learning Model for Performance Improvement of Stream Analytics (실시간 데이터 분석의 성능개선을 위한 적응형 학습 모델 연구)

  • Ku, Jin-Hee
    • Journal of Convergence for Information Technology
    • /
    • v.8 no.1
    • /
    • pp.201-206
    • /
    • 2018
  • Recently, as technologies for realizing artificial intelligence have become more common, machine learning is widely used. Machine learning provides insight into collecting large amounts of data, batch processing, and taking final action, but the effects of the work are not immediately integrated into the learning process. In this paper proposed an adaptive learning model to improve the performance of real-time stream analysis as a big business issue. Adaptive learning generates the ensemble by adapting to the complexity of the data set, and the algorithm uses the data needed to determine the optimal data point to sample. In an experiment for six standard data sets, the adaptive learning model outperformed the simple machine learning model for classification at the learning time and accuracy. In particular, the support vector machine showed excellent performance at the end of all ensembles. Adaptive learning is expected to be applicable to a wide range of problems that need to be adaptively updated in the inference of changes in various parameters over time.

Classifying Severity of Senior Driver Accidents In Capital Regions Based on Machine Learning Algorithms (머신러닝 기반의 수도권 지역 고령운전자 차대사람 사고심각도 분류 연구)

  • Kim, Seunghoon;Lym, Youngbin;Kim, Ki-Jung
    • Journal of Digital Convergence
    • /
    • v.19 no.4
    • /
    • pp.25-31
    • /
    • 2021
  • Moving toward an aged society, traffic accidents involving elderly drivers have also attracted broader public attention. A rapid increase of senior involvement in crashes calls for developing appropriate crash-severity prediction models specific to senior drivers. In that regard, this study leverages machine learning (ML) algorithms so as to predict the severity of vehicle-pedestrian collisions induced by elderly drivers. Specifically, four ML algorithms (i.e., Logistic model, K-nearest Neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM)) have been developed and compared. Our results show that Logistic model and SVM have outperformed their rivals in terms of the overall prediction accuracy, while precision measure exhibits in favor of RF. We also clarify that driver education and technology development would be effective countermeasures against severity risks of senior driver-induced collisions. These allow us to support informed decision making for policymakers to enhance public safety.

Inhalation Configuration Detection for COVID-19 Patient Secluded Observing using Wearable IoTs Platform

  • Sulaiman Sulmi Almutairi;Rehmat Ullah;Qazi Zia Ullah;Habib Shah
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.6
    • /
    • pp.1478-1499
    • /
    • 2024
  • Coronavirus disease (COVID-19) is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus. COVID-19 become an active epidemic disease due to its spread around the globe. The main causes of the spread are through interaction and transmission of the droplets through coughing and sneezing. The spread can be minimized by isolating the susceptible patients. However, it necessitates remote monitoring to check the breathing issues of the patient remotely to minimize the interactions for spread minimization. Thus, in this article, we offer a wearable-IoTs-centered framework for remote monitoring and recognition of the breathing pattern and abnormal breath detection for timely providing the proper oxygen level required. We propose wearable sensors accelerometer and gyroscope-based breathing time-series data acquisition, temporal features extraction, and machine learning algorithms for pattern detection and abnormality identification. The sensors provide the data through Bluetooth and receive it at the server for further processing and recognition. We collect the six breathing patterns from the twenty subjects and each pattern is recorded for about five minutes. We match prediction accuracies of all machine learning models under study (i.e. Random forest, Gradient boosting tree, Decision tree, and K-nearest neighbor. Our results show that normal breathing and Bradypnea are the most correctly recognized breathing patterns. However, in some cases, algorithm recognizes kussmaul well also. Collectively, the classification outcomes of Random Forest and Gradient Boost Trees are better than the other two algorithms.

Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis (부도예측을 위한 KNN 앙상블 모형의 동시 최적화)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.139-157
    • /
    • 2016
  • Bankruptcy involves considerable costs, so it can have significant effects on a country's economy. Thus, bankruptcy prediction is an important issue. Over the past several decades, many researchers have addressed topics associated with bankruptcy prediction. Early research on bankruptcy prediction employed conventional statistical methods such as univariate analysis, discriminant analysis, multiple regression, and logistic regression. Later on, many studies began utilizing artificial intelligence techniques such as inductive learning, neural networks, and case-based reasoning. Currently, ensemble models are being utilized to enhance the accuracy of bankruptcy prediction. Ensemble classification involves combining multiple classifiers to obtain more accurate predictions than those obtained using individual models. Ensemble learning techniques are known to be very useful for improving the generalization ability of the classifier. Base classifiers in the ensemble must be as accurate and diverse as possible in order to enhance the generalization ability of an ensemble model. Commonly used methods for constructing ensemble classifiers include bagging, boosting, and random subspace. The random subspace method selects a random feature subset for each classifier from the original feature space to diversify the base classifiers of an ensemble. Each ensemble member is trained by a randomly chosen feature subspace from the original feature set, and predictions from each ensemble member are combined by an aggregation method. The k-nearest neighbors (KNN) classifier is robust with respect to variations in the dataset but is very sensitive to changes in the feature space. For this reason, KNN is a good classifier for the random subspace method. The KNN random subspace ensemble model has been shown to be very effective for improving an individual KNN model. The k parameter of KNN base classifiers and selected feature subsets for base classifiers play an important role in determining the performance of the KNN ensemble model. However, few studies have focused on optimizing the k parameter and feature subsets of base classifiers in the ensemble. This study proposed a new ensemble method that improves upon the performance KNN ensemble model by optimizing both k parameters and feature subsets of base classifiers. A genetic algorithm was used to optimize the KNN ensemble model and improve the prediction accuracy of the ensemble model. The proposed model was applied to a bankruptcy prediction problem by using a real dataset from Korean companies. The research data included 1800 externally non-audited firms that filed for bankruptcy (900 cases) or non-bankruptcy (900 cases). Initially, the dataset consisted of 134 financial ratios. Prior to the experiments, 75 financial ratios were selected based on an independent sample t-test of each financial ratio as an input variable and bankruptcy or non-bankruptcy as an output variable. Of these, 24 financial ratios were selected by using a logistic regression backward feature selection method. The complete dataset was separated into two parts: training and validation. The training dataset was further divided into two portions: one for the training model and the other to avoid overfitting. The prediction accuracy against this dataset was used to determine the fitness value in order to avoid overfitting. The validation dataset was used to evaluate the effectiveness of the final model. A 10-fold cross-validation was implemented to compare the performances of the proposed model and other models. To evaluate the effectiveness of the proposed model, the classification accuracy of the proposed model was compared with that of other models. The Q-statistic values and average classification accuracies of base classifiers were investigated. The experimental results showed that the proposed model outperformed other models, such as the single model and random subspace ensemble model.

A Learning Agent for Automatic Bookmark Classification (북 마크 자동 분류를 위한 학습 에이전트)

  • Kim, In-Cheol;Cho, Soo-Sun
    • The KIPS Transactions:PartB
    • /
    • v.8B no.5
    • /
    • pp.455-462
    • /
    • 2001
  • The World Wide Web has become one of the major services provided through Internet. When searching the vast web space, users use bookmarking facilities to record the sites of interests encountered during the course of navigation. One of the typical problems arising from bookmarking is that the list of bookmarks lose coherent organization when the the becomes too lengthy, thus ceasing to function as a practical finding aid. In order to maintain the bookmark file in an efficient, organized manner, the user has to classify all the bookmarks newly added to the file, and update the folders. This paper introduces our learning agent called BClassifier that automatically classifies bookmarks by analyzing the contents of the corresponding web documents. The chief source for the training examples are the bookmarks already classified into several bookmark folders according to their subject by the user. Additionally, the web pages found under top categories of Yahoo site are collected and included in the training examples for diversifying the subject categories to be represented, and the training examples for these categories as well. Our agent employs naive Bayesian learning method that is a well-tested, probability-based categorizing technique. In this paper, the outcome of some experimentation is also outlined and evaluated. A comparison of naive Bayesian learning method alongside other learning methods such as k-Nearest Neighbor and TFIDF is also presented.

  • PDF

Optimizing Similarity Threshold and Coverage of CBR (사례기반추론의 유사 임계치 및 커버리지 최적화)

  • Ahn, Hyunchul
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.2 no.8
    • /
    • pp.535-542
    • /
    • 2013
  • Since case-based reasoning(CBR) has many advantages, it has been used for supporting decision making in various areas including medical checkup, production planning, customer classification, and so on. However, there are several factors to be set by heuristics when designing effective CBR systems. Among these factors, this study addresses the issue of selecting appropriate neighbors in case retrieval step. As the criterion for selecting appropriate neighbors, conventional studies have used the preset number of neighbors to combine(i.e. k of k-nearest neighbor), or the relative portion of the maximum similarity. However, this study proposes to use the absolute similarity threshold varying from 0 to 1, as the criterion for selecting appropriate neighbors to combine. In this case, too small similarity threshold value may make the model rarely produce the solution. To avoid this, we propose to adopt the coverage, which implies the ratio of the cases in which solutions are produced over the total number of the training cases, and to set it as the constraint when optimizing the similarity threshold. To validate the usefulness of the proposed model, we applied it to a real-world target marketing case of an online shopping mall in Korea. As a result, we found that the proposed model might significantly improve the performance of CBR.

Bayesian Network-Based Analysis on Clinical Data of Infertility Patients (베이지안 망에 기초한 불임환자 임상데이터의 분석)

  • Jung, Yong-Gyu;Kim, In-Cheol
    • The KIPS Transactions:PartB
    • /
    • v.9B no.5
    • /
    • pp.625-634
    • /
    • 2002
  • In this paper, we conducted various experiments with Bayesian networks in order to analyze clinical data of infertility patients. With these experiments, we tried to find out inter-dependencies among important factors playing the key role in clinical pregnancy, and to compare 3 different kinds of Bayesian network classifiers (including NBN, BAN, GBN) in terms of classification performance. As a result of experiments, we found the fact that the most important features playing the key role in clinical pregnancy (Clin) are indication (IND), stimulation, age of female partner (FA), number of ova (ICT), and use of Wallace (ETM), and then discovered inter-dependencies among these features. And we made sure that BAN and GBN, which are more general Bayesian network classifiers permitting inter-dependencies among features, show higher performance than NBN. By comparing Bayesian classifiers based on probabilistic representation and reasoning with other classifiers such as decision trees and k-nearest neighbor methods, we found that the former show higher performance than the latter due to inherent characteristics of clinical domain. finally, we suggested a feature reduction method in which all features except only some ones within Markov blanket of the class node are removed, and investigated by experiments whether such feature reduction can increase the performance of Bayesian classifiers.

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.4
    • /
    • pp.123-132
    • /
    • 2013
  • As the smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyros, ambient light sensor, proximity sensor, and so on, there have been many research works on making use of these sensors to create valuable applications. Human activity recognition is one such application that is motivated by various welfare applications such as the support for the elderly, measurement of calorie consumption, analysis of lifestyles, analysis of exercise patterns, and so on. One of the challenges faced when using the smartphone sensors for activity recognition is that the number of sensors used should be minimized to save the battery power. When the number of sensors used are restricted, it is difficult to realize a highly accurate activity recognizer or a classifier because it is hard to distinguish between subtly different activities relying on only limited information. The difficulty gets especially severe when the number of different activity classes to be distinguished is very large. In this paper, we show that a fairly accurate classifier can be built that can distinguish ten different activities by using only a single sensor data, i.e., the smartphone accelerometer data. The approach that we take to dealing with this ten-class problem is to use the ensemble of nested dichotomy (END) method that transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root of the binary tree, the set of all the classes are split into two subsets of classes by using a binary classifier. At a child node of the tree, a subset of classes is again split into two smaller subsets by using another binary classifier. Continuing in this way, we can obtain a binary tree where each leaf node contains a single class. This binary tree can be viewed as a nested dichotomy that can make multi-class predictions. Depending on how a set of classes are split into two subsets at each node, the final tree that we obtain can be different. Since there can be some classes that are correlated, a particular tree may perform better than the others. However, we can hardly identify the best tree without deep domain knowledge. The END method copes with this problem by building multiple dichotomy trees randomly during learning, and then combining the predictions made by each tree during classification. The END method is generally known to perform well even when the base learner is unable to model complex decision boundaries As the base classifier at each node of the dichotomy, we have used another ensemble classifier called the random forest. A random forest is built by repeatedly generating a decision tree each time with a different random subset of features using a bootstrap sample. By combining bagging with random feature subset selection, a random forest enjoys the advantage of having more diverse ensemble members than a simple bagging. As an overall result, our ensemble of nested dichotomy can actually be seen as a committee of committees of decision trees that can deal with a multi-class problem with high accuracy. The ten classes of activities that we distinguish in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classifying these activities include not only the magnitude of acceleration vector at each time point but also the maximum, the minimum, and the standard deviation of vector magnitude within a time window of the last 2 seconds, etc. For experiments to compare the performance of END with those of other methods, the accelerometer data has been collected at every 0.1 second for 2 minutes for each activity from 5 volunteers. Among these 5,900 ($=5{\times}(60{\times}2-2)/0.1$) data collected for each activity (the data for the first 2 seconds are trashed because they do not have time window data), 4,700 have been used for training and the rest for testing. Although 'Walking Uphill' is often confused with some other similar activities, END has been found to classify all of the ten activities with a fairly high accuracy of 98.4%. On the other hand, the accuracies achieved by a decision tree, a k-nearest neighbor, and a one-versus-rest support vector machine have been observed as 97.6%, 96.5%, and 97.6%, respectively.