• Title/Summary/Keyword: 불균형데이터 처리

Search Result 115, Processing Time 0.025 seconds

Improvement of early prediction performance of under-performing students using anomaly data (이상 데이터를 활용한 성과부진학생의 조기예측성능 향상)

  • Hwang, Chul-Hyun
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.11
    • /
    • pp.1608-1614
    • /
    • 2022
  • As competition between universities intensifies due to the recent decrease in the number of students, it is recognized as an essential task of universities to predict students who are underperforming at an early stage and to make various efforts to prevent dropouts. For this, a high-performance model that accurately predicts student performance is essential. This paper proposes a method to improve prediction performance by removing or amplifying abnormal data in a classification prediction model for identifying underperforming students. Existing anomaly data processing methods have mainly focused on deleting or ignoring data, but this paper presents a criterion to distinguish noise from change indicators, and contributes to improving the performance of predictive models by deleting or amplifying data. In an experiment using open learning performance data for verification of the proposed method, we found a number of cases in which the proposed method can improve classification performance compared to the existing method.

Effective Harmony Search-Based Optimization of Cost-Sensitive Boosting for Improving the Performance of Cross-Project Defect Prediction (교차 프로젝트 결함 예측 성능 향상을 위한 효과적인 하모니 검색 기반 비용 민감 부스팅 최적화)

  • Ryu, Duksan;Baik, Jongmoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.3
    • /
    • pp.77-90
    • /
    • 2018
  • Software Defect Prediction (SDP) is a field of study that identifies defective modules. With insufficient local data, a company can exploit Cross-Project Defect Prediction (CPDP), a way to build a classifier using dataset collected from other companies. Most machine learning algorithms for SDP have used more than one parameter that significantly affects prediction performance depending on different values. The objective of this study is to propose a parameter selection technique to enhance the performance of CPDP. Using a Harmony Search algorithm (HS), our approach tunes parameters of cost-sensitive boosting, a method to tackle class imbalance causing the difficulty of prediction. According to distributional characteristics, parameter ranges and constraint rules between parameters are defined and applied to HS. The proposed approach is compared with three CPDP methods and a Within-Project Defect Prediction (WPDP) method over fifteen target projects. The experimental results indicate that the proposed model outperforms the other CPDP methods in the context of class imbalance. Unlike the previous researches showing high probability of false alarm or low probability of detection, our approach provides acceptable high PD and low PF while providing high overall performance. It also provides similar performance compared with WPDP.

Online Hard Example Mining for Training One-Stage Object Detectors (단-단계 물체 탐지기 학습을 위한 고난도 예들의 온라인 마이닝)

  • Kim, Incheol
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.5
    • /
    • pp.195-204
    • /
    • 2018
  • In this paper, we propose both a new loss function and an online hard example mining scheme for improving the performance of single-stage object detectors which use deep convolutional neural networks. The proposed loss function and the online hard example mining scheme can not only overcome the problem of imbalance between the number of annotated objects and the number of background examples, but also improve the localization accuracy of each object. Therefore, the loss function and the mining scheme can provide intrinsically fast single-stage detectors with detection performance higher than or similar to that of two-stage detectors. In experiments conducted with the PASCAL VOC 2007 benchmark dataset, we show that the proposed loss function and the online hard example mining scheme can improve the performance of single-stage object detectors.

Autoencoder Based N-Segmentation Frequency Domain Anomaly Detection for Optimization of Facility Defect Identification (설비 결함 식별 최적화를 위한 오토인코더 기반 N 분할 주파수 영역 이상 탐지)

  • Kichang Park;Yongkwan Lee
    • The Transactions of the Korea Information Processing Society
    • /
    • v.13 no.3
    • /
    • pp.130-139
    • /
    • 2024
  • Artificial intelligence models are being used to detect facility anomalies using physics data such as vibration, current, and temperature for predictive maintenance in the manufacturing industry. Since the types of facility anomalies, such as facility defects and failures, anomaly detection methods using autoencoder-based unsupervised learning models have been mainly applied. Normal or abnormal facility conditions can be effectively classified using the reconstruction error of the autoencoder, but there is a limit to identifying facility anomalies specifically. When facility anomalies such as unbalance, misalignment, and looseness occur, the facility vibration frequency shows a pattern different from the normal state in a specific frequency range. This paper presents an N-segmentation anomaly detection method that performs anomaly detection by dividing the entire vibration frequency range into N regions. Experiments on nine kinds of anomaly data with different frequencies and amplitudes using vibration data from a compressor showed better performance when N-segmentation was applied. The proposed method helps materialize them after detecting facility anomalies.

Learning Text Chunking Using Maximum Entropy Models (최대 엔트로피 모델을 이용한 텍스트 단위화 학습)

  • Park, Seong-Bae;Zhang, Byoung-Tak
    • Annual Conference on Human and Language Technology
    • /
    • 2001.10d
    • /
    • pp.130-137
    • /
    • 2001
  • 최대 엔트로피 모델(maximum entropy model)은 여러 가지 자연언어 문제를 학습하는데 성공적으로 적용되어 왔지만, 두 가지의 주요한 문제점을 가지고 있다. 그 첫번째 문제는 해당 언어에 대한 많은 사전 지식(prior knowledge)이 필요하다는 것이고, 두번째 문제는 계산량이 너무 많다는 것이다. 본 논문에서는 텍스트 단위화(text chunking)에 최대 엔트로피 모델을 적용하는 데 나타나는 이 문제점들을 해소하기 위해 새로운 방법을 제시한다. 사전 지식으로, 간단한 언어 모델로부터 쉽게 생성된 결정트리(decision tree)에서 자동적으로 만들어진 규칙을 사용한다. 따라서, 제시된 방법에서의 최대 엔트로피 모델은 결정트리를 보강하는 방법으로 간주될 수 있다. 계산론적 복잡도를 줄이기 위해서, 최대 엔트로피 모델을 학습할 때 일종의 능동 학습(active learning) 방법을 사용한다. 전체 학습 데이터가 아닌 일부분만을 사용함으로써 계산 비용은 크게 줄어 들 수 있다. 실험 결과, 제시된 방법으로 결정트리의 오류의 수가 반으로 줄었다. 대부분의 자연언어 데이터가 매우 불균형을 이루므로, 학습된 모델을 부스팅(boosting)으로 강화할 수 있다. 부스팅을 한 후 제시된 방법은 전문가에 의해 선택된 자질로 학습된 최대 엔트로피 모델보다 졸은 성능을 보이며 지금까지 보고된 기계 학습 알고리즘 중 가장 성능이 좋은 방법과 비슷한 성능을 보인다 텍스트 단위화가 일반적으로 전체 구문분석의 전 단계이고 이 단계에서의 오류가 다음 단계에서 복구될 수 없으므로 이 성능은 텍스트 단위화에서 매우 의미가 길다.

  • PDF

Comparison of Classification Performance Between Adult and Elderly Using Acoustic and Linguistic Features from Spontaneous Speech (자유대화의 음향적 특징 및 언어적 특징 기반의 성인과 노인 분류 성능 비교)

  • SeungHoon Han;Byung Ok Kang;Sunghee Dong
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.8
    • /
    • pp.365-370
    • /
    • 2023
  • This paper aims to compare the performance of speech data classification into two groups, adult and elderly, based on the acoustic and linguistic characteristics that change due to aging, such as changes in respiratory patterns, phonation, pitch, frequency, and language expression ability. For acoustic features we used attributes related to the frequency, amplitude, and spectrum of speech voices. As for linguistic features, we extracted hidden state vector representations containing contextual information from the transcription of speech utterances using KoBERT, a Korean pre-trained language model that has shown excellent performance in natural language processing tasks. The classification performance of each model trained based on acoustic and linguistic features was evaluated, and the F1 scores of each model for the two classes, adult and elderly, were examined after address the class imbalance problem by down-sampling. The experimental results showed that using linguistic features provided better performance for classifying adult and elderly than using acoustic features, and even when the class proportions were equal, the classification performance for adult was higher than that for elderly.

Development of diet-based personalized nutritional supplement recommendation system (식단 기반 개인 맞춤형 영양제 추천 시스템 개발)

  • Hong, Seong-Jun;Lee, Min-Hee;Jang, Jae-Ri;Jeong, Ha-Eun;Hong, Yu-Ri;Lee, Jee-Hang;Kim, Jin
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2022.11a
    • /
    • pp.359-361
    • /
    • 2022
  • 최근 현대인의 영양불균형이 점점 심화됨에 따라 영양결핍과 비만의 위험도가 점점 증가하고 있다. 이에 따라 건강기능식품에 대한 관심이 증가하여 일반인들의 건강기능식품 소비가 증가하고 있지만, 적정섭취량에 비해 영양소를 과도하게 섭취 중이거나 영양제를 먹지만 정작 필요한 영양소를 섭취하지 못하는 경우가 빈번히 나타나고 있다. 이러한 문제를 해소하고자 본 논문에서는 7 일간 사용자가 섭취한 식단을 기반으로 부족한 영양소를 수치상으로 계산하여 개인 맞춤 영양제를 추천하는 시스템을 제안한다.

Run-time Memory Optimization Algorithm for the DDMB Architecture (DDMB 구조에서의 런타임 메모리 최적화 알고리즘)

  • Cho, Jeong-Hun;Paek, Yun-Heung;Kwon, Soo-Hyun
    • The KIPS Transactions:PartA
    • /
    • v.13A no.5 s.102
    • /
    • pp.413-420
    • /
    • 2006
  • Most vendors of digital signal processors (DSPs) support a Harvard architecture, which has two or more memory buses, one for program and one or more for data and allow the processor to access multiple words of data from memory in a single instruction cycle. We already addressed how to efficiently assign data to multi-memory banks in our previous work. This paper reports on our recent attempt to optimize run-time memory. The run-time environment for dual data memory banks (DBMBs) requires two run-time stacks to control activation records located in two memory banks corresponding to calling procedures. However, activation records of two memory banks for a procedure are able to have different size. As a consequence, dual run-time stacks can be unbalanced whenever a procedure is called. This unbalance between two memory banks causes that usage of one memory bank can exceed the extent of on-chip memory area although there is free area in the other memory bank. We attempt balancing dual run-time slacks to enhance efficiently utilization of on-chip memory in this paper. The experimental results have revealed that although our algorithm is relatively quite simple, it still can utilize run-time memories efficiently; thus enabling our compiler to run extremely fast, yet minimizing the usage of un-time memory in the target code.

A Study on the Link Cost Estimation for Data Reliability in Wireless Sensor Network (무선 센서 네트워크에서 데이터 신뢰성을 위한 링크 비용 산출 방안에 관한 연구)

  • Lee, Dae-hee;Cho, Kyoung-woo;Kang, Chul-gyu;Oh, Chang-heon
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2018.10a
    • /
    • pp.571-573
    • /
    • 2018
  • Wireless sensor networks have unbalanced energy consumption due to the convergence structure in which data is concentrated to sink nodes. To solve this problem, in the previous research, the relay node was placed between the source node and the sink node to merge the data before being concentrated to the sink node. However, selecting a relay node that does not consider the link quality causes packet loss according to the link quality of the reconfigured routing path. Therefore, in this paper, we propose a link cost calculation method for data reliability in routing path reconfiguration for relay node selection. We propose a link cost estimation formula considering the number of hops and RSSI as the routing metric value and select the RSSI threshold value through the packet transmission experiment between the sensor modules.

  • PDF

Comparative Study of Anomaly Detection Accuracy of Intrusion Detection Systems Based on Various Data Preprocessing Techniques (다양한 데이터 전처리 기법 기반 침입탐지 시스템의 이상탐지 정확도 비교 연구)

  • Park, Kyungseon;Kim, Kangseok
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.449-456
    • /
    • 2021
  • An intrusion detection system is a technology that detects abnormal behaviors that violate security, and detects abnormal operations and prevents system attacks. Existing intrusion detection systems have been designed using statistical analysis or anomaly detection techniques for traffic patterns, but modern systems generate a variety of traffic different from existing systems due to rapidly growing technologies, so the existing methods have limitations. In order to overcome this limitation, study on intrusion detection methods applying various machine learning techniques is being actively conducted. In this study, a comparative study was conducted on data preprocessing techniques that can improve the accuracy of anomaly detection using NGIDS-DS (Next Generation IDS Database) generated by simulation equipment for traffic in various network environments. Padding and sliding window were used as data preprocessing, and an oversampling technique with Adversarial Auto-Encoder (AAE) was applied to solve the problem of imbalance between the normal data rate and the abnormal data rate. In addition, the performance improvement of detection accuracy was confirmed by using Skip-gram among the Word2Vec techniques that can extract feature vectors of preprocessed sequence data. PCA-SVM and GRU were used as models for comparative experiments, and the experimental results showed better performance when sliding window, skip-gram, AAE, and GRU were applied.