• Title/Abstract/Keywords: class imbalance

119 search results, processing time 0.024 seconds

A study on the improvement ransomware detection performance using combine sampling methods

  • 김수철;이형동;변경근;신용태
    • 융합보안논문지
    • /
    • Vol. 23, No. 1
    • /
    • pp.69-77
    • /
    • 2023
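  • Ransomware damage has recently surged worldwide, as seen in the attacks on the Irish health authority and a U.S. oil pipeline, and it is affecting every sector of society. In particular, research on ransomware detection and response increasingly uses machine learning in addition to existing detection methods. However, traditional machine learning models tend to predict toward the class with more data, which makes accurate predictions difficult to obtain. To address this, we proposed a technique that uses sampling to resolve the imbalance in a class distribution made up of majority Non-Ransomware (benign code or malware) and minority Ransomware samples, thereby improving ransomware detection performance. Experiments with two scenarios (binary and multi-class classification) confirmed that sampling improves detection performance on the minority class while maintaining detection performance on the majority class. In particular, the proposed combined sampling technique (SMOTE+ENN) yielded an improvement of more than 10% in G-mean and F1-score.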
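The two metrics named in the abstract, G-mean and F1-score, can both be computed from binary confusion-matrix counts. The sketch below is a minimal illustration and not the authors' code; the function name `gmean_and_f1` and the example counts are hypothetical.

```python
import math


def gmean_and_f1(tp, fn, fp, tn):
    """Return (G-mean, F1) for the positive (minority) class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall of the majority class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return g_mean, f1


# Hypothetical counts: 100 ransomware samples, 900 non-ransomware samples.
g_mean, f1 = gmean_and_f1(tp=80, fn=20, fp=45, tn=855)
print(f"G-mean={g_mean:.3f}, F1={f1:.3f}")
```

Because G-mean multiplies the recall of both classes, a model that never predicts the minority class scores 0 on it no matter how high its accuracy is.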

Application and Comparison of Data Mining Technique to Prevent Metal-Bush Omission

  • 고상현;이동주
    • 산업경영시스템학회지
    • /
    • Vol. 46, No. 3
    • /
    • pp.139-147
    • /
    • 2023
  • The metal bush assembly process inserts and compresses a metal bush, which reduces noise and provides stable compression in the rotating section. In this process, head diameter defects and placement defects occur due to metal bush omission, non-pressing, and poor press-fitting. Among these causes, this study aims to prevent defects due to metal bush omission by using signals from sensors attached to the facility. In particular, metal bush omission is predicted through various data mining techniques using the left load cell value, right load cell value, current, and voltage as independent variables. Because defect data for metal bush omission are difficult to obtain, the data are imbalanced. Data imbalance refers to a large difference in the number of samples belonging to each class, which can be a problem for classification. To address it, oversampling and composite sampling techniques were applied in this study. In addition, simulated annealing (SA) was applied to optimize the sampling parameters and the hyper-parameters of the data mining techniques used for omission prediction. Metal bush omission was predicted using actual data from manufacturing company M, and the classification performance was examined. All applied techniques showed excellent results, and the proposed methods, which combine Random Forest with SA and MLP with SA, showed better results.
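Simulated annealing, which the abstract applies to sampling parameters and model hyper-parameters, can be sketched generically as follows. This is an illustration under stated assumptions and not the study's implementation: the one-dimensional `objective`, the step size, and the geometric cooling schedule are all hypothetical stand-ins, and in practice `objective` would be something like a cross-validated classification error.

```python
import math
import random


def simulated_annealing(objective, start, step=0.5, t0=1.0, cooling=0.95,
                        iterations=500, seed=0):
    """Minimize a 1-D objective with a basic simulated-annealing loop."""
    rng = random.Random(seed)
    current, current_cost = start, objective(start)
    best, best_cost = current, current_cost
    temperature = t0
    for _ in range(iterations):
        candidate = current + rng.uniform(-step, step)
        cost = objective(candidate)
        delta = cost - current_cost
        # Always accept improvements; accept worse moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling  # geometric cooling
    return best, best_cost


def objective(x):
    # Hypothetical stand-in for a validation error as a function of one parameter.
    return (x - 2.0) ** 2 + 0.1


best_x, best_cost = simulated_annealing(objective, start=-3.0)
print(best_x, best_cost)
```

Accepting some worse moves while the temperature is high lets the search leave poor local regions; as the temperature falls, the loop behaves more and more like greedy descent.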

Teaching English Literature and Critical Thinking, beyond just Language Acquisition

  • Kim, Yeun-Kyong
    • 영어어문교육
    • /
    • Vol. 16, No. 4
    • /
    • pp.71-90
    • /
    • 2010
  • This study suggests that English literature educators need to be eclectic and flexible in applying theories and methods, not simply adhering to one or two for all situations and occasions. They need to be available to go with the flow and particularly employ whatever is needed at any given moment of class time. There is a current trend emphasizing English literature as merely a language resource rather than the study of English literature as an end in itself. Without much attention given to literary analysis and criticism, students tend to lack creative and critical thinking abilities. Given the current imbalance, it would seem important to address the issue, and create English class programs that maintain a balance between teaching the study of English literature to improve students' critical thinking abilities, and its use as a language resource. To fulfill this goal, thorough preparation is required. Indeed, we can direct our intelligence more effectively when we are well prepared and we are familiar with the basic methods and mechanics of teaching our subject. The greatest achievement of the English literature class I taught was that the students showed unexpectedly remarkable creative and critical appreciation of the novel we studied, in addition to improving their English language skills.

Machine learning-based Predictive Model of Suicidal Thoughts among Korean Adolescents

  • YeaJu JIN;HyunKi KIM
    • Journal of Korea Artificial Intelligence Association
    • /
    • Vol. 1, No. 1
    • /
    • pp.1-6
    • /
    • 2023
  • This study developed models using decision forest, support vector machine, and logistic regression methods to predict and prevent suicidal ideation among Korean adolescents. The study sample consisted of 51,407 individuals after removing missing data from the raw data of the 18th (2022) Youth Health Behavior Survey conducted by the Korea Centers for Disease Control and Prevention. Analysis was performed using the MS Azure program with Two-Class Decision Forest, Two-Class Support Vector Machine, and Two-Class Logistic Regression. The results of the study showed that the decision forest model achieved an accuracy of 84.8% and an F1-score of 36.7%. The support vector machine model achieved an accuracy of 86.3% and an F1-score of 24.5%. The logistic regression model achieved an accuracy of 87.2% and an F1-score of 40.1%. Applying the logistic regression model with SMOTE to address data imbalance resulted in an accuracy of 81.7% and an F1-score of 57.7%. Although the accuracy slightly decreased, the recall, precision, and F1-score improved, demonstrating excellent performance. These findings have significant implications for the development of prediction models for suicidal ideation among Korean adolescents and can contribute to the prevention and improvement of youth suicide.
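SMOTE, which the abstract uses to address data imbalance, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea, not the MS Azure implementation used in the study; the function name, the neighbour count `k`, and the toy data are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(minority) + len(new_points))
```

Every synthetic point lies on a segment between two real minority samples, so oversampling fills in the minority region rather than duplicating rows. Resampling should be applied to the training split only, so that evaluation still reflects the original class distribution.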

Context-Dependent Video Data Augmentation for Human Instance Segmentation

  • 전현진;이종훈;김인철
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • Vol. 12, No. 5
    • /
    • pp.217-228
    • /
    • 2023
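  • Video instance segmentation is a challenging task because it requires not only segmenting the instances of interest in each frame of a video but also accurately tracking those instances across the entire frame sequence. In drama videos in particular, human instance segmentation requires accurate tracking of multiple main characters who interact across various places and times. In addition, human instance segmentation in drama videos suffers from a kind of class imbalance problem, because appearance frequency differs considerably between lead characters and supporting or extra characters. In this paper, we introduce MHIS, a human instance segmentation dataset built from videos of the drama Misaeng, and propose CDVA, a novel video data augmentation method for effectively addressing the severe data imbalance among character classes. Unlike existing video data augmentation methods, CDVA generates more realistic augmented videos by fully considering the spatio-temporal context of the videos when deciding where in a background clip the target person should be inserted. The proposed CDVA can therefore effectively improve the performance of deep neural network models for video instance segmentation. Through various quantitative and qualitative experiments on the MHIS dataset, we demonstrate the usefulness and effectiveness of the proposed video data augmentation method.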

Evaluation of Multi-classification Model Performance for Algal Bloom Prediction Using CatBoost

  • 김준오;박정수
    • 한국물환경학회지
    • /
    • Vol. 39, No. 1
    • /
    • pp.1-8
    • /
    • 2023
  • Monitoring and prediction of water quality are essential for effective river pollution prevention and water quality management. In this study, a multi-classification model was developed to predict chlorophyll-a (Chl-a) level in rivers. A model was developed using CatBoost, a novel ensemble machine learning algorithm. The model was developed using hourly field monitoring data collected from January 1 to December 31, 2015. For model development, Chl-a was classified into class 1 (Chl-a≤10 ㎍/L), class 2 (10<Chl-a≤50 ㎍/L), and class 3 (Chl-a>50 ㎍/L), where the number of data used for the model training were 27,192, 11,031, and 511, respectively. The macro averages of precision, recall, and F1-score for the three classes were 0.58, 0.58, and 0.58, respectively, while the weighted averages were 0.89, 0.90, and 0.89, for precision, recall, and F1-score, respectively. The model showed relatively poor performance for class 3 where the number of observations was much smaller compared to the other two classes. The imbalance of data distribution among the three classes was resolved by using the synthetic minority over-sampling technique (SMOTE) algorithm, where the number of data used for model training was evenly distributed as 26,868 for each class. The model performance was improved with the macro averages of precision, recall, and F1-score of the three classes as 0.58, 0.70, and 0.59, respectively, while the weighted averages were 0.88, 0.84, and 0.86 after SMOTE application.
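The gap between the macro and weighted averages reported above follows from how each is defined: the macro average treats every class equally, while the weighted average weights each class by its number of samples. The sketch below illustrates this. The class sizes are the training counts quoted in the abstract, used here only as illustrative supports, and the per-class F1 values are hypothetical, since the abstract does not report them.

```python
def macro_and_weighted(scores, supports):
    """Return (macro average, support-weighted average) of per-class scores."""
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Supports: training counts from the abstract (illustrative use only).
# Per-class F1 values: hypothetical, chosen so that class 3 performs poorly.
supports = [27192, 11031, 511]
per_class_f1 = [0.95, 0.80, 0.10]
macro, weighted = macro_and_weighted(per_class_f1, supports)
print(f"macro={macro:.3f}, weighted={weighted:.3f}")
```

With a rare class that is predicted poorly, the weighted average stays high because that class contributes almost nothing to it, which is why the macro average is the more informative summary under class imbalance.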

Optimization-based Deep Learning Model to Localize L3 Slice in Whole Body Computerized Tomography Images

  • 채성원;조재현;박예은;정진형;김성진;최안렬
    • 한국정보전자통신기술학회논문지
    • /
    • Vol. 16, No. 5
    • /
    • pp.331-337
    • /
    • 2023
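  • This paper proposes a deep learning model that detects the CT image at the third lumbar vertebra (L3) level in order to assess the presence and severity of sarcopenia. In addition, to address the performance degradation caused by the data imbalance between L3-level and non-L3-level slices in CT data, it presents an optimization method that uses the oversampling ratio and class weights as design variables. Whole-body CT images of 150 patients who visited Gangneung Asan Hospital (104 with prostate cancer and 46 with bladder cancer) were used for model training and validation. ResNet50 was used as the deep learning model, and the design variables of the optimization method were five model hyperparameters together with the data augmentation ratio and class weights. Compared with the control (a model in which only the five hyperparameters were optimized), the proposed optimization-based L3 level extraction model reduced the median L3 error by about 1.0 slice. These results show that accurate L3 slice detection is possible and further suggest that oversampling through data augmentation and adjusting class weights can effectively address the data imbalance problem.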
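Class weights of the kind tuned in this study are often initialized with an inverse-frequency heuristic, w_c = N / (K * n_c), where N is the total number of samples, K the number of classes, and n_c the size of class c. The sketch below shows that heuristic only as a possible starting point; the paper itself searches for the weights by optimization, and the slice counts here are hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical slice counts: few L3 slices, many non-L3 slices.
counts = {"non_L3": 9500, "L3": 500}
weights = balanced_class_weights(counts)
print(weights)
```

With these weights each class contributes the same total weight to the loss, so the rare L3 slices are not swamped by the non-L3 majority.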

A Study on Realtime Drone Object Detection Using On-board Deep Learning

  • 이장우;김주영;김재경;권철희
    • 한국항공우주학회지
    • /
    • Vol. 49, No. 10
    • /
    • pp.883-892
    • /
    • 2021
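  • In this paper, we developed a deep learning-based object detection model that can run in real time on drone on-board equipment, in order to improve the efficiency of drone surveillance and reconnaissance missions. To improve object detection performance in drone imagery, we applied training data preprocessing, augmentation, and transfer learning during training, and applied a weighted cross-entropy method to reduce the performance gap between classes. To improve inference speed, we built an inference acceleration engine with quantization applied, which improved real-time performance. Finally, to verify the model, we analyzed detection performance and real-time performance on drone video data that was not used in training.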
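Weighted cross-entropy scales each sample's loss by a weight attached to its true class, so that errors on rare classes cost more. The sketch below is a minimal illustration and not the authors' training code; the probabilities, targets, and weights are hypothetical, and normalizing by the sum of the sample weights is one common convention among several.

```python
import math


def weighted_cross_entropy(probs, targets, class_weights):
    """Mean weighted cross-entropy, normalized by the sum of the sample weights."""
    total_loss, total_weight = 0.0, 0.0
    for p, t in zip(probs, targets):
        w = class_weights[t]
        total_loss += -w * math.log(p[t])  # p[t]: predicted probability of the true class
        total_weight += w
    return total_loss / total_weight


# Hypothetical predictions for three samples over two classes (class 1 is rare).
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]]
targets = [0, 0, 1]
plain = weighted_cross_entropy(probs, targets, class_weights=[1.0, 1.0])
weighted = weighted_cross_entropy(probs, targets, class_weights=[1.0, 5.0])
print(plain, weighted)
```

Raising the weight of the rare class makes its poorly predicted sample dominate the loss, which pushes training toward reducing the gap between classes.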

Validation of Semantic Segmentation Dataset for Autonomous Driving

  • 곽석우;나호용;김경수;송은지;정세영;이계원;정지현;황성호
    • 드라이브 ㆍ 컨트롤
    • /
    • Vol. 19, No. 4
    • /
    • pp.104-109
    • /
    • 2022
  • For autonomous driving research using AI, datasets collected from road environments play an important role. In other countries, various datasets such as CityScapes, A2D2, and BDD have already been released, but datasets suitable for the domestic road environment still need to be provided. This paper analyzed and verified the dataset reflecting the Korean driving environment. In order to verify the training dataset, the class imbalance was confirmed by comparing the number of pixels and instances of the dataset. A similar A2D2 dataset was trained with the same deep learning model, ConvNeXt, to compare and verify the constructed dataset. IoU was compared for the same class between two datasets with ConvNeXt and mIoU was compared. In this paper, it was confirmed that the collected dataset reflecting the driving environment of Korea is suitable for learning.
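The per-class IoU and mIoU comparison described in the abstract can be sketched as follows. This is a minimal illustration on flat label lists and not the evaluation code used in the paper; the function names, class names, and toy label maps are hypothetical.

```python
def iou_per_class(pred, target, n_classes):
    """Per-class IoU from flat lists of predicted and ground-truth labels."""
    ious = []
    for c in range(n_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(intersection / union if union else None)  # None: class absent
    return ious


def mean_iou(ious):
    """Average IoU over the classes that appear in the prediction or ground truth."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


# Hypothetical 8-pixel label maps with classes 0 (road), 1 (car), 2 (person).
target = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 0]
ious = iou_per_class(pred, target, n_classes=3)
miou = mean_iou(ious)
print(ious, miou)
```

Since mIoU averages over classes rather than pixels, a class with few pixels counts as much as a dominant one, which is why pixel and instance counts per class are worth checking before comparing datasets.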

Using Machine Learning Technique for Analytical Customer Loyalty

  • Mohamed M. Abbassy
    • International Journal of Computer Science & Network Security
    • /
    • Vol. 23, No. 8
    • /
    • pp.190-198
    • /
    • 2023
  • To enhance customer satisfaction for higher profits, an e-commerce business can establish a continuous relationship with its customers and acquire new ones. Machine-learning models that analyse customers' behavioural evidence can give an e-commerce platform a competitive advantage by helping to improve overall satisfaction. These models forecast which customers will churn and the causes of churn, and the forecasts are used to build unique business strategies and service offers. This work is intended to develop a machine-learning model that can accurately forecast retainable customers from the entire e-commerce customer data. Developing predictive models that classify imbalanced data effectively is a major challenge for both the collected data and machine learning algorithms, so the work builds a machine learning model that addresses class imbalance and forecasts customers. Satisfaction accuracy is used as the evaluation metric. The paper aims to evaluate different machine learning models used to forecast satisfaction. Three analytical methods from various classifications of learning were selected for this research. Classifier selection compared the efficiency of classifiers such as Random Forest, Logistic Regression, SVM, and the Gradient Boosting Algorithm. The models were applied to a dataset of 8,000 records from e-commerce websites and apps. Results indicate the best accuracy in determining the satisfaction class with both gradient-boosting algorithm classifications, which showed the maximum accuracy among the compared algorithms (Gradient Boosting, Support Vector Machine, Random Forest, and Logistic Regression). The best model developed in this paper forecasts satisfied customers with an accuracy of 88%.