• Title/Summary/Keyword: 랜덤 포레스트 알고리즘

Search Result 86, Processing Time 0.02 seconds

Anomaly detection and attack type classification mechanism using Extra Tree and ANN (Extra Tree와 ANN을 활용한 이상 탐지 및 공격 유형 분류 메커니즘)

  • Kim, Min-Gyu;Han, Myung-Mook
    • Journal of Internet Computing and Services
    • /
    • v.23 no.5
    • /
    • pp.79-85
    • /
    • 2022
  • Anomaly detection is a method to detect and block abnormal data flows in general users' data sets. The previously known method is a method of detecting and defending an attack based on a signature using the signature of an already known attack. This has the advantage of a low false positive rate, but the problem is that it is very vulnerable to a zero-day vulnerability attack or a modified attack. However, in the case of anomaly detection, there is a disadvantage that the false positive rate is high, but it has the advantage of being able to identify, detect, and block zero-day vulnerability attacks or modified attacks, so related studies are being actively conducted. In this study, we want to deal with these anomaly detection mechanisms, and we propose a new mechanism that performs both anomaly detection and classification while supplementing the high false positive rate mentioned above. In this study, the experiment was conducted with five configurations considering the characteristics of various algorithms. As a result, the model showing the best accuracy was proposed as the result of this study. After detecting an attack by applying the Extra Tree and Three-layer ANN at the same time, the attack type is classified using the Extra Tree for the classified attack data. In this study, verification was performed on the NSL-KDD data set, and the accuracy was 99.8%, 99.1%, 98.9%, 98.7%, and 97.9% for Normal, Dos, Probe, U2R, and R2L, respectively. This configuration showed superior performance compared to other models.

Predicting Functional Outcomes of Patients With Stroke Using Machine Learning: A Systematic Review (머신러닝을 활용한 뇌졸중 환자의 기능적 결과 예측: 체계적 고찰)

  • Bae, Suyeong;Lee, Mi Jung;Nam, Sanghun;Hong, Ickpyo
    • Therapeutic Science for Rehabilitation
    • /
    • v.11 no.4
    • /
    • pp.23-39
    • /
    • 2022
  • Objective : To summarize clinical and demographic variables and machine learning uses for predicting functional outcomes of patients with stroke. Methods : We searched PubMed, CINAHL and Web of Science to identify published articles from 2010 to 2021. The search terms were "machine learning OR data mining AND stroke AND function OR prediction OR/AND rehabilitation". Articles exclusively using brain imaging techniques, deep learning method and articles without available full text were excluded in this study. Results : Nine articles were selected for this study. Support vector machines (19.05%) and random forests (19.05%) were two most frequently used machine learning models. Five articles (55.56%) demonstrated that the impact of patient initial and/or discharge assessment scores such as modified ranking scale (mRS) or functional independence measure (FIM) on stroke patients' functional outcomes was higher than their clinical characteristics. Conclusions : This study showed that patient initial and/or discharge assessment scores such as mRS or FIM could influence their functional outcomes more than their clinical characteristics. Evaluating and reviewing initial and or discharge functional outcomes of patients with stroke might be required to develop the optimal therapeutic interventions to enhance functional outcomes of patients with stroke.

Estimation of Spatial Distribution Using the Gaussian Mixture Model with Multivariate Geoscience Data (다변량 지구과학 데이터와 가우시안 혼합 모델을 이용한 공간 분포 추정)

  • Kim, Ho-Rim;Yu, Soonyoung;Yun, Seong-Taek;Kim, Kyoung-Ho;Lee, Goon-Taek;Lee, Jeong-Ho;Heo, Chul-Ho;Ryu, Dong-Woo
    • Economic and Environmental Geology
    • /
    • v.55 no.4
    • /
    • pp.353-366
    • /
    • 2022
  • Spatial estimation of geoscience data (geo-data) is challenging due to spatial heterogeneity, data scarcity, and high dimensionality. A novel spatial estimation method is needed to consider the characteristics of geo-data. In this study, we proposed the application of Gaussian Mixture Model (GMM) among machine learning algorithms with multivariate data for robust spatial predictions. The performance of the proposed approach was tested through soil chemical concentration data from a former smelting area. The concentrations of As and Pb determined by ex-situ ICP-AES were the primary variables to be interpolated, while the other metal concentrations by ICP-AES and all data determined by in-situ portable X-ray fluorescence (PXRF) were used as auxiliary variables in GMM and ordinary cokriging (OCK). Among the multidimensional auxiliary variables, important variables were selected using a variable selection method based on the random forest. The results of GMM with important multivariate auxiliary data decreased the root mean-squared error (RMSE) down to 0.11 for As and 0.33 for Pb and increased the correlations (r) up to 0.31 for As and 0.46 for Pb compared to those from ordinary kriging and OCK using univariate or bivariate data. The use of GMM improved the performance of spatial interpretation of anthropogenic metals in soil. The multivariate spatial approach can be applied to understand complex and heterogeneous geological and geochemical features.

The Prediction of Survival of Breast Cancer Patients Based on Machine Learning Using Health Insurance Claim Data (건강보험 청구 데이터를 활용한 머신러닝 기반유방암 환자의 생존 여부 예측)

  • Doeggyu Lee;Kyungkeun Byun;Hyungdong Lee;Sunhee Shin
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.28 no.2
    • /
    • pp.1-9
    • /
    • 2023
  • Research using AI and big data is also being actively conducted in the health and medical fields such as disease diagnosis and treatment. Most of the existing research data used cohort data from research institutes or some patient data. In this paper, the difference in the prediction rate of survival and the factors affecting survival between breast cancer patients in their 40~50s and other age groups was revealed using health insurance review claim data held by the HIRA. As a result, the accuracy of predicting patients' survival was 0.93 on average in their 40~50s, higher than 0.86 in their 60~80s. In terms of that factor, the number of treatments was high for those in their 40~50s, and age was high for those in their 60~80s. Performance comparison with previous studies, the average precision was 0.90, which was higher than 0.81 of the existing paper. As a result of performance comparison by applied algorithm, the overall average precision of Decision Tree, Random Forest, and Gradient Boosting was 0.90, and the recall was 1.0, and the precision of multi-layer perceptrons was 0.89, and the recall was 1.0. I hope that more research will be conducted using machine learning automation(Auto ML) tools for non-professionals to enhance the use of the value for health insurance review claim data held by the HIRA.

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research
    • /
    • v.61 no.4
    • /
    • pp.523-541
    • /
    • 2023
  • As accessibility to 3D printers increases, there is a growing frequency of exposure to chemicals associated with 3D printing. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited due to missing molecular structure data. In this study, quantitative structure-activity relationship (QSAR) model based on data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, MissForest algorithm was utilized to impute missing values in molecular descriptors of hazardous 3D printing materials. Then, based on four different machine learning models (decision tree, random forest, XGBoost, SVM), a machine learning (ML)-based QSAR model was developed to predict the bioconcentration factor (Log BCF), octanol-air partition coefficient (Log Koa), and partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated through the Tree-SHAP (SHapley Additive exPlanations) method, which is one of explainable artificial intelligence (XAI) techniques. The proposed imputation method based on the MissForest enlarged approximately 2.5 times more molecular structure data compared to the existing data. Based on the imputed dataset of molecular descriptor, the developed data-centric QSAR model achieved approximately 73%, 76% and 92% of prediction performance for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric-based QSAR model achieved high prediction performance for toxicity information by identifying key molecular descriptors highly correlated with toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants in emerging printing chemicals, chemical process, semiconductor or display process.

Preliminary Inspection Prediction Model to select the on-Site Inspected Foreign Food Facility using Multiple Correspondence Analysis (차원축소를 활용한 해외제조업체 대상 사전점검 예측 모형에 관한 연구)

  • Hae Jin Park;Jae Suk Choi;Sang Goo Cho
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.1
    • /
    • pp.121-142
    • /
    • 2023
  • As the number and weight of imported food are steadily increasing, safety management of imported food to prevent food safety accidents is becoming more important. The Ministry of Food and Drug Safety conducts on-site inspections of foreign food facilities before customs clearance as well as import inspection at the customs clearance stage. However, a data-based safety management plan for imported food is needed due to time, cost, and limited resources. In this study, we tried to increase the efficiency of the on-site inspection by preparing a machine learning prediction model that pre-selects the companies that are expected to fail before the on-site inspection. Basic information of 303,272 foreign food facilities and processing businesses collected in the Integrated Food Safety Information Network and 1,689 cases of on-site inspection information data collected from 2019 to April 2022 were collected. After preprocessing the data of foreign food facilities, only the data subject to on-site inspection were extracted using the foreign food facility_code. As a result, it consisted of a total of 1,689 data and 103 variables. For 103 variables, variables that were '0' were removed based on the Theil-U index, and after reducing by applying Multiple Correspondence Analysis, 49 characteristic variables were finally derived. We build eight different models and perform hyperparameter tuning through 5-fold cross validation. Then, the performance of the generated models are evaluated. The research purpose of selecting companies subject to on-site inspection is to maximize the recall, which is the probability of judging nonconforming companies as nonconforming. As a result of applying various algorithms of machine learning, the Random Forest model with the highest Recall_macro, AUROC, Average PR, F1-score, and Balanced Accuracy was evaluated as the best model. Finally, we apply Kernal SHAP (SHapley Additive exPlanations) to present the selection reason for nonconforming facilities of individual instances, and discuss applicability to the on-site inspection facility selection system. Based on the results of this study, it is expected that it will contribute to the efficient operation of limited resources such as manpower and budget by establishing an imported food management system through a data-based scientific risk management model.