• Title/Summary/Keyword: tree-based models

Search Result 437, Processing Time 0.031 seconds

API Feature Based Ensemble Model for Malware Family Classification (악성코드 패밀리 분류를 위한 API 특징 기반 앙상블 모델 학습)

  • Lee, Hyunjong;Euh, Seongyul;Hwang, Doosung
    • Journal of the Korea Institute of Information Security & Cryptology / v.29 no.3 / pp.531-539 / 2019
  • This paper proposes training features for malware family analysis and analyzes the multi-class classification performance of ensemble models. We construct training data by extracting API and DLL information from malware executables and use the Random Forest and XGBoost algorithms, both of which are based on decision trees. API, API-DLL, and DLL-CM features for malware detection and family classification are proposed by analyzing frequently used API and DLL information from malware and converting high-dimensional features to low-dimensional features. The proposed feature selection method offers the advantages of data dimension reduction and fast learning. In the performance comparison, the malware detection rate is 93.0% for Random Forest, the accuracy on the malware family dataset is 92.0% for XGBoost, and the false positive rate on the malware family dataset including benign samples is about 3.5% for both Random Forest and XGBoost.

Analysis of Hypertension Risk Factors by Life Cycle Based on Machine Learning (머신러닝 기반 생애주기별 고혈압 위험 요인 분석)

  • Kang, SeongAn;Kim, SoHui;Ryu, Min Ho
    • Journal of Korea Society of Industrial Information Systems / v.27 no.5 / pp.73-82 / 2022
  • Chronic diseases such as hypertension require differentiated management according to age and life cycle. It is also known that hypertension is caused by a combination of various factors. This study uses machine learning prediction techniques to analyze the factors affecting hypertension by life cycle. To this end, a total of 35 variables were used after preprocessing and variable selection on the National Health and Nutrition Survey data of the Korea Centers for Disease Control and Prevention. Among the tree-based machine learning models, XGBoost was found to have high predictive performance in both middle and old age. Looking at the risk factors for hypertension by life cycle, individual characteristic factors, genetic factors, and nutritional intake factors were found to be risk factors in middle age, while nutritional intake factors, dietary factors, and lifestyle factors were derived as risk factors in old age. The results of this study are expected to serve as useful basic data for hypertension management by life cycle.

A Determining System for the Category of Need in Long-Term Care Insurance System using Decision Tree Model (의사결정나무기법을 이용한 노인장기요양보험 등급결정모형 개발)

  • Han, Eun-Jeong;Kwak, Min-Jeong;Kan, Im-Oak
    • The Korean Journal of Applied Statistics / v.24 no.1 / pp.145-159 / 2011
  • National long-term care insurance started in July 2008. We aim to address its weak points and further develop the long-term care insurance system. In particular, it is important to continually upgrade the rating model for the category of need for long-term care. We improve the rating model using data collected after the system came into force, to reflect the rapidly changing long-term care marketplace. A decision tree model was adopted to upgrade the rating model because it is easy to compare with the current system. The model rests on two assumptions. First, a person with worse functional conditions needs more long-term care services than others. Second, the volume of long-term care services is defined as service time. This study was conducted to reflect the changing circumstances. Rating models have to be continually improved to reflect changing circumstances, such as the infrastructure of the system or the characteristics of the insurance beneficiaries.

Self Introduction Essay Classification Using Doc2Vec for Efficient Job Matching (Doc2Vec 모형에 기반한 자기소개서 분류 모형 구축 및 실험)

  • Kim, Young Soo;Moon, Hyun Sil;Kim, Jae Kyeong
    • Journal of Information Technology Services / v.19 no.1 / pp.103-112 / 2020
  • Job seekers make various efforts to find a good company, and companies attempt to recruit good people. Job searching through self-introduction essays is nowadays one of the most active processes. Companies spend time and money reviewing the numerous self-introduction essays of job seekers. Job seekers, in turn, worry about whether companies will accept their self-introduction essays. This research builds a classification model and conducts experiments to classify self-introduction essays as pass or fail using deep learning and decision tree techniques. Real-world data were sampled using stratified sampling to alleviate the data imbalance between passed and failed self-introduction essays. Documents were embedded using the Doc2Vec method, which was developed from the existing Word2Vec, and were classified using logistic regression analysis. The decision tree model was chosen as a benchmark model, and K-fold cross-validation was conducted for performance evaluation. Across several experiments, the area under the curve (AUC) of PV-DM is better than that of the other Doc2Vec models, i.e., PV-DBOW and Concatenate. Furthermore, PV-DM classifies both passed and failed essays, while PV-DBOW cannot classify passed essays even though it classifies failed essays well. In addition, the classification performance of the logistic regression model with PV-DM embeddings is better than that of the decision tree-based classification model. The implication of the experimental results is that companies can reduce the cost of recruiting good job seekers. In addition, our suggested model can help job candidates pre-evaluate their self-introduction essays.

Adaptive Frequent Pattern Algorithm using CAWFP-Tree based on RHadoop Platform (RHadoop 플랫폼기반 CAWFP-Tree를 이용한 적응 빈발 패턴 알고리즘)

  • Park, In-Kyu
    • Journal of Digital Convergence / v.15 no.6 / pp.229-236 / 2017
  • An efficient frequent pattern algorithm is essential for mining association rules and for many other mining tasks, with convergence applications spread over a very broad spectrum. Models for pattern mining have been proposed that use an FP-tree to store compressed information about frequent patterns. In this paper, we propose a centroid frequent pattern growth algorithm, which we call "CAWFP-Growth", that enhances the FP-Growth algorithm by computing the center of the weights and frequencies of the itemsets. Because the conventional constraint of maximum weighted support is not necessary to maintain the downward closure property, the algorithm is more likely to reduce the search time and the information loss of the frequent patterns. The experimental results show that, by using the centroid of the items, the proposed algorithm achieves better performance than other algorithms without sacrificing accuracy or increasing processing time. The MapReduce framework model is provided to handle large amounts of data in a pseudo-distributed computing environment. In addition, the proposed algorithm still needs to be modeled in fully distributed mode.

2D-THI: Two-Dimensional Type Hierarchy Index for XML Databases (2D-THI: XML 데이테베이스를 위한 이차원 타입상속 계층색인)

  • Lee Jong-Hak
    • Journal of Korea Multimedia Society / v.9 no.3 / pp.265-278 / 2006
  • This paper presents a two-dimensional type inheritance hierarchy index (2D-THI) for XML databases. XML Schema is one of the schema models for XML documents that supports type inheritance. Conventional indexing techniques for XML databases cannot support XML queries on type inheritance hierarchies. We construct a two-dimensional index structure using multidimensional file organizations to support type inheritance hierarchies in XML queries. This indexing technique deals with the problem of clustering index entries in a two-dimensional domain space, consisting of a key element domain and a type identifier domain, based on the user query pattern. The index enhances query performance by adjusting the degree of clustering between the two domains. For performance evaluation, we compared the proposed 2D-THI with conventional class hierarchy indexing techniques in object-oriented databases, such as the CH-index and the CG-tree, through a cost model. The performance evaluation verified that the proposed two-dimensional type inheritance indexing technique can efficiently support query processing in XML databases according to the query types.


Models for Technology Evolution Path Creation Based on Citation Tree to Investigate Technology Opportunity Discovery (기술기회발굴을 위한 문서인용기반 기술진화 경로 생성 모형)

  • Lee, Jae-Min;Lee, Bangrae;Moon, Yeong-Ho;Kwon, Oh-Jin
    • Journal of Korea Technology Innovation Society / v.14 no.spc / pp.1152-1170 / 2011
  • We selected core documents from an enormous number of documents by analyzing the citation relations of papers or patents, including direct and indirect citations. We then tried to create the technology evolution path from the citation network tree. By applying the method to the patent DB of OLED (Organic Light Emitting Diode), we obtained a genealogical citation network of the core OLED patents. We analyzed how one OLED technology was transferred to semiconductor-related technology, and we named this process of transition the 'technology evolution path' of OLED technology. We also analyzed the genealogical citation network of papers on graphene. From the analysis, we found that the weight count method, which includes indirect citations, was better at evaluating the technological value of a paper than the times cited method.


A Study on Predictive Modeling of I-131 Radioactivity Based on Machine Learning (머신러닝 기반 고용량 I-131의 용량 예측 모델에 관한 연구)

  • Yeon-Wook You;Chung-Wun Lee;Jung-Soo Kim
    • Journal of Radiological Science and Technology / v.46 no.2 / pp.131-139 / 2023
  • High-dose I-131 used for the treatment of thyroid cancer causes localized exposure among the radiology technologists handling it. There is a delay between the calibration date and the time the dose of I-131 is administered to a patient. Therefore, it is necessary to directly measure the radioactivity of the administered dose using a dose calibrator. In this study, we attempted to apply machine learning modeling to measured external dose rates from shielded I-131 in order to predict its radioactivity. External dose rates were measured at distances of 1 m, 0.3 m, and 0.1 m from a shielded container holding the I-131, with a total of 868 sets of measurements taken. For the modeling process, we used the hold-out method to partition the data in a 7:3 ratio (609 for the training set, 259 for the test set). For the machine learning algorithms, we chose linear regression, decision tree, random forest, and XGBoost. To evaluate the models, we calculated root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE) to assess accuracy and $R^2$ to assess explanatory power. The evaluation results are as follows: linear regression (RMSE 268.15, MSE 71901.87, MAE 231.68, $R^2$ 0.92), decision tree (RMSE 108.89, MSE 11856.92, MAE 19.24, $R^2$ 0.99), random forest (RMSE 8.89, MSE 79.10, MAE 6.55, $R^2$ 0.99), and XGBoost (RMSE 10.21, MSE 104.22, MAE 7.68, $R^2$ 0.99). The random forest model achieved the highest predictive ability. Improving the model's performance in the future is expected to contribute to lowering exposure among radiology technologists.
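
The four evaluation metrics named in the abstract above (MSE, RMSE, MAE, and $R^2$) can be sketched in plain Python as follows. This is only an illustration of the standard definitions, not the authors' code; the function name `regression_metrics` and the sample values are hypothetical and are not taken from the study's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired true/predicted values.

    Hypothetical helper for illustration; uses the standard definitions.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mse = ss_res / n                                  # mean square error
    rmse = math.sqrt(mse)                             # RMSE is the square root of MSE
    mae = sum(abs(e) for e in errors) / n             # mean absolute error

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


if __name__ == "__main__":
    # Made-up values, only to exercise the function.
    measured = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.0, 2.0, 3.0, 5.0]
    print(regression_metrics(measured, predicted))
```

Because RMSE is the square root of MSE, the two always rank models identically; MAE can rank them differently because it weights large errors less heavily.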

Studies on the Determination of the Breast-Height Form Factors for Stem of Pinus thunbergii and Cryptomeria japonica (곰솔 및 삼나무의 흉고형수(胸高形數) 결정(決定)에 관한 연구(硏究))

  • Park, Nam Chang;Chung, Young Gwan
    • Journal of Korean Society of Forest Science / v.70 no.1 / pp.28-37 / 1985
  • To estimate the breast-height form factors of Pinus thunbergii and Cryptomeria japonica, 8 models based on tree age, diameter at breast height, and tree height were suggested and evaluated. The following equations turned out to fit best.
      • Pinus thunbergii, based on tree age: $F = 0.553 - 4.567\,(1/A) + 71.409\,(1/A^2)$ ($R^2=0.928$); ($6.727^{**}$) ($14.100^{**}$)
      • Pinus thunbergii, based on diameter at breast height: $F = 0.356 + 1.774\,(1/D) - 0.770\,(1/D^2)$ ($R^2=0.944$); ($15.102^{**}$) ($2.908^{**}$)
      • Pinus thunbergii, based on tree height: $F = 0.316 + 1.546\,(1/H) + 0.397\,(1/H^2)$ ($R^2=0.941$); ($8.380^{**}$) ($3.896^{**}$)
      • Cryptomeria japonica, based on tree age: $F = 0.400 + 2.348\,(1/A) + 17.053\,(1/A^2)$ ($R^2=0.889$); ($3.501^{**}$) ($3.298^{**}$)
      • Cryptomeria japonica, based on diameter at breast height: $F = 0.353 + 2.118\,(1/D) - 1.462\,(1/D^2)$ ($R^2=0.923$); ($14.873^{**}$) ($3.545^{**}$)
      • Cryptomeria japonica, based on tree height: $F = 0.403 + 0.427\,(1/H) + 2.843\,(1/H^2)$ ($R^2=0.887$); ($3.254^{**}$) ($5.742^{**}$)
    The breast-height form factors estimated above proved to be overestimated for young trees and small-diameter trees, and underestimated for old trees and large-diameter trees, in comparison to the figure generally accepted in Korea, a form factor of 0.45.
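
All six fitted equations in the abstract above share the form $F = b_0 + b_1/x + b_2/x^2$, so they can be evaluated with one small helper. The sketch below copies the coefficients from the abstract; the names `FORM_FACTOR_MODELS` and `form_factor` are hypothetical. The abstract does not state the units of $A$, $D$, and $H$, so the years, centimetres, and metres used here are assumptions.

```python
# Coefficients (b0, b1, b2) of F = b0 + b1/x + b2/x^2, copied from the abstract.
# Units of x are NOT given in the abstract; age in years, diameter in cm and
# height in m are assumptions made for this illustration.
FORM_FACTOR_MODELS = {
    ("Pinus thunbergii", "age"): (0.553, -4.567, 71.409),
    ("Pinus thunbergii", "diameter"): (0.356, 1.774, -0.770),
    ("Pinus thunbergii", "height"): (0.316, 1.546, 0.397),
    ("Cryptomeria japonica", "age"): (0.400, 2.348, 17.053),
    ("Cryptomeria japonica", "diameter"): (0.353, 2.118, -1.462),
    ("Cryptomeria japonica", "height"): (0.403, 0.427, 2.843),
}

REFERENCE_FORM_FACTOR = 0.45  # figure cited in the abstract as generally accepted in Korea


def form_factor(species, predictor, x):
    """Evaluate the fitted breast-height form factor for one predictor value."""
    if x <= 0:
        raise ValueError("predictor value must be positive")
    b0, b1, b2 = FORM_FACTOR_MODELS[(species, predictor)]
    return b0 + b1 / x + b2 / x ** 2


if __name__ == "__main__":
    # Illustrative ages only; they are not values from the study.
    for age in (10, 20, 40, 60):
        f = form_factor("Pinus thunbergii", "age", age)
        print(f"age={age:>2}  F={f:.3f}  difference from 0.45: {f - REFERENCE_FORM_FACTOR:+.3f}")
```

Keeping the coefficients in a lookup table keyed by species and predictor keeps the single formula separate from the fitted values, which makes the six models easy to compare side by side.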


Application of Multiple Linear Regression Analysis and Tree-Based Machine Learning Techniques for Cutter Life Index(CLI) Prediction (커터수명지수 예측을 위한 다중선형회귀분석과 트리 기반 머신러닝 기법 적용)

  • Ju-Pyo Hong;Tae Young Ko
    • Tunnel and Underground Space / v.33 no.6 / pp.594-609 / 2023
  • The TBM (Tunnel Boring Machine) method is gaining popularity in urban and underwater tunneling projects due to its ability to ensure excavation face stability and minimize environmental impact. Among the prominent models for predicting disc cutter life, the NTNU model uses the Cutter Life Index (CLI) as a key parameter, but the complexity of the testing procedures and the rarity of the equipment make measurement challenging. In this study, CLI was predicted from rock properties using multiple linear regression analysis and tree-based machine learning techniques. Through a literature review, a database including rock uniaxial compressive strength, Brazilian tensile strength, equivalent quartz content, and Cerchar abrasivity index was built, and derived variables were added. The multiple linear regression analysis selected input variables based on statistical significance and multicollinearity, while the machine learning prediction model chose variables based on their importance. With the data divided into 80% for training and 20% for testing, a comparative analysis of predictive performance was conducted, and XGBoost was identified as the optimal model. The validity of the multiple linear regression and XGBoost models derived in this study was confirmed by comparing their predictive performance with prior research.