Search | Korea Science

Selecting Machine Learning Model Based on Natural Language Processing for Shanghanlun Diagnostic System Classification (자연어 처리 기반 『상한론(傷寒論)』 변병진단체계(辨病診斷體系) 분류를 위한 기계학습 모델 선정)

Young-Nam Kim
- 대한상한금궤의학회지 / v.14 no.1 / pp.41-50 / 2022
Objective: This study explores which machine learning algorithm is most suitable for classifying the Shanghanlun diagnostic system using natural language processing (NLP). Methods: A total of 201 data items were collected from 『Shanghanlun』 and 『Clinical Shanghanlun』; 'Taeyangbyeong-gyeolhyung' and 'Eumyangyeokchahunobokbyeong' were excluded to prevent oversampling or undersampling. The data were preprocessed with the Twitter Korean tokenizer and used to train logistic regression, ridge regression, lasso regression, naive Bayes classifier, decision tree, and random forest models, and the accuracies of the models were compared. Results: Ridge regression and the naive Bayes classifier each achieved an accuracy of 0.843, logistic regression and random forest 0.804, decision tree 0.745, and lasso regression 0.608. Conclusions: Ridge regression and the naive Bayes classifier are suitable NLP machine learning models for Shanghanlun diagnostic system classification.

CORRELATION ANALYSIS BETWEEN FOREST VOLUME, ETM+ BANDS, AND HEIGHT ESTIMATED FROM C-BAND SRTM PRODUCT

Kim, Jin-Woo;Kim, Jong-Hong;Lee, Jung-Bin;Heo, Joon
- Proceedings of the KSRS Conference / v.1 / pp.512-515 / 2006
Forest stand height and volume are important indicators for forest management as well as for environmental analysis. The Shuttle Radar Topography Mission (SRTM) radar signal is backscattered by the forest canopy, so a digital surface model (DSM) can be derived from this scattering characteristic, while the National Elevation Dataset (NED) provides bare-earth elevation data. The difference between SRTM and NED is taken as an estimate of tree height and is correlated with forest parameters, including average DBH, trees per acre, net BF per acre, and total net MBF. Among these, net board foot (BF) per acre is the index that represents forest volume well. The project site was a Douglas-fir-dominated plantation area in western Washington and northern Oregon in the U.S. The study shows a high correlation between the forest parameters and the products from SRTM, NED, and ETM+. Multiple regression analysis and a regression tree algorithm are applied to obtain an improved relationship among the parameters.

Correlation Analysis Between Forest Volume, ETM+ Bands, and Height Estimated from C-Band SRTM Product

Kim, Jin-Woo;Kim, Jong-Hong;Lee, Jung-Bin;Heo, Joon
- Korean Journal of Remote Sensing / v.22 no.5 / pp.427-431 / 2006
Forest stand height and volume are important indicators for forest management as well as for environmental analysis. The Shuttle Radar Topography Mission (SRTM) radar signal is backscattered by the forest canopy, so a digital surface model (DSM) can be derived from this scattering characteristic, while the National Elevation Dataset (NED) provides bare-earth elevation data. The difference between SRTM and NED is taken as an estimate of tree height and is correlated with forest parameters, including average DBH, trees per acre, net BF per acre, and total net MBF. Among these, net board foot (BF) per acre is the index that represents forest volume well. The project site was a Douglas-fir-dominated plantation area in western Washington and northern Oregon in the U.S. The study shows a high correlation between the forest parameters and the products from SRTM, NED, and ETM+. Multiple regression analysis and a regression tree algorithm are applied to obtain an improved relationship among the parameters.
https://doi.org/10.7780/kjrs.2006.22.5.427
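
The height estimate described in this abstract (SRTM surface elevation minus NED bare-earth elevation) and its correlation with a forest parameter can be sketched roughly as follows. This is a minimal illustration only: the function names and all numeric values are made up for the example and are not taken from the paper, which also uses multiple regression and regression trees rather than a single correlation coefficient.

```python
from math import sqrt

def canopy_height(srtm_elev, ned_elev):
    """Estimate tree height per stand as SRTM elevation minus NED bare-earth elevation."""
    return [s - n for s, n in zip(srtm_elev, ned_elev)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-stand values (elevations in metres, volume in net BF per acre).
srtm = [130.0, 152.0, 171.0, 190.0]
ned = [120.0, 137.0, 151.0, 165.0]
net_bf_per_acre = [8000.0, 12000.0, 16000.0, 20000.0]

heights = canopy_height(srtm, ned)
r = pearson_r(heights, net_bf_per_acre)
print(heights, round(r, 3))
```

The toy volumes were chosen to be exactly linear in the height difference, so the coefficient here is 1 up to rounding; real stand data would show a weaker relationship.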

Dynamic Caching Routing Strategy for LEO Satellite Nodes Based on Gradient Boosting Regression Tree

Yang Yang;Shengbo Hu;Guiju Lu
- Journal of Information Processing Systems / v.20 no.1 / pp.131-147 / 2024
A routing strategy based on traffic prediction and dynamic cache allocation for satellite nodes is proposed to address the issues of high propagation delay and overall delay of inter-satellite and satellite-to-ground links in low Earth orbit (LEO) satellite systems. The spatial and temporal correlations of satellite network traffic were analyzed, and the relevant traffic through the target satellite was extracted as raw input for traffic prediction. An improved gradient boosting regression tree algorithm was used for traffic prediction. Based on the traffic prediction results, a dynamic cache allocation routing strategy is proposed. The satellite nodes periodically monitor the traffic load on inter-satellite links (ISLs) and dynamically allocate cache resources for each ISL with neighboring nodes. Simulation results demonstrate that the proposed routing strategy effectively reduces packet loss rate and average end-to-end delay and improves the distribution of services across the entire network.
https://doi.org/10.3745/JIPS.03.0193

Malicious URL Detection by Visual Characteristics with Machine Learning: Roles of HTTPS (시각적 특징과 머신 러닝으로 악성 URL 구분: HTTPS의 역할)

Sung-Won HONG;Min-Soo KANG
- Journal of Korea Artificial Intelligence Association / v.1 no.2 / pp.1-6 / 2023
In this paper, we present a new method for classifying malicious URLs that reduces the learning difficulty caused by unfamiliar and difficult information security terms. The study extracts only visually distinguishable features of the URL structure, compares them using supervised learning algorithms, and then compares the feature contribution values of the best-performing algorithm to identify the features that have the most impact on classifying malicious URLs. The research data, obtained from Kaggle, comprised 7,046 malicious URLs and 7,046 normal URLs. Among the three supervised learning algorithms used (Decision Tree, Support Vector Machine, and Logistic Regression), the Decision Tree algorithm showed the best performance, with 83% accuracy, an 83.1% F1-score, and 83.6% recall. The feature contributions of the Decision Tree confirmed that, among the visually distinguishable features (use of https, subdomain, and prefix and suffix), the use of https has the highest contribution. Although learning unfamiliar and difficult terms has been a barrier so far, this study can provide an intuitive way of judging URLs without explanation of such terms and demonstrates its usefulness in the field of malicious URL detection.
https://doi.org/10.24225/jkaia.2023.1.2.1
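
As a rough illustration of what "visually distinguishable" URL features might look like, the sketch below extracts three boolean features with Python's standard `urllib.parse`. The function name, the feature names, and the exact rules are assumptions made for this example: in particular, "prefix and suffix" is interpreted here as a hyphen in the host name, and "subdomain" as more than two host labels after dropping a leading `www.`. The abstract does not specify the paper's actual definitions.

```python
from urllib.parse import urlparse

def visual_url_features(url):
    """Return simple, visually checkable features of a URL (hypothetical definitions)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Assumption: a leading "www." is not counted as a subdomain.
    if host.startswith("www."):
        host = host[4:]
    labels = [label for label in host.split(".") if label]
    return {
        "uses_https": parsed.scheme == "https",
        # Crude rule: more than two labels means a subdomain is present.
        # This miscounts multi-part suffixes such as "example.co.uk".
        "has_subdomain": len(labels) > 2,
        # Assumption: "prefix and suffix" means a hyphen-separated host name.
        "has_prefix_suffix": "-" in host,
    }

benign = visual_url_features("https://www.example.com/login")
suspicious = visual_url_features("http://secure.pay-pal.example.com/")
print(benign)
print(suspicious)
```

Feature dictionaries like these could then be fed to a decision tree or other supervised classifier; that step is omitted here to keep the sketch dependency-free.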

Comparison of tree-based ensemble models for regression

Park, Sangho;Kim, Chanmin
- Communications for Statistical Applications and Methods / v.29 no.5 / pp.561-589 / 2022
Tree-based ensemble models, such as random forest (RF) and Bayesian additive regression trees (BART), are produced by combining multiple classification and regression trees. In this study, we compare the model structures and performances of various ensemble models in regression settings. RF learns from bootstrapped samples and selects a splitting variable from predictors gathered at each node. The BART model is specified as a sum of trees and is fitted using the Bayesian backfitting algorithm. Through extensive simulation studies, the strengths and drawbacks of the two methods in the presence of missing data, high-dimensional data, or highly correlated data are investigated. In the presence of missing data, BART performs well in general, whereas RF provides adequate coverage. BART performs better on high-dimensional, highly correlated data. However, in all of the scenarios considered, RF has a shorter computation time. The performance of the two methods is also compared using two real data sets that represent the aforementioned situations, and the same conclusion is reached.
https://doi.org/10.29220/CSAM.2022.29.5.561

Identification of Regression Outliers Based on Clustering of LMS-residual Plots

Kim, Bu-Yong;Oh, Mi-Hyun
- Communications for Statistical Applications and Methods / v.11 no.3 / pp.485-494 / 2004
An algorithm is proposed to identify multiple outliers in linear regression. It is based on the clustering of residuals from the least median of squares estimation. A cut-height criterion for the hierarchical cluster tree is suggested, which yields the optimal clustering of the regression outliers. Comparisons of the effectiveness of the procedures are performed on the basis of the classic data and artificial data sets, and it is shown that the proposed algorithm is superior to the one that is based on the least squares estimation. In particular, the algorithm deals very well with the masking and swamping effects while the other does not.
https://doi.org/10.5351/CKSS.2004.11.3.485

A Combinatorial Optimization for Influential Factor Analysis: a Case Study of Political Preference in Korea

Yun, Sung Bum;Yoon, Sanghyun;Heo, Joon
- Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.35 no.5 / pp.415-422 / 2017
Finding influential factors from a given clustering result is a typical data science problem. A genetic algorithm-based method is proposed to derive influential factors, and its performance is compared with two conventional methods, Classification and Regression Tree (CART) and Chi-Squared Automatic Interaction Detection (CHAID), using Dunn's index. To extract the factors that influence preference for political parties in South Korea, the voting results of the 18th presidential election and 'Demographic', 'Health and Welfare', 'Economic', and 'Business' related data were used. Based on the analysis, reverse engineering was implemented. A reverse-engineering-based approach to influential factor analysis can provide a new set of influential variables, which can offer new insight to the data mining field.
https://doi.org/10.7848/ksgpc.2017.35.5.415

Regression Trees with Unbiased Variable Selection (변수선택 편향이 없는 회귀나무를 만들기 위한 알고리즘)

김진흠;김민호
- The Korean Journal of Applied Statistics / v.17 no.3 / pp.459-473 / 2004
It is well known that the exhaustive search algorithm suggested by Breiman et al. (1984) tends to select variables with relatively many possible splits as the splitting rule. We propose an algorithm to overcome this variable selection bias and construct unbiased regression trees based on it. The proposed algorithm runs in two steps: selecting a split variable, and then determining a binary split rule based on that variable. Simulation studies were performed to compare the proposed algorithm with the CART (Classification and Regression Tree) of Breiman et al. (1984) in terms of degree of variable selection bias, variable selection power, and MSE (Mean Squared Error). We also illustrate the proposed algorithm with real data sets.
https://doi.org/10.5351/KJAS.2004.17.3.459

Sequential prediction of TBM penetration rate using a gradient boosted regression tree during tunneling

Lee, Hang-Lo;Song, Ki-Il;Qi, Chongchong;Kim, Kyoung-Yul
- Geomechanics and Engineering / v.29 no.5 / pp.523-533 / 2022
Several prediction models of the penetration rate (PR) of tunnel boring machines (TBMs) have focused on application at the design stage. In the construction stage, however, the expected PR and its trends change during tunneling owing to TBM excavation skills and the gap between the investigated and actual geological conditions. Monitoring the PR during tunneling is crucial to rescheduling the excavation plan in real time. This study proposes a sequential prediction method applicable in the construction stage. Geological and TBM operating data are collected from the Gunpo cable tunnel in Korea and preprocessed through normalization and augmentation. The results show that sequential prediction with a unit prediction distance (UPD) of 1 ring achieves R²≥0.79, whereas one-step prediction achieves R²≤0.30. As the modeling algorithm, a gradient boosted regression tree (GBRT) outperformed least squares-based linear regression in the sequential prediction method. For practical use, a simple equation relating R² and UPD is proposed. As UPD increases, R² decreases exponentially; in particular, the UPD at R²=0.60 is calculated as 28 rings using the equation. Such a time interval will provide enough time for decision-making. The UPD can be adjusted depending on the project and the R² value targeted by an operator. Therefore, the calculation process for the equation between R² and UPD is presented.
https://doi.org/10.12989/gae.2022.29.5.523
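
The abstract states only that R² decreases exponentially with UPD; the fitted equation and its coefficients are not given. Purely as an assumed functional form, with hypothetical positive constants $a$ and $b$ and $d$ denoting the UPD in rings, the relationship and its inversion for a target R² could be written as:

```latex
R^{2}(d) \approx a\,e^{-b d}, \qquad a,\, b > 0
\quad\Longrightarrow\quad
d^{*} = \frac{1}{b}\,\ln\!\frac{a}{R^{2}_{\mathrm{target}}}
```

Under this assumed form, the reported result (28 rings at R²=0.60) would correspond to $\frac{1}{b}\ln\frac{a}{0.60} = 28$; the actual form used in the paper may differ, for example by including an offset term.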

Search Result 118, Processing Time 0.025 seconds
