• Title/Summary/Keyword: Random forest algorithm

Clinicoradiological Characteristics in the Differential Diagnosis of Follicular-Patterned Lesions of the Thyroid: A Multicenter Cohort Study

  • Jeong Hoon Lee;Eun Ju Ha;Da Hyun Lee;Miran Han;Jung Hyun Park;Ji-hoon Kim
    • Korean Journal of Radiology / v.23 no.7 / pp.763-772 / 2022
  • Objective: Preoperative differential diagnosis of follicular-patterned lesions is challenging. This multicenter cohort study investigated the clinicoradiological characteristics relevant to the differential diagnosis of such lesions. Materials and Methods: From June to September 2015, 4787 thyroid nodules (≥ 1.0 cm) with a final diagnosis of benign follicular nodule (BN, n = 4461), follicular adenoma (FA, n = 136), follicular carcinoma (FC, n = 62), or follicular variant of papillary thyroid carcinoma (FVPTC, n = 128) collected from 26 institutions were analyzed. The clinicoradiological characteristics of the lesions were compared among the different histological types using multivariable logistic regression analyses. The relative importance of the characteristics that distinguished histological types was determined using a random forest algorithm. Results: Compared to BN (as the control group), the distinguishing features of follicular-patterned neoplasms (FA, FC, and FVPTC) were patient's age (odds ratio [OR], 0.969 per 1-year increase), lesion diameter (OR, 1.054 per 1-mm increase), presence of solid composition (OR, 2.255), presence of hypoechogenicity (OR, 2.181), and presence of halo (OR, 1.761) (all p < 0.05). Compared to FA (as the control), FC differed with respect to lesion diameter (OR, 1.040 per 1-mm increase) and rim calcifications (OR, 17.054), while FVPTC differed with respect to patient age (OR, 0.966 per 1-year increase), lesion diameter (OR, 0.975 per 1-mm increase), macrocalcifications (OR, 3.647), and non-smooth margins (OR, 2.538) (all p < 0.05). The five important features for the differential diagnosis of follicular-patterned neoplasms (FA, FC, and FVPTC) from BN are maximal lesion diameter, composition, echogenicity, orientation, and patient's age. The most important features distinguishing FC and FVPTC from FA are rim calcifications and macrocalcifications, respectively. 
Conclusion: Although follicular-patterned lesions have overlapping clinical and radiological features, the distinguishing features identified in our large clinical cohort may provide valuable information for preoperative distinction between them and decision-making regarding their management.
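
The odds ratios in this abstract are reported per unit increase (per 1 year of age, per 1 mm of diameter). Because logistic regression is linear in the log-odds, a per-unit odds ratio compounds multiplicatively over larger differences. A minimal sketch of that arithmetic follows; the function name `scaled_odds_ratio` and the 10-unit comparisons are illustrative assumptions and do not come from the study.

```python
def scaled_odds_ratio(or_per_unit: float, delta: float) -> float:
    """Odds ratio implied by a change of `delta` units, given a per-unit odds ratio.

    In logistic regression the log-odds are linear in the predictor, so the
    odds ratio for a change of `delta` units is the per-unit ratio raised to `delta`.
    """
    if or_per_unit <= 0:
        raise ValueError("an odds ratio must be positive")
    return or_per_unit ** delta


# Per-unit odds ratios quoted in the abstract (follicular neoplasm vs. BN).
OR_DIAMETER_PER_MM = 1.054
OR_AGE_PER_YEAR = 0.969

# Hypothetical comparisons: a lesion 10 mm larger, a patient 10 years older.
or_10mm = scaled_odds_ratio(OR_DIAMETER_PER_MM, 10)
or_10yr = scaled_odds_ratio(OR_AGE_PER_YEAR, 10)
print(f"10 mm larger:   OR = {or_10mm:.3f}")
print(f"10 years older: OR = {or_10yr:.3f}")
```

This holds only under the model's assumption that each predictor acts linearly on the log-odds across the range being compared.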

Experimental research on flow regime and transitional criterion of slug to churn-turbulent and churn-turbulent to annular flow in rectangular channels

  • Qingche He;Liang-ming Pan;Luteng Zhang;Wangtao Xu;Meiyue Yan
    • Nuclear Engineering and Technology / v.55 no.11 / pp.3973-3982 / 2023
  • In two-phase flow in rectangular channels, flow regimes such as churn-turbulent and annular flow are significant for physical problems such as Countercurrent Flow Limitation (CCFL). In this study, rectangular channels with cross-sections of 4 × 66 mm, 6 × 66 mm, and 8 × 66 mm are adopted to investigate the flow regimes of air-water vertical upward two-phase flow under adiabatic conditions. The gas and liquid superficial velocities are 0 ≤ jg ≤ 20 m/s and 0.25 ≤ jf ≤ 3 m/s, respectively, covering bubbly to annular flow. The flow regimes are identified by a random forest algorithm, and flow regime maps are obtained. The results show that the transitional void fraction from slug to churn-turbulent flow fluctuates from 0.47 to 0.58 and is significantly affected by the channel dimensions and the flow rate. In addition, the void fraction at the transition points from churn-turbulent (slug) to annular flow is 0.66-0.67, independent of the gap size. Furthermore, a new criterion for the transition from slug to churn-turbulent flow is established in this study. Finally, by introducing the interfacial force model, the criterion for the transition from churn-turbulent (slug) flow to annular flow is verified.

Protecting Accounting Information Systems using Machine Learning Based Intrusion Detection

  • Biswajit Panja
    • International Journal of Computer Science & Network Security / v.24 no.5 / pp.111-118 / 2024
  • In general, a network-based intrusion detection system is designed to detect malicious behavior directed at a network or its resources. The key goal of this paper is to examine network data and identify whether it is normal or anomalous traffic, specifically for accounting information systems. Today, there are a variety of principles for detecting various forms of network-based intrusion. In this paper, we use supervised machine learning techniques. Classification models are used to train and validate data. Using these algorithms, we train the system on a training dataset and then use the trained system to detect intrusions in the testing dataset. Our proposed method detects whether the network data is normal or anomalous. Using this method, we can avoid unauthorized activity on the network and on the systems under that network. Decision Tree and K-Nearest Neighbor classifiers are applied in the proposed model to distinguish abnormal from normal behavior in network traffic data. In addition, Logistic Regression and Support Vector Classification algorithms are used in our model to support the proposed concepts. Furthermore, a feature selection method is used to extract valuable information from the dataset and enhance the efficiency of the proposed approach. A Random Forest machine learning algorithm is used to help the system identify crucial features and focus on them rather than on all features. The experimental findings revealed that the suggested method for network intrusion detection has a negligible false alarm rate, with accuracy expected to be between 95% and 100%. Given the high precision rate, this concept can be used to detect network data intrusion and prevent vulnerabilities on the network.

Effect of Location Error on the Estimation of Aboveground Biomass Carbon Stock (지상부 바이오매스 탄소저장량의 추정에 위치 오차가 미치는 영향)

  • Kim, Sang-Pil;Heo, Joon;Jung, Jae-Hoon;Yoo, Su-Hong;Kim, Kyoung-Min
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.29 no.2 / pp.133-139 / 2011
  • Estimating biomass carbon stock is important for estimating the public benefit of forests. Previous studies on biomass carbon stock estimation are limited by their use of deterministic models. The most serious problem with deterministic models is that they do not explain the effects of errors. In this study, the effects of location errors on the estimation of biomass carbon stock in the Danyang area were analyzed using the Monte Carlo simulation method. More specifically, the k-Nearest Neighbor (kNN) algorithm was used for the basic estimation. In this procedure, random and systematic errors were added to the locations of sample plots, and their effects on estimation error were analyzed by checking the changes in RMSE. In the random error simulation, the mean RMSE of the estimation increased from 24.8 tonC/ha to 26 tonC/ha when location errors of 0.5 to 1 pixel were added. However, the mean RMSE converged once the added location errors exceeded 0.8 pixel, because of the characteristics of the study site. In the systematic error simulation, no significant trends in RMSE were detected in the test data.

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research / v.61 no.4 / pp.523-541 / 2023
  • As accessibility to 3D printers increases, exposure to chemicals associated with 3D printing is becoming more frequent. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited by missing molecular structure data. In this study, a quantitative structure-activity relationship (QSAR) model based on a data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, the MissForest algorithm was used to impute missing values in the molecular descriptors of hazardous 3D printing materials. Then, based on four different machine learning models (decision tree, random forest, XGBoost, SVM), a machine learning (ML)-based QSAR model was developed to predict the bioconcentration factor (Log BCF), octanol-air partition coefficient (Log Koa), and partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated through the Tree-SHAP (SHapley Additive exPlanations) method, an explainable artificial intelligence (XAI) technique. The proposed MissForest-based imputation method expanded the molecular structure data by a factor of approximately 2.5 compared to the existing data. Based on the imputed dataset of molecular descriptors, the developed data-centric QSAR model achieved prediction performance of approximately 73%, 76%, and 92% for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric QSAR model achieved high prediction performance for toxicity information by identifying key molecular descriptors highly correlated with toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants in emerging printing chemicals, chemical processes, and semiconductor or display processes.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales sites, and SNS, and the characteristics of the datasets are also diverse. To secure their competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most companies do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration index, to replace the IR (Imbalanced Ratio). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation.
Oversampling of 10% to 100% was applied for each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study had a significant effect on classification performance. (2) The number of variables has a significant positive effect on classification performance, in contrast to the number of classes. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables does not have a significant effect for the Naïve Bayes algorithm, unlike for the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI, Reverse ReLU Silhouette Score) were shown to be significant; (2) the effects of data characteristics on classification performance were investigated using meta-features. The study also has practical contributions: (1) the results can be used in developing a classification algorithm recommendation system based on the characteristics of datasets.
(2) Many data scientists test repeatedly, adjusting algorithm parameters to find the optimal algorithm for a given situation, because the characteristics of data differ. This process consumes excessive resources in terms of hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The study consists of an introduction, related research, the research model, experiments, and a conclusion and discussion.
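
The abstract above uses the Herfindahl-Hirschman Index as a class-concentration meta-feature. A minimal sketch of the standard HHI definition (the sum of squared shares) applied to class labels is shown here; the function name `class_hhi` and the toy label lists are illustrative assumptions, and the abstract does not state which scaling (fractions or percentages) the authors used.

```python
from collections import Counter
from typing import Hashable, Iterable


def class_hhi(labels: Iterable[Hashable]) -> float:
    """Herfindahl-Hirschman Index of a class distribution.

    Computed as the sum of squared class shares, with shares as fractions.
    It equals 1/k for k perfectly balanced classes and approaches 1.0 as
    one class dominates.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return sum((n / total) ** 2 for n in counts.values())


# Toy label sets (hypothetical data, for illustration only).
balanced = ["a", "b", "c", "d"] * 25
skewed = ["a"] * 97 + ["b", "c", "d"]

print(f"balanced: HHI = {class_hhi(balanced):.4f}")
print(f"skewed:   HHI = {class_hhi(skewed):.4f}")
```

Unlike an imbalance ratio, which compares only the largest and smallest classes, this index takes every class share into account, which may be one reason to prefer it for multi-class datasets.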

Calibration of Portable Particulate Matter-Monitoring Device using Web Query and Machine Learning

  • Loh, Byoung Gook;Choi, Gi Heung
    • Safety and Health at Work / v.10 no.4 / pp.452-460 / 2019
  • Background: Monitoring and control of PM2.5 are being recognized as key to addressing health issues attributed to PM2.5. The availability of low-cost PM2.5 sensors has made it possible to introduce a number of portable PM2.5 monitors based on light scattering to the consumer market at an affordable price. The accuracy of light scattering-based PM2.5 monitors depends significantly on the method of calibration. A static calibration curve is the most popular calibration method for low-cost PM2.5 sensors, particularly because it is easy to apply. The drawback of this approach, however, is its lack of accuracy. Methods: This study discusses the calibration of a low-cost PM2.5-monitoring device (PMD) to improve its accuracy and reliability for practical use. The proposed method is based on constructing a PM2.5 sensor network using the Message Queuing Telemetry Transport (MQTT) protocol and on web queries of reference measurement data available at government-authorized PM monitoring stations (GAMS) in the Republic of Korea. Four machine learning (ML) algorithms (support vector machine, k-nearest neighbors, random forest, and extreme gradient boosting) were used as regression models to calibrate the PMD measurements of PM2.5. The performance of each ML algorithm was evaluated using stratified K-fold cross-validation, and a linear regression model was used as a reference. Results: Based on the performance of the ML algorithms used, regression of the PMD output to the PM2.5 concentration data available from the GAMS through web query was effective. The extreme gradient boosting algorithm showed the best performance, with a mean coefficient of determination (R2) of 0.78 and a standard error of 5.0 ㎍/㎥, corresponding to an 8% increase in R2 and a 12% decrease in root mean square error in comparison with the linear regression model. A minimum calibration period of 100 hours was found to be required to calibrate the PMD to its full capacity.
The proposed calibration method is limited in that the PMD must be located in the vicinity of a GAMS. As the number of PMDs participating in the sensor network increases, however, calibrated PMDs can be used as reference devices for nearby PMDs that require calibration, forming a calibration chain through the MQTT protocol. Conclusions: Calibration of a low-cost PMD, based on constructing a PM2.5 sensor network using the MQTT protocol and on web queries of reference measurement data available at a GAMS, significantly improves the accuracy and reliability of the PMD, thereby making practical use of the low-cost PMD possible.

Variable Selection of Feature Pattern using SVM-based Criterion with Q-Learning in Reinforcement Learning (SVM-기반 제약 조건과 강화학습의 Q-learning을 이용한 변별력이 확실한 특징 패턴 선택)

  • Kim, Chayoung
    • Journal of Internet Computing and Services / v.20 no.4 / pp.21-27 / 2019
  • The feature patterns gathered from observation of RNA sequencing (RNA-seq) data are not all equally informative for identifying differential expression: some may be noisy, correlated, or irrelevant because of redundancy in big data sets. Variable selection of feature patterns aims at finding a differentially expressed gene set that is significantly relevant to a specific task. These issues are complex and important in many domains. In the computational research field of machine learning, feature pattern selection has been studied with methods such as Random Forest, K-Nearest Neighbor, and Support Vector Machine (SVM). One of the most well-known machine learning algorithms is SVM, which is both classical and original. One member of the SVM-criterion family is Support Vector Machine-Recursive Feature Elimination (SVM-RFE), which is used in our research. We propose a novel algorithm that combines SVM-RFE with Q-learning, a reinforcement learning method, for better variable selection of feature patterns. Comparing our proposed algorithm with the well-known SVM-RFE combined with Welch's T on published data, our results show that the weight-vector criterion of SVM-RFE is improved by Q-learning through its off-policy, more exploratory scheme.

Modeling for Egg Price Prediction by Using Machine Learning (기계학습을 활용한 계란가격 예측 모델링)

  • Cho, Hohyun;Lee, Daekyeom;Chae, Yeonghun;Chang, Dongil
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.15-17 / 2022
  • In the aftermath of the avian influenza outbreak that lasted from the second half of 2020 to the beginning of 2021, 17.8 million laying hens were slaughtered. Although the government invested more than 100 billion won in egg imports as a measure to stabilize prices, stabilization proved difficult. The sharp volatility of egg prices negatively affected both consumers and poultry farmers, so measures were needed to stabilize egg prices. To this end, this study successfully predicted egg prices using a machine learning regression algorithm. For price prediction, a total of 8 independent variables were selected, including monthly broiler chicken production statistics for 2012-2021 from the Korean Poultry Association and slaughter records from the national statistics portal (KOSIS). The Root Mean Square Error (RMSE), which indicates the difference between the predicted price and the actual price, is at the level of 103 won, which can be interpreted as a relatively good prediction of egg prices. Accurate prediction of egg prices allows flexible adjustment of egg production weeks for laying hens, which can help decision-making about the stocking of laying hens. This result is expected to help secure egg price stability.
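
The RMSE cited in this abstract summarizes the typical gap between predicted and actual prices in the same unit as the prices themselves (won). A minimal sketch of the standard RMSE formula follows; the function name `rmse` and the price values are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error: the square root of the mean squared difference."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical monthly egg prices in won (illustrative values only).
actual_prices = [5000.0, 5200.0, 5100.0]
predicted_prices = [4900.0, 5300.0, 5100.0]

print(f"RMSE = {rmse(actual_prices, predicted_prices):.1f} won")
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, so an RMSE is best read alongside the typical price level.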

Spatial Downscaling of Ocean Colour-Climate Change Initiative (OC-CCI) Forel-Ule Index Using GOCI Satellite Image and Machine Learning Technique (GOCI 위성영상과 기계학습 기법을 이용한 Ocean Colour-Climate Change Initiative (OC-CCI) Forel-Ule Index의 공간 상세화)

  • Sung, Taejun;Kim, Young Jun;Choi, Hyunyoung;Im, Jungho
    • Korean Journal of Remote Sensing / v.37 no.5_1 / pp.959-974 / 2021
  • The Forel-Ule Index (FUI) is an index that classifies the colors of inland water and seawater found in nature into 21 grades ranging from indigo blue to cola brown. FUI has been analyzed in connection with the eutrophication, water quality, and light characteristics of water systems in many studies, and its potential as a new water quality index that simultaneously contains optical information on water quality parameters has been suggested. In this study, Ocean Colour-Climate Change Initiative (OC-CCI) based 4 km FUI was spatially downscaled to a resolution of 500 m using Geostationary Ocean Color Imager (GOCI) data and Random Forest (RF) machine learning. Then, the RF-derived FUI was examined in terms of its correlation with various water quality parameters measured in coastal areas and its spatial distribution and seasonal characteristics. The results showed that the RF-derived FUI achieved higher accuracy (Coefficient of Determination (R2)=0.81, Root Mean Square Error (RMSE)=0.7784) than the GOCI-derived FUI estimated by Pitarch's OC-CCI FUI algorithm (R2=0.72, RMSE=0.9708). The RF-derived FUI showed a high correlation with five water quality parameters (Total Nitrogen, Total Phosphorus, Chlorophyll-a, Total Suspended Solids, and Transparency), with correlation coefficients of 0.87, 0.88, 0.97, 0.65, and -0.98, respectively. The temporal pattern of the RF-derived FUI reflected the physical relationship with various water quality parameters well, with strong seasonality. The research findings suggest the potential of the high-resolution FUI for coastal water quality management around the Korean Peninsula.
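
The coefficient of determination reported in this abstract can be illustrated with its common definition, one minus the ratio of residual to total sum of squares. The sketch below assumes that definition; the abstract does not say whether the authors used it or the squared Pearson correlation, and the function name `r_squared` and the sample values are hypothetical.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if len(observed) == 0:
        raise ValueError("inputs must not be empty")
    mean_observed = sum(observed) / len(observed)
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and downscaled index values (illustrative only).
reference_fui = [4.0, 6.0, 9.0, 12.0, 15.0]
downscaled_fui = [4.5, 5.5, 9.5, 11.0, 15.5]

print(f"R2 = {r_squared(reference_fui, downscaled_fui):.3f}")
```

Under this definition a model that always predicts the observed mean scores 0, and a model worse than that scores below 0, which the squared-correlation variant cannot show.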