• Title/Summary/Keyword: cross validation

A Study on the Prediction of Uniaxial Compressive Strength Classification Using Slurry TBM Data and Random Forest (이수식 TBM 데이터와 랜덤포레스트를 이용한 일축압축강도 분류 예측에 관한 연구)

  • Tae-Ho Kang;Soon-Wook Choi;Chulho Lee;Soo-Ho Chang
    • Tunnel and Underground Space / v.33 no.6 / pp.547-560 / 2023
  • Research on predicting ground classification using machine learning techniques, TBM excavation data, and ground data has been increasing recently. In this study, a multi-class prediction of uniaxial compressive strength (UCS) was conducted by applying a random forest model, a decision-tree-based machine learning technique widely used in various fields, to machine data and ground data acquired at three slurry shield TBM sites. For the classification, the data were split into training and test sets at a 7:3 ratio, and a grid search with 5-fold cross-validation was used to select the optimal parameters. The random forest classifier for UCS achieved high accuracy: 0.983 on the training set and 0.982 on the test set. However, because of the imbalance in data distribution between classes, recall was low for class 4. Additional research is needed to increase the amount of measured UCS data acquired at various sites.
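The tuning procedure described in this abstract (a 7:3 split followed by a grid search scored with 5-fold cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a 1-D k-nearest-neighbour classifier stands in for the random forest, the data are synthetic, and all function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def grid_search_cv(data, grid, folds=5):
    """Return the candidate with the best mean validation accuracy, plus all scores."""
    parts = k_fold_indices(len(data), folds)
    scores = {}
    for k in grid:
        fold_acc = []
        for held_out in parts:
            held = set(held_out)
            val = [data[i] for i in held_out]
            trn = [d for i, d in enumerate(data) if i not in held]
            fold_acc.append(accuracy(trn, val, k))
        scores[k] = sum(fold_acc) / folds
    return max(scores, key=scores.get), scores

# Synthetic two-class data (hypothetical; not the TBM data).
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

split = int(0.7 * len(data))               # 7:3 train/test split
train, test = data[:split], data[split:]

best_k, cv_scores = grid_search_cv(train, grid=[1, 3, 5, 7, 9])
test_acc = accuracy(train, test, best_k)   # test set is used once, after tuning
print(best_k, round(test_acc, 3))
```

The test set stays out of the grid search entirely, so the final accuracy is not biased by the parameter selection.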

The Optimization of Ensembles for Bankruptcy Prediction (기업부도 예측 앙상블 모형의 최적화)

  • Myoung Jong Kim;Woo Seob Yun
    • Information Systems Review / v.24 no.1 / pp.39-57 / 2022
  • This paper proposes the GMOPTBoost algorithm to improve the performance of the AdaBoost algorithm for bankruptcy prediction, in which the class imbalance problem is inherent. The AdaBoost algorithm has the advantage of providing a robust learning opportunity for misclassified samples. However, it is limited in addressing the class imbalance problem because the concept of arithmetic mean accuracy is embedded in it. GMOPTBoost can optimize the geometric mean accuracy and effectively solve the class imbalance problem by applying Gaussian gradient descent. The samples are constructed in two phases. First, five class-imbalanced datasets are constructed to verify the effect of the class imbalance problem on the performance of the prediction model and the performance improvement from GMOPTBoost. Second, class-balanced data are constructed through data sampling techniques to verify the performance improvement from GMOPTBoost. The main results of 30 rounds of cross-validation analysis are as follows. First, the class imbalance problem degrades the performance of ensembles. Second, GMOPTBoost contributes to performance improvements of AdaBoost ensembles trained on imbalanced datasets. Third, data sampling techniques have a positive impact on performance improvement. Finally, GMOPTBoost contributes to significant performance improvement of AdaBoost ensembles trained on balanced datasets.
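The contrast between arithmetic and geometric mean accuracy that motivates this paper can be illustrated with a small worked example. The abstract does not define geometric mean accuracy, so the sketch assumes the common definition (the geometric mean of the per-class recalls); the labels and function names are hypothetical.

```python
import math

def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted members / all members."""
    recalls = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / total
    return recalls

def arithmetic_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def geometric_mean_accuracy(y_true, y_pred):
    r = list(per_class_recall(y_true, y_pred).values())
    return math.prod(r) ** (1 / len(r))

# Hypothetical imbalanced sample: 95 healthy firms (0), 5 bankrupt firms (1).
y_true = [0] * 95 + [1] * 5
always_healthy = [0] * 100                        # ignores the minority class
balanced = [0] * 85 + [1] * 10 + [1] * 4 + [0]    # finds 4 of 5 bankruptcies

for name, pred in [("always healthy", always_healthy), ("balanced", balanced)]:
    print(name,
          round(arithmetic_accuracy(y_true, pred), 3),
          round(geometric_mean_accuracy(y_true, pred), 3))
```

A classifier that never predicts the minority class keeps a high arithmetic accuracy but drops to zero geometric mean accuracy, which is why optimizing the latter penalizes ignoring the minority class.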

MRI Predictors of Malignant Transformation in Patients with Inverted Papilloma: A Decision Tree Analysis Using Conventional Imaging Features and Histogram Analysis of Apparent Diffusion Coefficients

  • Chong Hyun Suh;Jeong Hyun Lee;Mi Sun Chung;Xiao Quan Xu;Yu Sub Sung;Sae Rom Chung;Young Jun Choi;Jung Hwan Baek
    • Korean Journal of Radiology / v.22 no.5 / pp.751-758 / 2021
  • Objective: Preoperative differentiation between inverted papilloma (IP) and its malignant transformation to squamous cell carcinoma (IP-SCC) is critical for patient management. We aimed to determine the diagnostic accuracy of conventional imaging features and histogram parameters obtained from whole tumor apparent diffusion coefficient (ADC) values to predict IP-SCC in patients with IP, using decision tree analysis. Materials and Methods: In this retrospective study, we analyzed data generated from the records of 180 consecutive patients with histopathologically diagnosed IP or IP-SCC who underwent head and neck magnetic resonance imaging, including diffusion-weighted imaging; 62 patients were included in the study. To obtain whole tumor ADC values, the region of interest was placed to cover the entire volume of the tumor. Classification and regression tree analyses were performed to determine the most significant predictors of IP-SCC among multiple covariates. The final tree was selected by cross-validation pruning based on minimal error. Results: Of 62 patients with IP, 21 (34%) had IP-SCC. The decision tree analysis revealed that the loss of convoluted cerebriform pattern and the 20th percentile cutoff of ADC were the most significant predictors of IP-SCC. With these decision trees, the sensitivity, specificity, accuracy, and C-statistics were 86% (18 out of 21; 95% confidence interval [CI], 65-95%), 100% (41 out of 41; 95% CI, 91-100%), 95% (59 out of 62; 95% CI, 87-98%), and 0.966 (95% CI, 0.912-1.000), respectively. Conclusion: Decision tree analysis using conventional imaging features and histogram analysis of whole volume ADC could predict IP-SCC in patients with IP with high diagnostic accuracy.
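The proportions and confidence intervals in the Results can be checked with a short calculation. The abstract does not say which interval method was used; the sketch below assumes the Wilson score interval, and it takes the accuracy count as 18 + 41 = 59 correct out of all 21 + 41 = 62 patients. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

counts = {
    "sensitivity": (18, 21),   # IP-SCC cases correctly identified
    "specificity": (41, 41),   # IP cases correctly identified
    "accuracy": (59, 62),      # all correct calls over all patients
}
for name, (k, n) in counts.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k / n:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```

Under this assumption the intervals round to 65-95%, 91-100%, and 87-98%, in line with the bounds reported in the abstract.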

A Study on the Drug Classification Using Machine Learning Techniques (머신러닝 기법을 이용한 약물 분류 방법 연구)

  • Anmol Kumar Singh;Ayush Kumar;Adya Singh;Akashika Anshum;Pradeep Kumar Mallick
    • Advanced Industrial SCIence / v.3 no.2 / pp.8-16 / 2024
  • This paper presents a drug classification system whose goal is to predict the appropriate drug for a patient from demographic and physiological traits. The dataset consists of attributes such as Age, Sex, BP (blood pressure), Cholesterol level, and Na_to_K (sodium-to-potassium ratio), and the objective is to determine the kind of drug being given. The models used in this paper are K-Nearest Neighbors (KNN), Logistic Regression, and Random Forest. GridSearchCV with 5-fold cross-validation was used to fine-tune the hyperparameters, and each model was trained and tested on the dataset. Each model was assessed both with and without hyperparameter tuning using accuracy, confusion matrices, and classification reports; the accuracy of the models was 0.7, 0.875, and 0.975 without GridSearchCV and 0.75, 1.0, and 0.975 with GridSearchCV. According to GridSearchCV, Logistic Regression is the most suitable of the three models for drug classification, followed by K-Nearest Neighbors. Also, Na_to_K is an essential feature in predicting the outcome.

A Method for Extracting Equipment Specifications from Plant Documents and Cross-Validation Approach with Similar Equipment Specifications (플랜트 설비 문서로부터 설비사양 추출 및 유사설비 사양 교차 검증 접근법)

  • Jae Hyun Lee;Seungeon Choi;Hyo Won Suh
    • Journal of Korea Society of Industrial Information Systems / v.29 no.2 / pp.55-68 / 2024
  • Plant engineering companies create or refer to requirements documents for each related field, such as plant process, equipment, piping, and instrumentation, in different engineering departments. The process-related requirements document includes not only a description of the process but also the requirements of the equipment or related facilities that will operate it. Since the authors and reviewers of the requirements documents are different, inconsistencies may occur between the equipment or parts design specifications described in different requirements documents. Ensuring consistency in these matters can increase the reliability of the overall plant design information. However, the number of documents and the scattering of requirements for the same equipment and parts across different documents make it challenging for engineers to trace and manage requirements. This paper proposes a method to analyze requirement sentences and calculate their similarity in order to identify semantically identical sentences. To calculate the similarity of requirement sentences, we propose a named entity recognition method to identify compound words for the parts and properties that are semantically central to the requirements. A method to calculate the similarity of the identified compound words for parts and properties is also proposed. The proposed method is explained using sentences from practical documents, and experimental results are described.

Ultrafast MRI and T1 and T2 Radiomics for Predicting Invasive Components in Ductal Carcinoma in Situ Diagnosed With Percutaneous Needle Biopsy

  • Min Young Kim;Heera Yoen;Hye Ji;Sang Joon Park;Sun Mi Kim;Wonshik Han;Nariya Cho
    • Korean Journal of Radiology / v.24 no.12 / pp.1190-1199 / 2023
  • Objective: This study aimed to investigate the feasibility of ultrafast magnetic resonance imaging (MRI) and radiomic features derived from breast MRI for predicting the upstaging of ductal carcinoma in situ (DCIS) diagnosed using percutaneous needle biopsy. Materials and Methods: Between August 2018 and June 2020, 95 patients with 98 DCIS lesions who underwent preoperative breast MRI, including an ultrafast sequence, and subsequent surgery were included. Four ultrafast MRI parameters were analyzed: time-to-enhancement, maximum slope (MS), area under the curve for 60 s after enhancement, and time-to-peak enhancement. One hundred and seven radiomic features were extracted for the whole tumor on the first post-contrast T1WI and T2WI using PyRadiomics. Clinicopathological characteristics, ultrafast MRI findings, and radiomic features were compared between the pure DCIS and DCIS with invasion groups. Prediction models, incorporating clinicopathological, ultrafast MRI, and radiomic features, were developed. Receiver operating characteristic curve analysis and area under the curve (AUC) were used to evaluate model performance in distinguishing between the two groups using leave-one-out cross-validation. Results: Thirty-six of the 98 lesions (36.7%) were confirmed to have invasive components after surgery. Compared to the pure DCIS group, the DCIS with invasion group had a higher nuclear grade (P < 0.001), larger mean lesion size (P = 0.038), larger mean MS (P = 0.002), and different radiomic-related characteristics, including a more extensive tumor volume; higher maximum gray-level intensity; coarser, more complex, and heterogeneous texture; and a greater concentration of high gray-level intensity. No significant differences in AUCs were found between the model incorporating nuclear grade and lesion size (0.687) and the models integrating additional ultrafast MRI and radiomic features (0.680-0.732). 
Conclusion: High nuclear grade, larger lesion size, larger MS, and multiple radiomic features were associated with DCIS upstaging. However, the addition of MS and radiomic features to the prediction model did not significantly improve the prediction performance.
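The evaluation scheme named in the Methods (AUC estimated with leave-one-out cross-validation) can be sketched as follows. This is only an illustration under stated assumptions: a one-feature nearest-centroid score replaces the study's prediction models, the lesion sizes and labels are made up, and the function names are hypothetical.

```python
from statistics import mean

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_score(train_x, train_y, x):
    """Higher when x lies closer to the positive-class mean than to the negative one."""
    mu1 = mean(v for v, y in zip(train_x, train_y) if y == 1)
    mu0 = mean(v for v, y in zip(train_x, train_y) if y == 0)
    return abs(x - mu0) - abs(x - mu1)

def loocv_scores(xs, ys):
    """Score each case with a model fitted on all the other cases."""
    out = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        out.append(centroid_score(train_x, train_y, xs[i]))
    return out

# Hypothetical lesion sizes (mm) and upstaging labels (1 = invasion found).
sizes = [12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 35.0, 40.0, 45.0]
upstaged = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

held_out_scores = loocv_scores(sizes, upstaged)
print(round(auc(held_out_scores, upstaged), 3))
```

Pooling the held-out scores before computing one AUC is a common choice with leave-one-out splits, because a single held-out case cannot produce an AUC on its own.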

Non Destructive Fast Determination of Fatty Acid Composition by Near Infrared Reflectance Spectroscopy in Sesame

  • Kang, Churl-Whan;Kim, Dong-Hwi;Lee, Sung-Woo;Kim, Ki-Jong;Cho, Kyu-Chae;Shim, Kang-Bo
    • Korean Journal of Crop Science / v.51 no.spc1 / pp.283-291 / 2006
  • A non-destructive, rapid seed analysis technique using near-infrared reflectance spectroscopy (NIRS) was investigated at the National Institute of Crop Science (NICS) for screening sesame varieties with ultra-high oleic (C18:1) and linoleic (C18:2) acid content among genetic resources and pedigree lines from cross and mutation breeding. Of 378 landraces and introduced cultivars, 150 were used to analyze fatty acids by NIRS and gas chromatography (GC). By GC, the average content was 9.64% for palmitic acid (C16:0), 4.73% for stearic acid (C18:0), 42.26% for oleic acid, and 43.38% for linoleic acid. The content ranged from 7.29 to 12.27% for palmitic acid, from 2.39 to 8.88% for stearic acid (a spread of 6.49%), from 37.36 to 49.95% for oleic acid (a spread of 12.59%, wider than those of stearic and palmitic acid), and from 30.60 to 47.40% for linoleic acid (the widest spread). The NIRS spectra covered wavelengths from 400 to 2,500 nm, and the fatty acid contents of the varieties followed a regular distribution. By NIRS, the content of oleic acid, which is valued for food processing and human health, ranged from 38.31 to 52.39%, a spread of 14.08% that is 1.49% wider than that by GC. The linoleic acid content by NIRS ranged from 30.60 to 47.01%, a spread of 16.41% that is 0.39% narrower than that by GC. The varietal differences in oleic and linoleic acid content by NIRS showed a tendency similar to those by GC. For the NIRS calibration statistics, partial least squares regression (PLSR), one of the multivariate regression (MVR) methods, was applied to the spectra from 700 to 2,500 nm for oleic and linoleic acids.
The coefficient of determination (RSQ) for oleic acid content was 0.724, that is, 72.4% of the sample varieties were distributed within the 0.570% standard error of calibration (SEC), which is statistically acceptable for comparing NIRS and GC. The standard error of cross-validation (SECV) for oleic acid was 0.725, that is, the samples of the population were distributed within a standard error of 0.725% between the values from NIRS and GC. The RSQ for linoleic acid content was 0.735, that is, 73.5% of the sample varieties were distributed within an SEC of 0.643%. The SECV for linoleic acid was 0.711, that is, the samples of the population were distributed within a standard error of 0.711% between NIRS and GC. Consequently, using NIRS instead of GC to analyze oleic and linoleic acids was judged statistically valid, because the majority of samples were distributed within negligible SEC and SECV. To increase the statistical significance of NIRS analysis, sesame germplasm with a wider range of fatty acid content should continue to be added in order to raise the RSQ and reduce the SEC and SECV.
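The difference between SEC and SECV can be made concrete with a small sketch. It rests on several assumptions: a one-predictor least-squares line replaces PLSR on full spectra, SEC divides the squared residuals by n - 2, SECV is the root mean squared leave-one-out error (conventions for the denominators vary between software packages), and the readings are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def sec(xs, ys):
    """Standard error of calibration: residuals of the model on its own training data."""
    slope, intercept = fit_line(xs, ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))     # n - p - 1 with p = 1 predictor

def secv(xs, ys):
    """Standard error of cross-validation: each sample predicted by a model that never saw it."""
    squared = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared.append((ys[i] - (slope * xs[i] + intercept)) ** 2)
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical NIRS-derived predictor and GC reference values (% oleic acid).
nirs_signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gc_oleic = [40.1, 41.9, 44.2, 45.8, 48.1, 49.9]

print(round(sec(nirs_signal, gc_oleic), 3), round(secv(nirs_signal, gc_oleic), 3))
```

SEC describes how well the calibration fits the samples it was built from, while SECV estimates the error on samples left out of the fit, so the two are usually reported together.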

Exploration of the Multiple Structure of Relational Self and Construct Validation among Korean Adults (한국남녀의 관계적 자아의 특성: 다원적 구성요인 탐색 및 타당성 분석)

  • Ji Kyung Kim;Myoung So Kim
    • Korean Journal of Culture and Social Issue / v.9 no.2 / pp.41-59 / 2003
  • The present study was conducted to (1) explore how Korean men and women perceive what an important relationship is for them and how each gender group construes the relational self, and (2) develop a scale to assess the factors of the relational self and verify the construct validity of the scale. Forty college students and 60 adults participated in a survey and a focus group interview (FGI), respectively, and content analysis of their responses yielded 2 categories with 39 characteristics of the relational self. One category, which was important to men, was named 'instrumentality', and the other, which was important to women, was named 'expressivity'. The list of 39 items was administered to a nationwide sample of 1503 Korean adults to assess their construal of the relational self on a 6-point Likert scale. Principal axis factor analysis showed that the two categories were unidimensional with high reliability. Factor analysis on each category extracted a total of 9 factors. Specifically, instrumentality consisted of utilitarianism, independence, initiativeness, self-assurance, and competence, and the factors of expressivity were empathy, passiveness, dependency, and consideration. Tests of mean difference revealed that men had higher scores on most of the instrumental factors, while women had higher scores on most of the expressive factors. However, there was no sex difference on the interdependent self-construal scale (Cross, 2000), which has been frequently used for measuring the relational self. This is related to Koreans' collectivist cultural characteristics, and it was concluded that relationships with others are very important to both Korean men and women, but that the meaning and expectations of the relationship, as well as the method for preserving it, differ between the sexes.
In addition, the correlation analyses indicated that the femininity score was positively correlated with expressivity, while the masculinity score was positively correlated with instrumentality. This result implies that the differences in relational self among Korean people are related to the socialization process of each sex, i.e., sex role identity. Finally, limitations of this study and directions for future research are discussed.

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings by applying statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they rely on strict assumptions. Such assumptions include linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions of traditional statistics have limited their application to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically and leads to high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. SVM also does not require many data samples for training, since it builds prediction models using only some representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can degrade SVM's performance.
First, SVM was originally proposed for solving binary-class classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not improve performance in multi-class classification problems as much as SVM does in binary-class classification. Second, approximation algorithms (e.g., decomposition methods and the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, a difficulty in multi-class prediction problems lies in the data imbalance problem, which can occur when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often cause a default classifier to be built because of a skewed boundary, which reduces the classification accuracy of the classifier. SVM ensemble learning is one of the machine learning methods for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Thus, boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of the misclassified observations of the minority class. This paper proposes multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem.
Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, it can carry out the learning process considering the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. 10-fold cross-validation is performed three times with different random seeds to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, cross-validated folds have been tested independently of each algorithm. Through these steps, we obtained results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of the classifiers over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
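The protocol described here (10-fold cross-validation repeated three times with different seeds, giving 30 experiments whose per-fold scores feed a t-test) can be sketched as follows. The abstract does not say whether the t-test was paired; the sketch assumes a paired test because every classifier is evaluated on the same folds. The sample size and function names are hypothetical.

```python
import math
import random
import statistics

def repeated_k_fold(n, k=10, seeds=(0, 1, 2)):
    """Yield (seed, fold_no, train_idx, test_idx) for every fold of every repetition."""
    for seed in seeds:
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [j for g, fold in enumerate(folds) if g != f for j in fold]
            yield seed, f, train_idx, test_idx

def paired_t_statistic(a, b):
    """t statistic for the mean per-fold difference between two classifiers."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

runs = list(repeated_k_fold(n=200))      # hypothetical data set of 200 firms
print(len(runs))                         # 3 repetitions x 10 folds
```

Each classifier would be trained on `train_idx` and scored on `test_idx` in every run, and the 30 paired scores of two classifiers would then be passed to `paired_t_statistic`.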

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive since domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance the accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results from studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results.
In particular, the mutation operator prevents GA from falling into local optima, so a globally optimal or near-optimal solution can be found with it. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, and their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e., credit rating, was labeled as four classes: 1 (A1), 2 (A2), 3 (A3), and 4 (B and C). For each class, 80 percent of the data was used for training and the remaining 20 percent for validation. To overcome the small sample size, we applied five-fold cross-validation to our dataset. In order to examine the competitiveness of the proposed model, we also tested several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that enables GA. The other comparative models were tested using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the comparative models.
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
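The McNemar test used for the final comparison works on the cases where two models disagree. The sketch below is a generic version under stated assumptions: the abstract does not say whether a continuity correction or an exact test was applied, so both variants are shown, and the counts and function names are hypothetical.

```python
import math

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar test.

    b = cases model A classified correctly and model B did not;
    c = cases model B classified correctly and model A did not.
    Returns (statistic, two-sided p-value) using the chi-square
    distribution with 1 degree of freedom.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))   # survival function of chi-square, 1 dof
    return stat, p_value

def mcnemar_exact(b, c):
    """Exact two-sided binomial version, suited to small discordant counts."""
    n, k = b + c, min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts between two classifiers on the same test cases.
stat, p = mcnemar_chi2(b=25, c=10)
print(round(stat, 2), round(p, 4))
```

Cases that both models classify correctly, or both misclassify, do not enter the statistic, which is why the test suits paired comparisons on a shared validation set.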