• Title/Summary/Keyword: Random Error

Development of Decision Tree Software and Protein Profiling using Surface Enhanced Laser Desorption/Ionization - Time of Flight - Mass Spectrometry (SELDI-TOF-MS) in Papillary Thyroid Cancer (의사결정트리 프로그램 개발 및 갑상선유두암에서 질량분석법을 이용한 단백질 패턴 분석)

  • Yoon, Joon-Kee;Lee, Jun;An, Young-Sil;Park, Bok-Nam;Yoon, Seok-Nam
    • Nuclear Medicine and Molecular Imaging / v.41 no.4 / pp.299-308 / 2007
  • Purpose: The aim of this study was to develop bioinformatics software and to test it on serum samples from papillary thyroid cancer patients using mass spectrometry (SELDI-TOF-MS). Materials and Methods: 'Protein analysis' software, which performs decision tree analysis, was developed by customizing C4.5. Sixty-one serum samples from 27 papillary thyroid cancer patients, 17 autoimmune thyroiditis patients, and 17 controls were applied to two types of protein chips, CM10 (weak cation exchange) and IMAC3 (metal binding, Cu). Mass spectrometry was performed to reveal the protein expression profiles. Decision trees were generated with the 'Protein analysis' software, which automatically detected biomarker candidates. Validation analysis was performed for the CM10 chip by random sampling. Results: Decision tree software that can perform training and validation from profiling data was developed. For the CM10 and IMAC3 chips, 23 of 113 and 8 of 41 protein peaks, respectively, differed significantly among the 3 groups (p<0.05). The decision tree correctly classified the 3 groups with an error rate of 3.3% for CM10 and 2.0% for IMAC3, and 4 and 7 biomarker candidates were detected, respectively. In 2-group comparisons, all cancer samples were correctly discriminated from non-cancer samples (error rate = 0%), for CM10 by a single node and for IMAC3 by multiple nodes. Validation results from 5 test sets showed that SELDI-TOF-MS with the decision tree correctly differentiated cancers from non-cancers (54/55, 98%), while predictability was moderate in the 3-group classification (36/55, 65%). Conclusion: Our in-house software successfully built decision trees and detected biomarker candidates; it could therefore be useful for biomarker discovery and clinical follow-up of papillary thyroid cancer.
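The abstract mentions that cancer and non-cancer samples could be separated on the CM10 chip by a single decision node. As a rough illustration only, the sketch below searches one peak's intensities for the threshold with the highest gain ratio, the split criterion associated with C4.5. The function names, the midpoint threshold rule, and the intensity values are hypothetical and are not taken from the authors' 'Protein analysis' software.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(intensities, labels):
    """Return (threshold, gain_ratio) of the best binary split on one peak.

    Simplification: candidate thresholds are midpoints between
    consecutive distinct sorted intensity values.
    """
    pairs = sorted(zip(intensities, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        p_left = len(left) / len(pairs)
        gain = base - p_left * entropy(left) - (1 - p_left) * entropy(right)
        # Split information penalises very uneven partitions.
        split_info = entropy(["L"] * len(left) + ["R"] * len(right))
        ratio = gain / split_info
        if ratio > best[1]:
            best = ((lo + hi) / 2, ratio)
    return best


# Hypothetical peak intensities for six samples.
peak = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]
group = ["non-cancer"] * 3 + ["cancer"] * 3
threshold, ratio = best_threshold(peak, group)
```

With these made-up values the two groups do not overlap, so a single threshold separates them completely. Real spectra would normally need several nodes, as the abstract reports for the IMAC3 chip.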

A Reflectance Normalization Via BRDF Model for the Korean Vegetation using MODIS 250m Data (한반도 식생에 대한 MODIS 250m 자료의 BRDF 효과에 대한 반사도 정규화)

  • Yeom, Jong-Min;Han, Kyung-Soo;Kim, Young-Seup
    • Korean Journal of Remote Sensing / v.21 no.6 / pp.445-456 / 2005
  • Land surface parameters should be determined with sufficient accuracy because they play an important role in climate change near the ground. Because surface reflectance is strongly anisotropic, off-nadir viewing makes observations strongly dependent on the Sun-target-sensor geometry. These surface angular effects contribute to random noise. The principal objective of this study is to provide a database of accurate surface reflectance, with the angular effects removed, from MODIS 250m reflective channel data over Korea. The MODIS (Moderate Resolution Imaging Spectroradiometer) sensor provides visible and near-infrared channel reflectance at 250m resolution on a daily basis. Successive processing steps were first performed on a per-pixel basis to remove cloudy pixels. Geometric distortion was corrected by nearest-neighbor resampling using a 2nd-order polynomial obtained from the geolocation information of the MODIS data set. To correct the surface anisotropy effects, this paper applied a semi-empirical kernel-driven Bidirectional Reflectance Distribution Function (BRDF) model. The algorithm inverts the kernel-driven model for the angular components (viewing zenith angle, solar zenith angle, viewing azimuth angle, and solar azimuth angle) from the reflectance observed by the satellite. First, sets of observations collected over a 31-day period are used to fit the BRDF model. Next, nadir-view reflectance normalization is carried out by modifying the angular components separated by the BRDF model for each spectral band and each pixel. Modeled reflectance values show good agreement with measured reflectance values; the overall RMSE (Root Mean Square Error) was about 0.01 (maximum = 0.03). Finally, we provide a normalized surface reflectance database consisting of 36 images for 2001 over Korea.
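Kernel-driven BRDF models are commonly written in the linear form R = f_iso + f_vol·K_vol + f_geo·K_geo, which makes the inversion an ordinary least-squares problem per pixel and band. The sketch below assumes that form and assumes the kernel values K_vol and K_geo have already been computed from the viewing and solar angles. The abstract does not say which kernels were used, so the function names and inputs here are hypothetical.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system a.x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_kernel_model(observations):
    """Least-squares fit of R = f_iso + f_vol*K_vol + f_geo*K_geo.

    observations: list of (reflectance, k_vol, k_geo) tuples, e.g. the
    cloud-free observations of one pixel within the compositing window.
    """
    rows = [(1.0, k_vol, k_geo) for _, k_vol, k_geo in observations]
    y = [refl for refl, _, _ in observations]
    # Normal equations: (A^T A) f = A^T y
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    aty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(3)]
    return solve3(ata, aty)


def predict(params, k_vol, k_geo):
    """Model reflectance for given kernel values (e.g. those at nadir view)."""
    f_iso, f_vol, f_geo = params
    return f_iso + f_vol * k_vol + f_geo * k_geo


def rmse(measured, modeled):
    """Root mean square error between measured and modeled reflectance."""
    n = len(measured)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(measured, modeled)) / n)
```

Under these assumptions, nadir normalization amounts to calling `predict` with the kernel values evaluated at a nadir viewing geometry. `rmse` is the kind of agreement measure the abstract reports between modeled and measured reflectance.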

Retrieval of Hourly Aerosol Optical Depth Using Top-of-Atmosphere Reflectance from GOCI-II and Machine Learning over South Korea (GOCI-II 대기상한 반사도와 기계학습을 이용한 남한 지역 시간별 에어로졸 광학 두께 산출)

  • Seyoung Yang;Hyunyoung Choi;Jungho Im
    • Korean Journal of Remote Sensing / v.39 no.5_3 / pp.933-948 / 2023
  • Atmospheric aerosols not only have adverse effects on human health but also exert direct and indirect impacts on the climate system. Consequently, it is imperative to comprehend the characteristics and spatiotemporal distribution of aerosols. Numerous research endeavors have been undertaken to monitor aerosols, predominantly through the retrieval of aerosol optical depth (AOD) via satellite-based observations. Nonetheless, this approach primarily relies on a look-up table-based inversion algorithm, characterized by computationally intensive operations and associated uncertainties. In this study, a novel high-resolution AOD direct retrieval algorithm, leveraging machine learning, was developed using top-of-atmosphere reflectance data derived from the Geostationary Ocean Color Imager-II (GOCI-II), in conjunction with their differences from the past 30-day minimum reflectance, and meteorological variables from numerical models. The Light Gradient Boosting Machine (LGBM) technique was harnessed, and the resultant estimates underwent rigorous validation encompassing random, temporal, and spatial N-fold cross-validation (CV) using ground-based observation data from Aerosol Robotic Network (AERONET) AOD. The three CV results consistently demonstrated robust performance, yielding $R^2$=0.70-0.80, RMSE=0.08-0.09, and within the expected error (EE) of 75.2-85.1%. The Shapley Additive exPlanations (SHAP) analysis confirmed the substantial influence of reflectance-related variables on AOD estimation. A comprehensive examination of the spatiotemporal distribution of AOD in Seoul and Ulsan revealed that the developed LGBM model yielded results that are in close concordance with AERONET AOD over time, thereby confirming its suitability for AOD retrieval at high spatiotemporal resolution (i.e., hourly, 250 m). Furthermore, upon comparing data coverage, it was ascertained that the LGBM model enhanced data retrieval frequency by approximately 8.8% in comparison to the GOCI-II L2 AOD products, ameliorating issues associated with excessive masking over very illuminated surfaces that are often encountered in physics-based AOD retrieval processes.
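The difference between random, temporal, and spatial N-fold cross-validation lies in what is held out together: individual samples, whole dates, or whole stations. The sketch below is one plausible way to build such folds. The abstract does not describe the authors' fold construction, and the helper name and the sample list (including "StationC") are hypothetical.

```python
import random


def grouped_folds(group_keys, n_folds):
    """Yield (train_idx, test_idx) pairs; samples sharing a key stay together."""
    unique_keys = sorted(set(group_keys))
    fold_of = {key: i % n_folds for i, key in enumerate(unique_keys)}
    for fold in range(n_folds):
        test = [i for i, k in enumerate(group_keys) if fold_of[k] == fold]
        train = [i for i, k in enumerate(group_keys) if fold_of[k] != fold]
        yield train, test


# Hypothetical (station, date) pairs for collocated satellite/AERONET samples.
samples = [("Seoul", "d1"), ("Seoul", "d2"), ("Ulsan", "d1"),
           ("Ulsan", "d3"), ("StationC", "d2"), ("StationC", "d3")]

# Spatial CV: every sample from a held-out station is unseen in training.
spatial = list(grouped_folds([station for station, _ in samples], 3))

# Temporal CV: every sample from a held-out date is unseen in training.
temporal = list(grouped_folds([date for _, date in samples], 3))

# Random CV: each sample gets its own shuffled key, so folds ignore
# station and date.
keys = list(range(len(samples)))
random.Random(0).shuffle(keys)
random_cv = list(grouped_folds(keys, 3))
```

Spatial and temporal folds are the stricter tests, because the model cannot have seen other samples from the held-out station or date during training.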

Studies on the Hematology and Blood Chemistry of Korean Cattle Part II. Studies on the Blood Chemistry of Korean Cattle (한국성우(韓國成牛)의 혈액학치(血液學値) 및 혈액화학치(血液化學値)에 관한 연구(硏究) 제2보(第二報) 한국성우(韓國成牛)의 혈액화학치(血液化學値)에 관한 연구(硏究))

  • Cheong, Chang Kook
    • Korean Journal of Veterinary Research / v.5 no.1 / pp.97-123 / 1965
  • Observations were made on the blood picture of a total of 196 healthy Korean cattle, 98 males and 98 females, to determine blood chemical values and their sex differences and seasonal variations over the one-year period from December 1963 to November 1964. Blood sampling was scheduled at random in four different seasons, and the sample sizes of the two sexes in each season were made equal. The ranges and mean values of blood glucose, total serum protein, serum globulin, serum albumin, total non-protein nitrogen, blood urea nitrogen, total serum cholesterol, serum inorganic phosphorus, and serum calcium were determined in this study, and their respective standard deviations, standard errors of the mean, sex differences, and seasonal variations were as follows. 1. The blood glucose values for the male ranged from 32.8 to 70.0 mg/100cc with a mean of $49.781{\pm}0.823mg/100cc$; for the female the range was 32.0 to 64.0 mg/100cc with a mean of $47.235{\pm}0.782mg/100cc$. The sex difference was significant at the 5% level and the seasonal variation was highly significant at the 1% level. 2. The total serum protein values for the male ranged from 5.61 to 8.83 gm/100cc with a mean of $7.366{\pm}0.062gm/100cc$; for the female they ranged from 5.53 to 8.43 gm/100cc with a mean of $6.832{\pm}0.063gm/100cc$. The sex difference and seasonal variation were not significant. 3. The serum globulin values for the male ranged from 2.97 to 4.78 gm/100cc with a mean of $3.961{\pm}0.039gm/100cc$; for the female they ranged from 2.87 to 4.41 gm/100cc with a mean of $3.699{\pm}0.037gm/100cc$. The sex difference was highly significant at the 1% level and the seasonal variation was not significant. 4. The serum albumin values for the male ranged from 2.58 to 4.21 gm/100cc with a mean of $3.405{\pm}0.029gm/100cc$; for the female they ranged from 2.39 to 4.10 gm/100cc with a mean of $3.204{\pm}0.031gm/100cc$. 
The sex difference was highly significant at the 1% level and the seasonal variation was not significant. 5. The total non-protein nitrogen values for the male ranged from 19.1 to 44.8 mg/100cc with a mean of $31.166{\pm}0.582mg/100cc$; for the female the range was 15.2 to 50.5 mg/100cc with a mean of $28.896{\pm}0.673mg/100cc$. The sex difference was significant at the 5% level and the seasonal variation was highly significant at the 1% level. 6. The blood urea nitrogen values for the male ranged from 6.4 to 28.3 mg/100cc with a mean of $13.371{\pm}0.466mg/100cc$; for the female the range was 6.0 to 26.9 mg/100cc with a mean of $13.631{\pm}0.321mg/100cc$. The sex difference was not significant and the seasonal variation was highly significant at the 1% level. 7. The total serum cholesterol values for the male ranged from 60.0 to 238.6 mg/100cc with a mean of $140.897{\pm}2.826mg/100cc$; for the female they ranged from 50.0 to 243.0 mg/100cc with a mean of $124.840{\pm}3.553mg/100cc$. The sex difference and seasonal variation were highly significant at the 1% level. 8. The serum inorganic phosphorus values for the male ranged from 3.5 to 7.8 mg/100cc with a mean of $5.426{\pm}0.096mg/100cc$; for the female they ranged from 3.1 to 8.8 mg/100cc with a mean of $5.570{\pm}0.128mg/100cc$. The sex difference and seasonal variation were not significant. 9. The serum calcium values for the male ranged from 7.8 to 12.8 mg/100cc with a mean of $10.761{\pm}0.102mg/100cc$; for the female they ranged from 8.0 to 13.0 mg/100cc with a mean of $10.756{\pm}0.097mg/100cc$. The sex difference was not significant and the seasonal variation was highly significant at the 1% level. 10. The age of the test group ranged from 2 to 6 years in both sexes, and the average ages were $4.45{\pm}0.114$ years in males and $4.50{\pm}0.116$ years in females. Sex difference and seasonal variation of age were not found to be significant.
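Because each value is reported as mean ± standard error, a reader can roughly check a reported sex difference from the summary figures alone. The sketch below computes an unpaired t statistic from two means and their standard errors. This is an assumed simplification: the abstract does not state which test was used, and the original analysis presumably also accounted for season.

```python
import math


def t_from_summary(mean_a, sem_a, mean_b, sem_b):
    """Unpaired t statistic from two means and their standard errors."""
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


# Blood glucose (mg/100cc): male 49.781 ± 0.823, female 47.235 ± 0.782
t_glucose = t_from_summary(49.781, 0.823, 47.235, 0.782)
# t_glucose is about 2.24
```

With 98 animals per sex, the two-sided critical values are roughly 1.97 at the 5% level and 2.60 at the 1% level. A statistic of about 2.24 therefore matches the report that the glucose sex difference was significant at the 5% level.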

Studies on the Hematology and Blood Chemistry of Korean Cattle Part I. Studies on the Hematology of Korean Cattle (한국성우(韓國成牛)의 혈액학치(血液學値) 및 혈액화학치(血液化學値)에 관한 연구(硏究) 제1보(第一報) 한국성우(韓國成牛)의 혈액학치(血液學値)에 관한 연구(硏究))

  • Cheong, Chang Kook
    • Korean Journal of Veterinary Research / v.5 no.1 / pp.61-96 / 1965
  • Observations were made on the blood picture of a total of 196 healthy Korean cattle, 98 males and 98 females, to determine hematological values and their sex differences and seasonal variations over the one-year period from December 1963 to November 1964. Blood sampling was scheduled at random in four different seasons, and the sample sizes of the two sexes in each season were made equal. The ranges and mean values of erythrocytes, hemoglobin, hematocrit, mean corpuscular hemoglobin concentration, total white blood cell count, and differential count were determined in this study, and their respective standard deviations, standard errors of the mean, sex differences, and seasonal variations were as follows. 1. The erythrocyte count of the male showed a range of $5.0{\times}10^6/c.mm$ to $8.75{\times}10^6/c.mm$ with a mean of $6.5{\pm}0.096{\times}10^6/c.mm$. The female showed a range of $5.0{\times}10^6/c.mm$ to $8.30{\times}10^6/c.mm$ with a mean of $6.131{\pm}0.078{\times}10^6/c.mm$. There was a highly significant sex difference, and the seasonal variation was not found to be significant. 2. The hemoglobin value of the male showed a range of 9.0 g/100cc to 14.5 g/100cc with a mean of $11.074{\pm}0.143g/100cc$. The female showed a range of 9.0 g/100cc to 13.0 g/100cc with a mean of $10.745{\pm}0.034g/100cc$. There was a highly significant sex difference, and the seasonal variation was not found to be significant. 3. The hematocrit value of the male showed a range of 28% to 45% with a mean of $34.867{\pm}0.468\%$. The female showed a range of 28% to 42% with a mean of $32.888{\pm}0.322\%$. There was a highly significant sex difference, and the seasonal variation was not found to be significant. 4. The mean corpuscular hemoglobin of the male showed a range of 14.4 γγ to 19.6 γγ with a mean of $17.1{\pm}0.112$ γγ. The female showed a range of 14.7 γγ to 19.5 γγ with a mean of $17.6{\pm}0.113$ γγ. 5. 
The mean corpuscular volume of the male showed a range of $42.5{\mu}^3$ to $62.2{\mu}^3$ with a mean of $53.9{\pm}0.419{\mu}^3$. The female showed a range of $44.2{\mu}^3$ to $60.0{\mu}^3$ with a mean of $53.8{\pm}0.375{\mu}^3$. 6. The mean corpuscular hemoglobin concentration of the male showed a range of 28.1% to 34.9% with a mean of $31.4{\pm}0.161\%$. The female showed a range of 28.0% to 34.9% with a mean of $30.9{\pm}0.169\%$. 7. The total leucocyte count of the male showed a range of 4,000/c.mm to 13,100/c.mm with a mean of $9,338{\pm}218.23/c.mm$. The female showed a range of 4,000/c.mm to 14,000/c.mm with a mean of $9,338{\pm}235.90/c.mm$. The sex difference was not found to be significant, and there was a highly significant seasonal variation. 8. In the differential count of the male, the means of neutrophils, stab cells, segmented cells, lymphocytes, monocytes, eosinophils, and basophils were $31.173{\pm}0.570\%$, 0.3%, $30.867{\pm}0.564\%$, $55.112{\pm}0.603\%$, $3.745{\pm}0.082\%$, $9.867{\pm}0.422\%$, and 0.14%, respectively. The female showed means of $31.010{\pm}0.572\%$, 0.2%, $30.806{\pm}0.569\%$, $53.929{\pm}0.634\%$, $4.082{\pm}0.109\%$, $10.908{\pm}0.503\%$, and 0.12%, respectively. There was a significant sex difference in monocytes and a highly significant sex difference in eosinophils, and seasonal variations were found to be highly significant in neutrophils, monocytes, and eosinophils. 9. A hematological comparison was made between cattle infested with the so-called "small type piroplasma" and a non-infested group. The investigation showed no significant difference in red blood cell, hemoglobin, and hematocrit values between the lightly infested group and the non-infested group. 10. The age of the test group in this study ranged from 2 to 6 years in both sexes, and the average ages were $4.45{\pm}0.114$ (male) and $4.50{\pm}0.116$ (female). No significant sex difference or seasonal variation was found in the age of the test group.
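The three red cell indices in items 4 to 6 follow from the erythrocyte count, hemoglobin, and hematocrit through the standard formulas. The sketch below applies them to the reported male means. The function name is hypothetical, and an index computed from group means is only an approximation of the mean of per-animal indices that the study reports.

```python
def red_cell_indices(rbc_millions_per_cmm, hemoglobin_g_per_100cc, hematocrit_percent):
    """Return (MCV, MCH, MCHC) from the three primary measurements.

    MCV  in cubic microns (femtoliters)
    MCH  in micromicrograms (picograms)
    MCHC in percent (g per 100 cc of packed cells)
    """
    mcv = hematocrit_percent * 10 / rbc_millions_per_cmm
    mch = hemoglobin_g_per_100cc * 10 / rbc_millions_per_cmm
    mchc = hemoglobin_g_per_100cc * 100 / hematocrit_percent
    return mcv, mch, mchc


# Reported male means: 6.5 x 10^6 /c.mm, 11.074 g/100cc, 34.867 %
mcv, mch, mchc = red_cell_indices(6.5, 11.074, 34.867)
# mcv is about 53.6, mch about 17.0, mchc about 31.8
```

These values are close to the reported male means of 53.9, 17.1, and 31.4. The small gaps are expected, because a ratio of averages differs from an average of per-animal ratios.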

A Meta-analysis of Ambient Air Pollution in Relation to Daily Mortality in Seoul, 1991~1995 (메타분석 방법을 적용한 서울시 대기오염과 조기사망의 상관성 연구 (1991년~1995년))

  • Dockery, Douglas W.;Kim, Chun-Bae;Jee, Sun-Ha;Chung, Yong;Lee, Jong-Tae
    • Journal of Preventive Medicine and Public Health / v.32 no.2 / pp.177-182 / 1999
  • Objectives: To reexamine the association between air pollution and daily mortality in Seoul, Korea, using meta-analysis of data for 1991 through 1995. Methods: A separate Poisson regression analysis was conducted for each district within the metropolitan area of Seoul to regress daily death counts on levels of each ambient air pollutant, namely total suspended particulates (TSP), sulfur dioxide $(SO_2)$, and ozone $(O_3)$, controlling for variability in weather conditions. We calculated a weighted mean of the estimates, and its standard error, as a meta-analysis summary. Results: The p value from each pollutant model for testing the homogeneity assumption was small (p<0.01) because of the large disparity among district-specific estimates. Therefore, all results reported here were estimated from the random effects model. Using the weighted mean that we calculated, the relative risk of mortality for a $100{\mu}g/m^3$ increment in the 3-day moving average of TSP levels was 1.034 (95% CI 1.009-1.059). Mortality was estimated to increase 6% (95% CI 3-10%) and 3% (95% CI 0-6%) with each 50 ppb increase in the 9-day moving average of $SO_2$ and the 1-hr maximum $O_3$, respectively. Conclusions: Like most air pollution epidemiologic studies, this meta-analysis cannot avoid measurement misclassification, since no personal measurements were taken. However, we can expect measurement bias to be reduced in a district-specific estimate, since a monitoring station is more representative of the air quality of its matched district. The similarity of these results to those of previous studies indicates that air pollution at current levels has health effects in many industrialized countries, including Korea.
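The abstract describes a weighted mean of district-specific estimates, a homogeneity test, and a random effects summary. One common way to carry out those steps is the DerSimonian-Laird estimator, sketched below. The abstract does not name the estimator the authors used, so the choice is an assumption, and the district values are hypothetical log relative risks.

```python
import math


def random_effects_summary(estimates, std_errors):
    """DerSimonian-Laird pooled estimate.

    Returns (pooled, pooled_se, q, tau2), where q is Cochran's
    homogeneity statistic and tau2 the between-district variance.
    """
    k = len(estimates)
    w = [1 / se ** 2 for se in std_errors]            # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    pooled_se = math.sqrt(1 / sum(w_re))
    return pooled, pooled_se, q, tau2


# Hypothetical district-specific log relative risks and standard errors.
log_rr = [0.00, 0.10, -0.05, 0.12]
se = [0.02, 0.03, 0.02, 0.04]
pooled, pooled_se, q, tau2 = random_effects_summary(log_rr, se)

# Back-transform to a relative risk with an approximate 95% interval.
rr = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * pooled_se), math.exp(pooled + 1.96 * pooled_se))
```

When the district estimates disagree by more than their standard errors explain, `tau2` becomes positive and the pooled interval widens. This corresponds to the switch to a random effects model after the homogeneity test rejected.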

Mercury Concentrations of Black-tailed Gull Eggs Depending on the Egg-Laying Order for Marine Environmental Monitoring (연안환경 수은 모니터링용 괭이갈매기 알의 산란순서별 농도 차이)

  • Lee, Jangho;Lee, Jongchun;Jang, Heeyeon;Park, Jong-Hyouk;Choi, Jeong-Heui;Lee, Soo Yong;Shim, Kyuyoung
    • Journal of Environmental Impact Assessment / v.26 no.6 / pp.538-552 / 2017
  • In this study, total mercury (THg) in Black-tailed Gull (Larus crassirostris) eggs laid on Baengnyeongdo, West Sea of Korea, was analyzed to compare THg concentrations by egg-laying order. The first-laid eggs (mean ± standard error, $234.4{\pm}11.2ng/g\;wet$, n=18, t=8.4, p<0.01) had significantly higher THg concentrations than the second-laid eggs ($182.8{\pm}9.1ng/g\;wet$, n=18). The first-laid eggs also had higher biometric values (length $63.10{\pm}0.49mm$, t=2.4, p<0.05; width $44.51{\pm}0.19mm$, t=4.3, p<0.01; weight $65.53{\pm}0.87g$, t=4.2, p<0.01) than the second-laid eggs (length $62.37{\pm}0.40mm$, width $43.55{\pm}0.17mm$, and weight $62.48{\pm}0.72g$). These differences might be attributed to the amount of food eaten by females, which is related to the males' courtship feeding pattern (males increase the courtship feeding rate before the first eggs are laid and decrease it after the first eggs are laid). Moreover, lower food intake by females could diminish the quantity of egg albumen, which contains a protein that binds most of the methylmercury, during the period of egg production. Therefore, it is necessary to apply one egg selection method consistently within a nest (targeted selection of the first-laid or the second-laid egg, random selection, etc.) to ensure comparability of mercury concentrations among monitoring sites and monitoring years.
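The equal sample sizes (n=18 for first-laid and second-laid eggs) suggest that the two eggs came from the same nests, in which case a paired comparison would be natural. The sketch below shows a paired t statistic under that assumption. The abstract does not say which test was used, and the concentrations in the example are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic for matched measurements (e.g. two eggs per nest)."""
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical THg concentrations (ng/g wet) for three nests.
first_laid = [240.0, 228.0, 236.0]
second_laid = [190.0, 181.0, 184.0]
t_stat = paired_t(first_laid, second_laid)
```

Pairing removes the nest-to-nest variation from the comparison. When the within-nest difference is consistent, this can yield a much larger t value than an unpaired test on the same data.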

Product Recommender Systems using Multi-Model Ensemble Techniques (다중모형조합기법을 이용한 상품추천시스템)

  • Lee, Yeonjeong;Kim, Kyoung-Jae
    • Journal of Intelligence and Information Systems / v.19 no.2 / pp.39-54 / 2013
  • The recent explosive growth of electronic commerce provides customers with many advantageous purchase opportunities. In this situation, customers who do not have enough knowledge about their purchases may accept product recommendations. Product recommender systems automatically reflect users' preferences and provide recommendation lists to users. Thus, the product recommender system in an online shopping store has been known as one of the most popular tools for one-to-one marketing. However, recommender systems that do not properly reflect users' preferences cause disappointment and waste users' time. In this study, we propose a novel recommender system that uses data mining and multi-model ensemble techniques to enhance recommendation performance by reflecting users' preferences precisely. The research data were collected from a real-world online shopping store that deals in products from famous art galleries and museums in Korea. The data initially contained 5759 transactions, of which 3167 remained after deletion of null data. In this study, we transform the categorical variables into dummy variables and exclude outlier data. The proposed model consists of two steps. The first step predicts customers who have a high likelihood of purchasing products in the online shopping store. In this step, we first use logistic regression, decision trees, and artificial neural networks to predict customers who have a high likelihood of purchasing products in each product group. We perform the above data mining techniques using SAS E-Miner software. In this study, we partition the datasets into two sets, modeling and validation, for the logistic regression and decision trees. We also partition the datasets into three sets, training, test, and validation, for the artificial neural network model. The validation dataset is the same for all experiments. 
We then combine the results of each predictor using multi-model ensemble techniques such as bagging and bumping. Bagging is short for "Bootstrap Aggregation," and it combines outputs from several machine learning techniques to raise the performance and stability of prediction or classification. This technique is a special form of the averaging method. Bumping is short for "Bootstrap Umbrella of Model Parameter," and it considers only the model that has the lowest error value. The results show that bumping outperforms bagging and the other predictors except for the "Poster" product group. For the "Poster" product group, the artificial neural network model performs better than the other models. In the second step, we use market basket analysis to extract association rules for co-purchased products. We extract thirty-one association rules according to the values of the Lift, Support, and Confidence measures. We set the minimum transaction frequency to support associations at 5%, the maximum number of items in an association at 4, and the minimum confidence for rule generation at 10%. This study also excludes extracted association rules with a lift value below 1. We finally obtain fifteen association rules by excluding duplicate rules. Among the fifteen association rules, eleven rules contain associations between products in the "Office Supplies" product group, one rule includes an association between the "Office Supplies" and "Fashion" product groups, and the other three rules contain associations between the "Office Supplies" and "Home Decoration" product groups. Finally, the proposed product recommender system provides a list of recommendations to the proper customers. We test the usability of the proposed system by using a prototype and real-world transaction and profile data. To this end, we construct the prototype system using ASP, JavaScript, and Microsoft Access. 
In addition, we survey user satisfaction with the recommended product list from the proposed system and with randomly selected product lists. The survey participants are 173 persons who use MSN Messenger, Daum Café, and P2P services. We evaluate user satisfaction using a five-point Likert scale. This study also performs a paired-sample t-test on the results of the survey. The results show that the proposed model outperforms the random selection model at the 1% statistical significance level. This means that users were significantly more satisfied with the recommended product list. The results also show that the proposed system may be useful in a real-world online shopping store.
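The second step filters association rules by support, confidence, and lift. As a small illustration of how the three measures are computed for one rule, the sketch below uses five made-up baskets. The product group names come from the abstract, but the transactions and the function name are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_c = sum(1 for t in transactions if c <= set(t))
    n_both = sum(1 for t in transactions if (a | c) <= set(t))
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical baskets of product groups.
baskets = [
    {"Office Supplies", "Fashion"},
    {"Office Supplies", "Home Decoration"},
    {"Office Supplies", "Home Decoration"},
    {"Poster"},
    {"Home Decoration"},
]
support, confidence, lift = rule_metrics(
    baskets, {"Office Supplies"}, {"Home Decoration"})

# Thresholds stated in the abstract: support >= 5%, confidence >= 10%, lift >= 1.
keep_rule = support >= 0.05 and confidence >= 0.10 and lift >= 1.0
```

A lift above 1 means the consequent appears with the antecedent more often than its overall frequency would predict. This is why rules with a lift below 1 are discarded.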

Comparative evaluation of dose according to changes in rectal gas volume during radiation therapy for cervical cancer : Phantom Study (자궁경부암 방사선치료 시 직장가스 용적 변화에 따른 선량 비교 평가 - Phantom Study)

  • Choi, So Young;Kim, Tae Won;Kim, Min Su;Song, Heung Kwon;Yoon, In Ha;Back, Geum Mun
    • The Journal of Korean Society for Radiation Therapy / v.33 / pp.89-97 / 2021
  • Purpose: The purpose of this study is to compare and evaluate the dose change caused by variations in rectal gas volume that were not included in the treatment plan during radiation therapy for cervical cancer. Materials and methods: Static Intensity Modulated Radiation Therapy (S-IMRT) plans using 9 fields and Volumetric Modulated Arc Therapy (VMAT) plans using 2 full arcs were established with a treatment planning system on Computed Tomography images of a human phantom. Random gas parameters were included in the Planning Target Volume (PTV) with a maximum change of 2.0 cm in increments of 0.5 cm. Then, the Conformity Index (CI), Homogeneity Index (HI), and PTV Dmax were calculated for the target volume, and the minimum dose (Dmin), mean dose (Dmean), and maximum dose (Dmax) were calculated and compared for the organs at risk (OAR). For statistical analysis, a t-test was performed to obtain a p-value, with the significance level set to 0.05. Results: The coefficients of determination ($R^2$) of HI for S-IMRT and VMAT were 0.9423 and 0.8223, respectively, indicating a relatively clear correlation, and PTV Dmax was found to increase by up to 2.8% as the volume of a given gas parameter increased. In the OAR evaluation, the dose to the bladder did not change with gas volume, while a significant difference of more than 700 cGy in Dmean was confirmed in the rectum for both treatment plans at gas volumes of 1.0 cm or more. For all values except the Dmean of the bladder, the p-value was less than 0.05, confirming a statistically significant difference. Conclusion: When gas generation is not considered in the reference treatment plan, the dose difference at the PTV and the dose delivered to the rectum increase as the amount of gas increases. Therefore, during radiation therapy, efforts are needed to minimize the dose delivery error caused by large gas volumes in the rectum. Further studies will be necessary to evaluate dose delivery by varying not only the gas volume but also the location of the gas in the treatment field.

Ensemble Learning with Support Vector Machines for Bond Rating (회사채 신용등급 예측을 위한 SVM 앙상블학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems / v.18 no.2 / pp.29-45 / 2012
  • Bond rating is regarded as an important event for measuring the financial risk of companies and for determining the investment returns of investors. As a result, predicting companies' credit ratings by applying statistical and machine learning techniques has been a popular research topic. Statistical techniques, including multiple regression, multiple discriminant analysis (MDA), logistic models (LOGIT), and probit analysis, have traditionally been used in bond rating. However, one major drawback is that they must rely on strict assumptions. Such strict assumptions include linearity, normality, independence among predictor variables, and pre-existing functional forms relating the criterion variables and the predictor variables. These strict assumptions of traditional statistics have limited their application to the real world. Machine learning techniques used in bond rating prediction models include decision trees (DT), neural networks (NN), and the Support Vector Machine (SVM). In particular, SVM is recognized as a new and promising classification and regression analysis method. SVM learns a separating hyperplane that maximizes the margin between two categories. SVM is simple enough to be analyzed mathematically and leads to high performance in practical applications. SVM implements the structural risk minimization principle and seeks to minimize an upper bound of the generalization error. In addition, the solution of SVM may be a global optimum, and thus overfitting is unlikely to occur with SVM. Moreover, SVM does not require many data samples for training, since it builds prediction models using only some representative samples near the boundaries, called support vectors. A number of experimental studies have indicated that SVM has been successfully applied in a variety of pattern recognition fields. However, there are three major drawbacks that can be potential causes of degraded SVM performance. 
First, SVM was originally proposed for solving binary-class classification problems. Methods for combining SVMs for multi-class classification, such as One-Against-One and One-Against-All, have been proposed, but they do not improve performance in multi-class classification problems as much as SVM does for binary-class classification. Second, approximation algorithms (e.g. decomposition methods, the sequential minimal optimization algorithm) can be used to reduce computation time in multi-class problems, but they can deteriorate classification performance. Third, multi-class prediction problems suffer from the data imbalance problem, which occurs when the number of instances in one class greatly outnumbers the number of instances in another class. Such data sets often cause a default classifier to be built because of a skewed boundary, which reduces the classification accuracy of the classifier. SVM ensemble learning is one of the machine learning methods for coping with the above drawbacks. Ensemble learning is a method for improving the performance of classification and prediction algorithms. AdaBoost is one of the most widely used ensemble learning techniques. It constructs a composite classifier by sequentially training classifiers while increasing the weight on misclassified observations through iterations. Observations that are incorrectly predicted by previous classifiers are chosen more often than those that are correctly predicted. Thus, boosting attempts to produce new classifiers that are better able to predict examples for which the current ensemble's performance is poor. In this way, it can reinforce the training of the misclassified observations of the minority class. This paper proposes multiclass Geometric Mean-based Boosting (MGM-Boost) to resolve the multiclass prediction problem. 
Since MGM-Boost introduces the notion of the geometric mean into AdaBoost, its learning process can take into account the geometric mean-based accuracy and errors across multiple classes. This study applies MGM-Boost to a real-world bond rating case for Korean companies to examine its feasibility. Ten-fold cross-validation is performed three times with different random seeds to ensure that the comparison among the three classifiers does not happen by chance. For each 10-fold cross-validation, the entire data set is first partitioned into ten equal-sized sets, and then each set is in turn used as the test set while the classifier trains on the other nine sets. That is, cross-validated folds have been tested independently of each algorithm. Through these steps, we obtained results for the classifiers on each of the 30 experiments. In the comparison of arithmetic mean-based prediction accuracy between individual classifiers, MGM-Boost (52.95%) shows higher prediction accuracy than both AdaBoost (51.69%) and SVM (49.47%). MGM-Boost (28.12%) also shows higher prediction accuracy than AdaBoost (24.65%) and SVM (15.42%) in terms of geometric mean-based prediction accuracy. A t-test is used to examine whether the performance of the classifiers over the 30 folds is significantly different. The results indicate that the performance of MGM-Boost is significantly different from that of the AdaBoost and SVM classifiers at the 1% level. These results mean that MGM-Boost can provide robust and stable solutions to multi-class problems such as bond rating.
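The gap between the arithmetic and geometric mean-based accuracies reported above reflects how the geometric mean penalizes poorly predicted classes. The sketch below is one plausible reading of the two measures, taking both means over per-class accuracy (recall). The abstract does not give the exact definitions used in MGM-Boost, and the labels are hypothetical.

```python
import math
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def arithmetic_mean_accuracy(y_true, y_pred):
    """Arithmetic mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1 / len(recalls))


# Hypothetical imbalanced ratings: the classifier always predicts class "A".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
arith = arithmetic_mean_accuracy(y_true, y_pred)   # 0.5
geo = geometric_mean_accuracy(y_true, y_pred)      # 0.0
```

A classifier that ignores the minority class still scores 0.5 on the arithmetic mean but drops to zero on the geometric mean. This illustrates why a geometric mean criterion pushes boosting toward balanced performance across rating classes.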