• Title/Summary/Keyword: statistical prediction


Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
• Journal of Intelligence and Information Systems, v.28 no.1, pp.263-286, 2022
  • This study acquires 32 years of NBA statistical information, from 1990 to 2022, using web crawling, observes the variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed from the input data in a cleaning step, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For each variable of interest, the difference in means between teams that advanced to the playoffs and those that did not was tested, and this was then cross-checked by testing the mean differences among three groups (upper/middle/lower) based on ranking. Of the input data, only the current season's data was used as the test set, and 5-fold cross-validation was performed by splitting the remainder into training and validation sets. Overfitting was ruled out by confirming that the cross-validation results and the final results on the test set showed no difference in the performance metrics. Because the raw data are of high quality and the statistical assumptions are satisfied, most models performed well despite the small dataset. This study not only predicts NBA game results and classifies playoff advancement using machine learning, but also examines, through the importance of the input attributes, whether the variables of interest rank among the most important variables. Visualizing SHAP values overcame the limitation that feature-importance scores alone cannot be interpreted, and compensated for the inconsistency of importance calculations when variables are added or removed. A number of variables related to three-pointers and turnovers, the subjects of interest in this study, were found to be among the major variables affecting playoff advancement in the NBA. While this study resembles existing sports-analytics work in covering match results, playoff, and championship prediction and in comparing several machine learning models, it differs in that the features of interest were set in advance and statistically verified before being compared with the machine learning results. It is further differentiated from existing studies by presenting explanatory visualizations using SHAP, one of the XAI techniques.
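
The workflow this abstract describes, k-fold cross-validation of tree-based classifiers followed by SHAP-based interpretation, might be sketched as follows. This is a minimal illustration, not the authors' code: the file name, feature columns, and model choice are assumptions, and it relies on scikit-learn plus the `shap` package.

```python
# Minimal sketch: 5-fold CV on a tree model, then SHAP values for
# interpretable feature importance. Data loading is hypothetical.
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("nba_team_seasons.csv")          # hypothetical file
X = df.drop(columns=["playoffs"])                 # e.g. 3P%, turnovers, ...
y = df["playoffs"]                                # 1 = advanced to playoffs

model = RandomForestClassifier(n_estimators=500, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit on the full training data and explain with SHAP.
model.fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # per-feature impact on the prediction
```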

Machine learning-based corporate default risk prediction model verification and policy recommendation: Focusing on improvement through stacking ensemble model (머신러닝 기반 기업부도위험 예측모델 검증 및 정책적 제언: 스태킹 앙상블 모델을 통한 개선을 중심으로)

  • Eom, Haneul;Kim, Jaeseong;Choi, Sangok
• Journal of Intelligence and Information Systems, v.26 no.2, pp.105-129, 2020
  • This study uses corporate data from 2012 to 2018, the period when K-IFRS was applied in earnest, to predict default risk. The data used in the analysis totaled 10,545 rows and 160 columns: 38 from the statement of financial position, 26 from the statement of comprehensive income, 11 from the statement of cash flows, and 76 financial-ratio indices. Unlike most prior studies, which used the default event itself as the learning target, this study calculated default risk from each company's market capitalization and stock price volatility based on the Merton model. This solved the data-imbalance problem caused by the scarcity of default events, which had been pointed out as a limitation of the existing methodology, as well as the problem of reflecting the differences in default risk that exist among ordinary companies. Because learning was conducted using only corporate information available for unlisted companies, the default risk of unlisted companies without stock price information can also be derived appropriately. This makes it possible to provide stable default risk assessment services to unlisted companies, such as small and medium-sized companies and startups, whose default risk is difficult to determine with traditional credit rating models. Although predicting corporate default risk with machine learning has been studied actively in recent years, model bias remains an issue because most studies make predictions with a single model. A stable and reliable valuation methodology is required for calculating default risk, given that a company's default risk information is used very widely in the market and sensitivity to differences in default risk is high; strict standards are likewise required for the calculation method. The credit rating method stipulated by the Financial Services Commission in the Financial Investment Regulations calls for the preparation of evaluation methods, including verification of their adequacy, in consideration of past statistical data and experience with credit ratings and of changes in future market conditions. This study reduced individual models' bias by using stacking ensemble techniques that synthesize various machine learning models. This captures the complex nonlinear relationships between default risk and corporate information while retaining the advantages of machine learning-based default risk prediction models, which take little time to compute. To produce the sub-model forecasts used as input to the stacking ensemble model, the training data were divided into seven pieces, and the sub-models were trained on the divided sets to produce forecasts. To compare predictive power, Random Forest, MLP, and CNN models were trained on the full training data, and the predictive power of each model was then verified on the test set. The analysis showed that the stacking ensemble model exceeded the predictive power of the Random Forest model, the best-performing single model. Next, to check for statistically significant differences between the stacking ensemble model's forecasts and those of each individual model, pairs between the stacking ensemble model and each individual model were constructed.
Because the Shapiro-Wilk normality tests showed that none of the pairs followed a normal distribution, the nonparametric Wilcoxon rank-sum test was used to check whether the two forecasts making up each pair differed significantly. The analysis showed that the stacking ensemble model's forecasts differed significantly from those of the MLP and CNN models. In addition, this study provides a methodology that allows existing credit rating agencies to apply machine learning-based default risk prediction, given that traditional credit rating models can also be included as sub-models when calculating the final default probability. The stacking ensemble technique proposed in this study can also help designs meet the requirements of the Financial Investment Business Regulations through the combination of various sub-models. We hope that this research will be used as a resource to increase practical adoption by overcoming and improving on the limitations of existing machine learning-based models.
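
A rough sketch of the comparison pipeline described above: train a stacking ensemble over sub-models, then test the paired out-of-sample forecasts with a Shapiro-Wilk normality check followed by the nonparametric rank-sum test, as the abstract reports. The data, sub-model choices, and names below are illustrative assumptions, not the study's actual setup.

```python
# Sketch: stacking ensemble vs. single models, then Shapiro-Wilk and
# a Wilcoxon rank-sum test on the paired forecasts (per the abstract).
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # stand-in features
y = X @ rng.normal(size=20) + rng.normal(size=1000)  # stand-in default risk

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

sub_models = [("rf", RandomForestRegressor(random_state=0)),
              ("mlp", MLPRegressor(max_iter=1000, random_state=0))]
stack = StackingRegressor(estimators=sub_models, final_estimator=Ridge(),
                          cv=7)  # 7 folds, echoing the 7-way data split
stack.fit(X_tr, y_tr)
stack_pred = stack.predict(X_te)

for name, m in sub_models:
    m.fit(X_tr, y_tr)
    pair_diff = stack_pred - m.predict(X_te)
    _, p_norm = stats.shapiro(pair_diff)             # normality of the pair
    _, p_rank = stats.ranksums(stack_pred, m.predict(X_te))
    print(f"{name}: Shapiro p={p_norm:.3f}, rank-sum p={p_rank:.3f}")
```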

A Statistical Model to Predict Soil Temperature by Combining the Yearly Oscillation Fourier Expansion and Meteorological Factors (연주기(年週期) Fourier 함수(函數)와 기상요소(氣象要素)에 의(依)한 지온예측(地溫豫測) 통계(統計) 모형(模型))

  • Jung, Yeong-Sang;Lee, Byun-Woo;Kim, Byung-Chang;Lee, Yang-Soo;Um, Ki-Tae
• Korean Journal of Soil Science and Fertilizer, v.23 no.2, pp.87-93, 1990
  • A statistical model to predict soil temperature from ambient meteorological factors, including mean, maximum, and minimum air temperatures, precipitation, wind speed, and snow depth, combined with a Fourier time-series expansion, was developed with data measured at the Suwon Meteorological Service from 1979 to 1988. Stepwise elimination was used for the statistical analysis. For the yearly-oscillation model of soil temperature with 8 Fourier terms, the mean square error decreased with soil depth, from 2.30 for the surface temperature to 1.34-0.42 for the 5 to 500-cm soil temperatures; $r^2$ ranged from 0.913 to 0.988. The number of lag days of air temperature found by residual analysis was 0 days for the soil surface temperature, -1 day for the 5 to 30-cm soil temperatures, and -2 days for the 50-cm soil temperature. The number of lag days for precipitation, snow depth, and wind speed was -1 day for the 0 to 10-cm soil temperatures and -2 to -3 days for the 30 to 50-cm soil temperatures. For the statistical soil temperature prediction model combining the yearly-oscillation terms with the meteorological factors as residual terms, using the lag days obtained above, the mean square error was 1.64 for the soil surface temperature and ranged from 1.34 to 0.42 for the 5 to 500-cm soil temperatures. A model test against 1978 data, independent of model development, showed good agreement, with $r^2$ ranging from 0.976 to 0.996. The magnitudes of the coefficients implied that daily meteorological variables might affect soil temperature down to a depth of 30 to 50 cm. Solar radiation was not included in the models as an independent variable; however, a separate analysis of the relationship between the difference ${\Delta}T_{mxs}$ between the maximum soil temperature and the maximum air temperature and solar radiation $R_s$ ($J\,m^{-2}$) under a corn canopy showed the linear relationships ${\Delta}T_{mxs}=0.902+1.924{\times}10^{-3}R_s$ for leaf area index lower than 2 and ${\Delta}T_{mxs}=0.274+8.881{\times}10^{-4}R_s$ for leaf area index higher than 2.
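
The yearly-oscillation component of such a model is a truncated Fourier series in day of year. A least-squares fit of the 8-term (4-harmonic) form described above might look like the sketch below; the temperature series is synthetic, standing in for the Suwon measurements.

```python
# Sketch: fit a yearly-oscillation Fourier model
#   T(d) = a0 + sum_{k=1..4} [a_k cos(2*pi*k*d/365) + b_k sin(2*pi*k*d/365)]
# to daily soil temperature by ordinary least squares.
import numpy as np

def fourier_design(day_of_year, n_harmonics=4, period=365.0):
    cols = [np.ones_like(day_of_year, dtype=float)]
    for k in range(1, n_harmonics + 1):
        w = 2.0 * np.pi * k * day_of_year / period
        cols += [np.cos(w), np.sin(w)]
    return np.column_stack(cols)

# Synthetic observations: one year of daily 5-cm soil temperature.
days = np.arange(1, 366)
temp = 12 + 10 * np.sin(2 * np.pi * (days - 110) / 365) \
          + np.random.default_rng(1).normal(scale=1.0, size=days.size)

A = fourier_design(days)
coef, *_ = np.linalg.lstsq(A, temp, rcond=None)
mse = np.mean((temp - A @ coef) ** 2)
print(f"MSE of yearly-oscillation fit: {mse:.2f}")
```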


A study on the Degradation and By-products Formation of NDMA by the Photolysis with UV: Setup of Reaction Models and Assessment of Decomposition Characteristics by the Statistical Design of Experiment (DOE) based on the Box-Behnken Technique (UV 공정을 이용한 N-Nitrosodimethylamine (NDMA) 광분해 및 부산물 생성에 관한 연구: 박스-벤켄법 실험계획법을 이용한 통계학적 분해특성평가 및 반응모델 수립)

  • Chang, Soon-Woong;Lee, Si-Jin;Cho, Il-Hyoung
• Journal of Korean Society of Environmental Engineers, v.32 no.1, pp.33-46, 2010
  • We investigated the decomposition characteristics and by-products of N-nitrosodimethylamine (NDMA) in a UV process using a design of experiments (DOE) based on the Box-Behnken design. The main factors were UV intensity ($X_1$, range: $1.5{\sim}4.5\;mW/cm^2$), NDMA concentration ($X_2$, range: 100~300 uM), and pH ($X_3$, range: 3~9), each at 3 levels, and four responses were set up to estimate prediction models and optimum conditions: $Y_1$ (% NDMA removal), $Y_2$ (dimethylamine (DMA) formation, uM), $Y_3$ (dimethylformamide (DMF) formation, uM), and $Y_4$ ($NO_2$-N formation, uM). The prediction models and the optimum points obtained by canonical analysis were: $Y_1$ [% NDMA removal] $=117+21X_1-0.3X_2-17.2X_3+2.43X_1^2+0.001X_2^2+3.2X_3^2-0.08X_1X_2-1.6X_1X_3-0.05X_2X_3$ ($R^2$ = 96%, adjusted $R^2$ = 88%), maximum 99.3% at $X_1=4.5\;mW/cm^2$, $X_2=190\;uM$, $X_3=3.2$; $Y_2$ [DMA conc.] $=-101+18.5X_1+0.4X_2+21X_3-3.3X_1^2-0.01X_2^2-1.5X_3^2-0.01X_1X_2+0.07X_1X_3-0.01X_2X_3$ ($R^2$ = 99.4%, adjusted $R^2$ = 95.7%), 35.2 uM at $X_1=3\;mW/cm^2$, $X_2=220\;uM$, $X_3=6.3$; $Y_3$ [DMF conc.] $=-6.2+0.2X_1+0.02X_2+2X_3-0.26X_1^2-0.01X_2^2-0.2X_3^2-0.004X_1X_2+0.1X_1X_3-0.02X_2X_3$ ($R^2$ = 98%, adjusted $R^2$ = 94.4%), 3.7 uM at $X_1=4.5\;mW/cm^2$, $X_2=290\;uM$, $X_3=6.2$; and $Y_4$ [$NO_2$-N conc.] $=-25+12.2X_1+0.15X_2+7.8X_3+1.1X_1^2+0.001X_2^2-0.34X_3^2+0.01X_1X_2+0.08X_1X_3-3.4X_2X_3$ ($R^2$ = 98.5%, adjusted $R^2$ = 95.7%), 74.5 uM at $X_1=4.5\;mW/cm^2$, $X_2=220\;uM$, $X_3=3.1$. This study demonstrates that response surface methodology and the Box-Behnken statistical experiment design can provide statistically reliable results for the decomposition and by-products of NDMA under UV photolysis, and for determining the optimum conditions. Predictions obtained from the response functions were in good agreement with the experimental results, indicating the reliability of the methodology used.
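
As a generic illustration of how such a full quadratic response-surface model is fitted to designed-experiment data, the sketch below performs an ordinary least-squares fit over three factors; the run matrix and responses are hypothetical stand-ins, not the study's measurements.

```python
# Sketch: fit a full quadratic response surface
#   Y = b0 + sum b_i X_i + sum b_ii X_i^2 + sum b_ij X_i X_j
# to Box-Behnken-style data with three factors.
import numpy as np

def quadratic_design(X):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([np.ones(len(X)),
                            x1, x2, x3,
                            x1**2, x2**2, x3**2,
                            x1*x2, x1*x3, x2*x3])

# Hypothetical runs: (UV intensity mW/cm^2, NDMA uM, pH) -> % removal
X = np.array([[1.5, 100, 3], [4.5, 100, 6], [1.5, 300, 6], [4.5, 300, 9],
              [3.0, 100, 9], [3.0, 300, 3], [1.5, 200, 9], [4.5, 200, 3],
              [3.0, 200, 6], [3.0, 200, 6], [3.0, 200, 6]])
y = np.array([55, 92, 40, 70, 68, 60, 45, 95, 75, 77, 74], dtype=float)

A = quadratic_design(X)
b, *_ = np.linalg.lstsq(A, y, rcond=None)
ss_res = np.sum((y - A @ b) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("coefficients:", np.round(b, 3))
print("R^2 = %.3f" % (1 - ss_res / ss_tot))
```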

A Study on Developing a VKOSPI Forecasting Model via GARCH Class Models for Intelligent Volatility Trading Systems (지능형 변동성트레이딩시스템개발을 위한 GARCH 모형을 통한 VKOSPI 예측모형 개발에 관한 연구)

  • Kim, Sun-Woong
• Journal of Intelligence and Information Systems, v.16 no.2, pp.19-32, 2010
  • Volatility plays a central role in both academic and practical applications, especially in pricing financial derivative products and trading volatility strategies. This study presents a novel mechanism based on generalized autoregressive conditional heteroskedasticity (GARCH) models that enhances the performance of intelligent volatility trading systems by predicting Korean stock market volatility more accurately. In particular, we embedded the concept of volatility asymmetry, documented widely in the literature, into our model. The newly developed Korean stock market volatility index of the KOSPI 200, the VKOSPI, is used as a volatility proxy. It is the price of a linear portfolio of KOSPI 200 index options and measures the effect of the expectations of dealers and option traders on stock market volatility over 30 calendar days. The KOSPI 200 index options market started in 1997 and has become the most actively traded market in the world; its trading volume is more than 10 million contracts a day, the highest of all stock index option markets. Analyzing the VKOSPI is therefore important for understanding the volatility inherent in option prices and can offer trading ideas for futures and option dealers. Using the VKOSPI as a volatility proxy avoids the statistical estimation problems associated with other measures of volatility, since the VKOSPI is the model-free expected volatility of market participants, calculated directly from transacted option prices. This study estimates symmetric and asymmetric GARCH models for the KOSPI 200 index from January 2003 to December 2006 by the maximum likelihood procedure. The asymmetric GARCH models include the GJR-GARCH model of Glosten, Jagannathan and Runkle, the exponential GARCH model of Nelson, and the power autoregressive conditional heteroskedasticity (ARCH) model of Ding, Granger and Engle; the symmetric model is the basic GARCH(1,1). Tomorrow's forecasted value and change direction of stock market volatility are obtained by recursive GARCH specifications from January 2007 to December 2009 and are compared with the VKOSPI. Empirical results indicate that negative unanticipated returns increase volatility more than positive return shocks of equal magnitude decrease it, indicating the existence of volatility asymmetry in the Korean stock market. The point value and change direction of tomorrow's VKOSPI are estimated and forecasted by the GARCH models. A volatility trading system is developed using the forecasted change direction of the VKOSPI: if the VKOSPI is expected to rise tomorrow, a long straddle or strangle position is established; a short straddle or strangle position is taken if the VKOSPI is expected to fall. Total profit is calculated as the cumulative sum of the VKOSPI percentage changes: the absolute value of the VKOSPI percentage change is added to the trading profit if the forecasted direction is correct and subtracted from it otherwise. For the in-sample period, the power ARCH model fits best according to a statistical metric, the Mean Squared Prediction Error (MSPE), while the exponential GARCH model shows the highest Mean Correct Prediction (MCP). The power ARCH model also fits best for the out-of-sample period and provides the highest probability of predicting the direction of tomorrow's VKOSPI change. Generally, the power ARCH model shows the best fit for the VKOSPI.
All the GARCH models provide trading profits for the volatility trading system, and the exponential GARCH model shows the best performance during the in-sample period, an annual profit of 197.56%. The GARCH models also produce trading profits during the out-of-sample period, except for the exponential GARCH model; the power ARCH model shows the largest out-of-sample annual trading profit, 38%. The volatility clustering and asymmetry found in this research are reflections of volatility non-linearity. This further suggests that combining the asymmetric GARCH models with artificial neural networks could significantly enhance the performance of the suggested volatility trading system, since artificial neural networks have been shown to model nonlinear relationships effectively.
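
A minimal sketch of the symmetric baseline used here, a GARCH(1,1) model fitted by maximum likelihood followed by a one-step-ahead volatility forecast, is shown below on synthetic returns; the asymmetric variants (GJR, EGARCH, power ARCH) would modify the variance recursion.

```python
# Sketch: symmetric GARCH(1,1) fitted by maximum likelihood, then a
# one-step-ahead volatility forecast. Synthetic data, not KOSPI 200.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
r = rng.normal(scale=0.01, size=1500)       # stand-in daily returns

def neg_loglik(params, r):
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                       # enforce stationarity
    h = np.empty_like(r)
    h[0] = r.var()
    for t in range(1, len(r)):
        h[t] = omega + alpha * r[t-1]**2 + beta * h[t-1]
    return 0.5 * np.sum(np.log(2 * np.pi * h) + r**2 / h)

res = minimize(neg_loglik, x0=[1e-6, 0.05, 0.90], args=(r,),
               method="Nelder-Mead")
omega, alpha, beta = res.x

# Run the variance recursion over the sample, then forecast one step ahead.
h = r.var()
for t in range(1, len(r)):
    h = omega + alpha * r[t-1]**2 + beta * h
h_next = omega + alpha * r[-1]**2 + beta * h
print("one-step-ahead daily volatility: %.4f" % np.sqrt(h_next))
```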

Predictive Value of Serum β-hCG Level in Pregnancies following In Vitro Fertilization and Embryo Transfer (체외수정시술 후 임신된 환자에서 혈중 β-hCG 측정에 의한 임신 결과 예측에 관한 연구)

  • Kim, Seok-Hyun;Suh, Chang-Suk;Choi, Doo-Seok;Choi, Young-Min;Shin, Chang-Jae;Kim, Jung-Gu;Moon, Shin-Yong;Lee, Jin-Yong;Chang, Yoon-Seok
• Clinical and Experimental Reproductive Medicine, v.19 no.1, pp.41-48, 1992
  • The serum level of the β subunit of human chorionic gonadotropin (β-hCG) was studied to evaluate its ability to predict pregnancy outcome in 98 in vitro fertilization and embryo transfer (IVF-ET) patients using a gonadotropin-releasing hormone (GnRH) agonist. Serial serum β-hCG levels were established for 42 singleton pregnancies, 20 normal multiple pregnancies, 18 preclinical abortions, 14 clinical abortions, and 4 ectopic pregnancies. Compared with normal singleton pregnancies, multiple pregnancies showed significantly higher β-hCG levels on post-ET days 10 to 13 and days 24 to 25. Clinical abortions did not show significantly lower β-hCG levels in early pregnancy except on post-ET days 16-17, but showed significantly lower levels from post-ET day 22, compared with singleton pregnancies. Preclinical abortions showed significantly lower β-hCG levels than singleton pregnancies. Ectopic pregnancies showed lower β-hCG levels than singleton pregnancies, without statistical significance. In conclusion, determination of the serum β-hCG level in early pregnancy is a useful tool for predicting preclinical abortions and multiple pregnancies, and serial measurement of serum β-hCG levels is helpful in predicting clinical abortion.
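
The group comparisons reported here reduce to significance tests on hormone levels between outcome groups; a minimal sketch with hypothetical day-12 values (the study's actual tests and units may differ):

```python
# Sketch: compare post-ET day-12 beta-hCG between two outcome groups
# with Welch's two-sample t-test (hypothetical values, mIU/mL).
import numpy as np
from scipy import stats

singleton = np.array([48, 55, 62, 41, 70, 58, 52])   # hypothetical
multiple  = np.array([95, 120, 88, 140, 110, 102])   # hypothetical

t, p = stats.ttest_ind(singleton, multiple, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```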


Developing a Neural-Based Credit Evaluation System with Noisy Data (불량 데이타를 포함한 신경망 신용 평가 시스템의 개발)

  • Kim, Jeong-Won;Choi, Jong-Uk;Choi, Hong-Yun;Chuong, Yoon
• The Transactions of the Korea Information Processing Society, v.1 no.2, pp.225-236, 1994
  • Many research results by neural network researchers have claimed that the degree of generalization of a neural network system is higher than, or at least equal to, that of statistical methods. However, those successful results could be obtained only when the neural network was trained on appropriately sound data, containing little noisy data and being large enough to control for it. Real data used in many fields, especially business fields, are not so sound, and the network has therefore frequently failed to obtain satisfactory prediction accuracy, i.e., degree of generalization. Enhancing the degree of generalization with noisy data is discussed in this study. The suggestion for enhancing the degree of generalization, obtained through a series of experiments, is to remove inconsistent data by checking for overlaps and inconsistencies. Furthermore, the previous conclusion of other reports is also confirmed: the learning mechanism of a neural network takes the average value of two inconsistent data points included in the training set [2]. The interim results of an ongoing research project are reported in this paper: the architecture of the neural network adopted in the project, and the overall design of an on-line credit evaluation system that integrates an expert (reasoning) system with the neural network (learning) system. Another definite result corroborated through this study is that quickprop, adopted as the learning algorithm, also learns faster than backpropagation, even in a very noisy environment.
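
The cleaning step this abstract recommends, removing overlapping and inconsistent records before training, might be sketched with pandas as follows; the file and column names are hypothetical.

```python
# Sketch: drop exact duplicates, then remove records whose identical
# feature vectors carry conflicting labels (inconsistent data).
import pandas as pd

df = pd.read_csv("credit_applications.csv")       # hypothetical file
features = [c for c in df.columns if c != "label"]

df = df.drop_duplicates()                         # overlapping records

# Feature vectors mapped to more than one label are inconsistent.
n_labels = df.groupby(features)["label"].transform("nunique")
clean = df[n_labels == 1]
print(f"kept {len(clean)} of {len(df)} records after consistency check")
```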


Time-Efficient SE(Shielding Effectiveness) Prediction Method for Electrically Large Cavity (전기적으로 큰 공진기의 시간효율적인 차단 효율 계산법)

  • Han, Jun-Yong;Jung, In-Hwan;Lee, Jae-Wook;Lee, Young-Seung;Park, Seung-Keun;Cho, Choon-Sik
• The Journal of Korean Institute of Electromagnetic Engineering and Science, v.24 no.3, pp.337-347, 2013
  • It is generally well known that unavoidable high-power electromagnetic waves can cause malfunctions and failures of electronic equipment and serious damage to electronic communication systems. Hence, it is necessary to take measures against high-power electromagnetic (HPEM) waves to protect electronic devices as well as people. A topological analysis based on the Baum-Liu-Tesche (BLT) equation, which simplifies the propagation paths of the electromagnetic wave and the observation points, and the Power Balance method (PWB), which employs statistical electromagnetic analysis, are introduced to analyze a relatively electrically large cavity with little time consumption. In addition to the PWB method, full-wave results for a cylindrical cavity with apertures under plane-wave incidence are presented for comparison, together with the computation time as a function of cavity size.
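
As a rough illustration of the PWB idea, a steady-state balance between the power an aperture admits and the power the cavity dissipates, the sketch below estimates the shielding effectiveness of a single-aperture cavity. The A/4 average aperture cross-section and the wall-absorption term are textbook power-balance approximations (after Hill's formulation), and all dimensions and material values are assumed, so this is indicative only.

```python
# Sketch: Power Balance (PWB) estimate of shielding effectiveness for an
# electrically large cavity with one aperture. Approximations assumed:
#   power in through aperture  P_in = S_i * <sigma_a>,  <sigma_a> ~ A/4
#   steady state               P_in = S_c * (sigma_leak + sigma_wall)
#   SE = 10*log10(S_i/S_c) = 10*log10((sigma_leak + sigma_wall)/<sigma_a>)
import numpy as np

f       = 1e9                # frequency [Hz] (assumed)
A       = 0.01               # aperture area [m^2] (assumed)
S_wall  = 6 * 1.0            # interior wall area [m^2] (1 m cube, assumed)
sigma_c = 5.8e7              # wall conductivity [S/m] (copper, assumed)
mu0     = 4e-7 * np.pi
lam     = 3e8 / f

delta      = 1.0 / np.sqrt(np.pi * f * mu0 * sigma_c)  # skin depth
sigma_ap   = A / 4.0                                   # avg. transmission
sigma_leak = A / 4.0                                   # re-radiation loss
sigma_wall = 4 * np.pi * delta * S_wall / (3 * lam)    # wall absorption

SE = 10 * np.log10((sigma_leak + sigma_wall) / sigma_ap)
print(f"PWB shielding effectiveness ~ {SE:.1f} dB")
```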

Investigation of the incidence rate of second grade milk in dairy farms on the central-southern region of Korea (우리나라 중남부지역 젖소목장에서 이등유 발생 조사)

  • Jung, Ji-Young;Yu, Do-Hyeon;Shin, Sung-Shik;Son, Chang-Ho;Oh, Ki-Seok;Hur, Tai-Young;Jung, Young-Hun;Choi, Chang-Yong;Suh, Guk-Hyun
• Korean Journal of Veterinary Service, v.38 no.3, pp.155-162, 2015
  • The incidence of second-grade milk production on 9 dairy farms in South Korea was investigated from May 2011 to March 2012, and the serum composition of cows producing first- and second-grade milk was analyzed on 14 farms including those 9. The incidence rate of second-grade milk among 402 cows on nine dairy farms in the central and southwestern regions of Korea was 15.4%, with the highest farm rate being 34.4%. Seasonal morbidity was higher during late winter (February) and early summer (June), with high rates observed in February (32.6%) and November (33.3%). Second-grade milk was most frequently found within one month postpartum (34.1%), while only 3.5% was found during days 60~90 of the lactating period (n=785, 5 herds). Morbidity increased thereafter (P<0.05), with the highest rate observed between days 270~300 of lactation (36.1%). Acidity was not significantly different between second-grade ($0.159{\pm}0.026%$) and first-grade milk ($0.158{\pm}0.027%$). Blood serum analysis of 371 cows on the 14 dairy farms indicated that the aspartate aminotransferase (AST) level was significantly higher (P<0.001), and albumin significantly lower (P<0.001), in cows producing second-grade milk than in cows producing first-grade milk. Total protein and triglyceride were also significantly lower, along with glucose, non-esterified fatty acids, and blood urea nitrogen, in cows producing second-grade milk. Statistical analysis including sensitivity, specificity, and positive/negative predictive values showed that lactating cows with high AST and low albumin, total protein, and triglyceride levels in the serum tended to produce second-grade milk. It was concluded that serological parameters, especially liver function- and metabolism-related serum components (AST, albumin, total protein, and triglyceride), were significantly altered in cows producing second-grade milk.
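
The screening statistics cited here follow directly from a 2x2 table of test outcome versus milk grade; a small sketch with hypothetical counts:

```python
# Sketch: sensitivity, specificity, PPV and NPV from a 2x2 table.
# Hypothetical counts: test = "low serum albumin",
# condition = cow produces second-grade milk.
tp, fp, fn, tn = 40, 25, 17, 289

sensitivity = tp / (tp + fn)      # P(test+ | second-grade)
specificity = tn / (tn + fp)      # P(test- | first-grade)
ppv = tp / (tp + fp)              # P(second-grade | test+)
npv = tn / (tn + fn)              # P(first-grade  | test-)
print(f"Se={sensitivity:.2f} Sp={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```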

The Effect of the Chemical Lateral Boundary Conditions on CMAQ Simulations of Tropospheric Ozone for East Asia (동아시아지역의 CMAQ 대류권 오존 모의에 화학적 측면 경계조건이 미치는 효과)

  • Hong, Sung-Chul;Lee, Jae-Bum;Choi, Jin-Young;Moon, Kyung-Jung;Lee, Hyun-Ju;Hong, You-Deog;Lee, Suk-Jo;Song, Chang-Keun
• Journal of Korean Society for Atmospheric Environment, v.28 no.5, pp.581-594, 2012
  • The goal of this study is to investigate the effects of chemical lateral boundary conditions (CLBCs) on Community Multi-scale Air Quality (CMAQ) simulations of tropospheric ozone for East Asia. We developed a linking tool to produce CLBCs for CMAQ from the Goddard Earth Observing System-Chemistry (GEOS-Chem) global chemistry model. We examined two CLBCs: the fixed CLBC in CMAQ (CLBC-CMAQ) and the CLBC from GEOS-Chem (CLBC-GEOS). The ozone fields from the CMAQ simulations with these two CLBCs were compared against Tropospheric Emission Spectrometer (TES) satellite data, ozonesonde data, and surface measurements for May and August 2008. The CLBC-GEOS results predicted tropospheric ozone better than the CLBC-CMAQ results. The CLBC-GEOS simulation increased tropospheric ozone concentrations throughout the model domain, owing to the influence of high ozone concentrations in the upper troposphere and near the inflow (western and northern) boundaries. Statistical evaluations also showed that the CLBC-GEOS case gave better results for both the Index of Agreement (IOA) and the mean normalized bias. In the case of the IOA, the CLBC-GEOS simulation improved by about 0.3 over CLBC-CMAQ, owing to better predictions of high ozone concentrations in the upper troposphere.
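
The two evaluation statistics used here have standard definitions: Willmott's Index of Agreement, $d = 1 - \sum_i (P_i-O_i)^2 / \sum_i (|P_i-\bar{O}|+|O_i-\bar{O}|)^2$, and the mean normalized bias, $\mathrm{MNB} = \frac{1}{N}\sum_i (P_i-O_i)/O_i$. A short sketch with hypothetical paired values:

```python
# Sketch: Index of Agreement (Willmott's d) and mean normalized bias (MNB)
# for paired model predictions P and observations O.
import numpy as np

O = np.array([45.0, 60.0, 72.0, 55.0, 80.0])   # observed ozone, hypothetical
P = np.array([50.0, 58.0, 65.0, 60.0, 75.0])   # modeled ozone, hypothetical

obar = O.mean()
ioa = 1 - np.sum((P - O) ** 2) / \
          np.sum((np.abs(P - obar) + np.abs(O - obar)) ** 2)
mnb = np.mean((P - O) / O)
print(f"IOA = {ioa:.3f}, MNB = {mnb:+.3f}")
```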