• Title/Summary/Keyword: Statistical Model Validation

A New Look at the Statistical Method for Remote Sensing of Daily Maximum Air Temperature (위성자료를 이용한 일최고온도 산출의 통계적 접근에 관한 고찰)

  • 변민정;한경수;김영섭
    • Korean Journal of Remote Sensing / v.20 no.2 / pp.65-76 / 2004
  • This study estimates daily maximum air temperature from satellite-derived surface temperature and the Elevation Derivative Database (EDD). The analysis focuses on establishing a semi-empirical estimation technique for daily maximum air temperature through multiple regression analysis, and tests the contribution of EDD when it is added to the regression model as an independent variable. The correlation is better with the EDD data than without it. To refine the estimation technique, we propose and compare three approaches: 1) seasonal estimation without considering land cover, 2) seasonal estimation considering land cover, and 3) estimation by land cover type without considering season. The last method shows the best fit, with a root-mean-square error between 0.56$^{\circ}$C and 3.14$^{\circ}$C. A cross-validation procedure is performed for the third method to validate the estimated values for two major land cover types (cropland and forest). For both land cover types, the validation results agree reasonably well with the estimation results. The proposed estimation technique is therefore considered potentially applicable to most parts of South Korea.
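
The comparison described in this abstract (a regression with and without an elevation-derived predictor, checked by cross-validation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' procedure: the variable names, the coefficients used to generate the data, and the choice of five folds are all assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def kfold_rmse(X, y, k=5, seed=0):
    """Mean RMSE on held-out folds of a k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_ols(X[train], y[train])
        scores.append(rmse(y[test], predict(coef, X[test])))
    return float(np.mean(scores))

# Synthetic illustration only (hypothetical values, not the paper's data).
rng = np.random.default_rng(42)
n = 200
lst = rng.uniform(5, 35, n)        # satellite surface temperature, deg C
elev = rng.uniform(0, 1500, n)     # elevation-derived predictor, m
tmax = 2.0 + 0.8 * lst - 0.006 * elev + rng.normal(0, 1.0, n)

rmse_lst_only = kfold_rmse(lst.reshape(-1, 1), tmax)
rmse_with_elev = kfold_rmse(np.column_stack([lst, elev]), tmax)
print(f"LST only: {rmse_lst_only:.2f}  LST + elevation: {rmse_with_elev:.2f}")
```

Because the synthetic target depends on elevation by construction, the held-out RMSE drops when the elevation predictor is added; with real data the size of that improvement is an empirical question.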

Evaluation of Correlation between Chlorophyll-a and Multiple Parameters by Multiple Linear Regression Analysis (다중회귀분석을 이용한 낙동강 하류의 Chlorophyll-a 농도와 복합 영향인자들의 상관관계 분석)

  • Lim, Ji-Sung;Kim, Young-Woo;Lee, Jae-Ho;Park, Tae-Joo;Byun, Im-Gyu
    • Journal of Korean Society of Environmental Engineers / v.37 no.5 / pp.253-261 / 2015
  • In this study, a chlorophyll-a (chl-a) prediction model and the multiple parameters affecting algae occurrence at the Mulgeum site were evaluated by statistical analysis of water quality, hydraulic, and climate data collected at the site (1998-2008). Before the analysis, a control chart method and the effect period of typhoons were applied to improve the reliability of the data. After this preprocessing step, two methods were used. In Method 1, a chl-a prediction model was developed from the preprocessed data. In Method 2, another model was developed using only the significant parameters affecting chl-a after preprocessing. Correlation analysis identified water temperature, pH, DO, BOD, COD, T-N, $NO_3-N$, $PO_4-P$, flow rate, flow velocity, and water depth as significant parameters affecting chl-a concentration. The prediction models from Methods 1 and 2 showed high $R^2$ values of 0.799 and 0.790, respectively. Each prediction model was validated with data from 2009 to 2010. Method 1 showed 20.912 for the training period and 24.423 for the validation period, and Method 2 showed 21.422 and 26.277 for the same periods. BOD, DO, and $PO_4-P$ in particular played an important role in both models. The analysis of algae occurrence at the Mulgeum site therefore needs to focus on BOD, DO, and $PO_4-P$.

Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.263-286 / 2022
  • This study acquires NBA statistical information for a total of 32 years, from 1990 to 2022, using web crawling, observes variables of interest through exploratory data analysis, and generates related derived variables. Unused variables were removed from the input data through a data cleaning process, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For the variables of interest, the difference in means between the teams that advanced to the playoffs and those that did not was tested; to complement this test, the mean difference among three groups (higher/middle/lower) based on ranking was then checked as well. Of the input data, only the current season's data was used as a test set, and 5-fold cross-validation was performed by dividing the rest into training and validation sets for model training. Overfitting was checked by comparing the cross-validation results with the final results on the test set to confirm that the performance metrics did not differ. Because the quality of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results or classifies whether a team advances to the playoffs using machine learning, but also examines whether the variables of interest are among the most important variables by analyzing the importance of the input attributes. Visualizing SHAP values made it possible to overcome the limited interpretability of feature importance alone, and to compensate for the lack of consistency in importance calculations as variables are entered or removed. A number of variables related to three-pointers and turnovers, the features of interest in this study, were found to be among the major variables affecting advancement to the NBA playoffs. This study resembles existing work in sports data analysis in covering topics such as match results, playoffs, and championship prediction, and in comparing several machine learning models. It differs in that the features of interest are set in advance and statistically verified, and then compared with the machine learning results. It is also differentiated from existing studies by presenting explanatory visualizations using SHAP, one of the XAI methods.

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive because domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural networks (ANN), and multiclass support vector machines (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. The results of studies such as Shieh and Yang (2008) and Chatterjee (2013) also imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As the tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. In particular, the mutation operator helps prevent GA from being trapped in local optima, so it can find a globally optimal or near-optimal solution. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as the optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, together with their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e. credit rating, was labeled as four classes: 1 (A1); 2 (A2); 3 (A3); 4 (B and C). For each class, 80 percent of the data was used for training and the remaining 20 percent for validation. To overcome the small sample size, we applied five-fold cross-validation to our dataset. To examine the competitiveness of the proposed model, we also tested several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source package, and Evolver 5.5, a commercial package that enables GA. The other comparative models were tested using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models. In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same across the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
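
The McNemar test mentioned at the end of this abstract compares two classifiers on the same validation cases by looking only at the cases where exactly one of them is correct. The sketch below uses the common continuity-corrected chi-square form; whether the paper used this exact variant is not stated, and the counts in the example are hypothetical.

```python
import math

def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired correctness flags.

    correct_a[i] and correct_b[i] say whether classifiers A and B
    classified validation case i correctly.
    Returns (chi-square statistic, two-sided p-value).
    """
    # Discordant pairs: only A correct (b) and only B correct (c).
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if not a_ok and b_ok)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square survival function with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical outcome on 100 validation cases (not the paper's results):
# 25 cases only A gets right, 8 only B gets right, 60 both, 7 neither.
correct_a = [True] * 25 + [False] * 8 + [True] * 60 + [False] * 7
correct_b = [False] * 25 + [True] * 8 + [True] * 60 + [False] * 7
stat, p_value = mcnemar_test(correct_a, correct_b)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

Cases that both models classify correctly, or both misclassify, do not affect the statistic, which is why the test suits paired comparisons on a shared validation set.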

Development and validation of Accident Modification Factors of Two-Lane Rural Roadways (지방부 2차로 도로의 사고예측계수 개발 및 검증)

  • Kim, Eung-Cheol;Choe, Eun-Jin;Lee, Dong-Min;Kim, Do-Hun
    • Journal of Korean Society of Transportation / v.28 no.3 / pp.131-143 / 2010
  • This study aims to develop accident modification factors (AMFs) for rural two-lane roadway segments. An accident modification factor is a coefficient used to assess roadway safety by reflecting the characteristics of a homogeneous roadway segment; combined with a developed base model and exposure, it is used to estimate the accident frequency of roadway segments. Through statistical models and a literature review, we identified factors such as crosswalks, driveway density, topographic characteristics, land use, and medians. The accident modification factors were developed using statistical modeling and validated through applicability analyses and expert judgement. Although the expert judgement for the land use item was questionable, most items were rated acceptable. The comparative analysis showed that the crash frequencies estimated by IHSDM and KHSEM were closest to the actual values, but that the accident distribution of KHSEM was more appropriate than that of IHSDM. Overall, the values estimated by RSDS were found to be overestimates.

Implementation on the evolutionary machine learning approaches for streamflow forecasting: case study in the Seybous River, Algeria (유출예측을 위한 진화적 기계학습 접근법의 구현: 알제리 세이보스 하천의 사례연구)

  • Zakhrouf, Mousaab;Bouchelkia, Hamid;Stamboul, Madani;Kim, Sungwon;Singh, Vijay P.
    • Journal of Korea Water Resources Association / v.53 no.6 / pp.395-408 / 2020
  • This paper develops and applies three machine learning approaches, namely artificial neural networks (ANN), adaptive neuro-fuzzy inference systems (ANFIS), and wavelet-based neural networks (WNN), combined with an evolutionary optimization algorithm and k-fold cross-validation, for multi-step (days) streamflow forecasting at a catchment in Algeria, North Africa. The ANN and ANFIS models yielded similar performance in the training and testing phases, based on four statistical indices: root mean squared error (RMSE), Nash-Sutcliffe efficiency (NSE), correlation coefficient (R), and peak flow criteria (PFC). The values of RMSE and PFC for the WNN model (e.g., RMSE = 8.590 ㎥/sec, PFC = 0.252 for (t+1) day, testing phase) were lower than those of the ANN (e.g., RMSE = 19.120 ㎥/sec, PFC = 0.446 for (t+1) day, testing phase) and ANFIS (e.g., RMSE = 18.520 ㎥/sec, PFC = 0.444 for (t+1) day, testing phase) models, while the values of NSE and R for the WNN model were higher than those of the ANN and ANFIS models. Therefore, the new approach can be a robust tool for multi-step (days) streamflow forecasting in the Seybous River, Algeria.
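
Three of the four indices named in this abstract have standard definitions and can be computed as in the sketch below. The peak flow criteria (PFC) is left out because the abstract does not define it. The observed and simulated flows in the example are made-up numbers for illustration.

```python
import numpy as np

def rmse(obs, sim):
    """Root mean squared error, in the units of the flow series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated flows."""
    return float(np.corrcoef(obs, sim)[0, 1])

# Hypothetical flows (m^3/s): the simulation is biased high by a constant 2.
obs = np.array([10.0, 20.0, 30.0, 40.0])
sim = obs + 2.0
print(rmse(obs, sim), nse(obs, sim), pearson_r(obs, sim))
```

The example shows why the indices are reported together: a constant bias leaves R at 1 while RMSE and NSE both register the error.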

Uncertainty Calculation Algorithm for the Estimation of the Radiochronometry of Nuclear Material (핵물질 연대측정을 위한 불확도 추정 알고리즘 연구)

  • JaeChan Park;TaeHoon Jeon;JungHo Song;MinSu Ju;JinYoung Chung;KiNam Kwon;WooChul Choi;JaeHak Cheong
    • Journal of Radiation Industry / v.17 no.4 / pp.345-357 / 2023
  • Nuclear forensics is regarded internationally as a mandatory component of nuclear material control and non-proliferation verification. Radiochronometry for nuclear forensics uses the decay series characteristics of nuclear materials and the Bateman equation to estimate when nuclear materials were purified and produced. Radiochronometry values carry measurement uncertainty arising from uncertainty factors in the estimation process. These uncertainties should be calculated using appropriate evaluation methods that represent the accuracy and reliability of the result. The IAEA, the US, and the EU have conducted research on radiochronometry and measurement uncertainty. However, the uncertainty calculation method using the Bateman equation is limited by the underestimation of the decay constant and the impossibility of estimating the age of more than one generation, which highlights the need for uncertainty calculation research using computational simulation, such as the Monte Carlo method, to overcome these limitations. In this study, we analyzed mathematical models and the LHS (Latin Hypercube Sampling) method to enhance the reliability of radiochronometry, with the aim of developing an uncertainty algorithm for nuclear material radiochronometry using the Bateman equation. We analyzed the LHS method, which can obtain effective statistical results with a small number of samples, and applied it to Monte Carlo algorithms for uncertainty calculation by computer simulation. This was implemented in the MATLAB computational software. The uncertainty calculation model based on mathematical models showed characteristics that depend on the relationship between sensitivity coefficients and radioactive equilibrium. The random sampling in the computational simulation showed characteristics that depend on the random sampling method, the number of sampling iterations, and the probability distribution of the uncertainty factors. For validation, we compared models from various international organizations, mathematical models, and the Monte Carlo method. The developed algorithm was found to perform calculations at a level of accuracy equivalent to that of overseas institutions and mathematical model-based methods. To enhance usability, future research, comparisons, and validations need to incorporate more complex decay chains and non-homogeneous conditions. The results of this study can serve as foundational technology in the nuclear forensics field, providing tools for the identification of signature nuclides and aiding in the research, development, comparison, and validation of related technologies.
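
The combination of Latin Hypercube Sampling with Monte Carlo uncertainty propagation described above can be sketched for the simplest case, a single parent-daughter pair with no daughter present at purification. The paper's implementation is in MATLAB and is not reproduced here; the half-lives, the true age, the relative uncertainties, and the assumption of normally distributed inputs are all hypothetical choices made for this example.

```python
from statistics import NormalDist
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """One uniform draw from each of n_samples equal-probability strata,
    shuffled independently in every dimension."""
    u = np.empty((n_samples, n_dims))
    for d in range(n_dims):
        strata = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
        u[:, d] = rng.permutation(strata)
    return u

def age_from_ratio(ratio, lam_p, lam_d):
    """Invert the two-member Bateman solution (no daughter at t = 0):
    ratio = lam_p / (lam_d - lam_p) * (1 - exp(-(lam_d - lam_p) * t))."""
    delta = lam_d - lam_p
    return -np.log(1.0 - ratio * delta / lam_p) / delta

# Hypothetical nuclide pair and measurement (illustration only).
lam_p0 = np.log(2.0) / 100.0     # parent decay constant, half-life 100 y
lam_d0 = np.log(2.0) / 1000.0    # daughter decay constant, half-life 1000 y
true_age = 20.0                  # years since purification
delta0 = lam_d0 - lam_p0
ratio0 = lam_p0 / delta0 * (1.0 - np.exp(-delta0 * true_age))

means = np.array([ratio0, lam_p0, lam_d0])
rel_sd = np.array([0.01, 0.001, 0.001])   # assumed relative standard uncertainties

rng = np.random.default_rng(0)
u = np.clip(latin_hypercube(1000, 3, rng), 1e-12, 1.0 - 1e-12)
z = np.vectorize(NormalDist().inv_cdf)(u)  # map strata to standard normal
samples = means * (1.0 + rel_sd * z)
ages = age_from_ratio(samples[:, 0], samples[:, 1], samples[:, 2])
age_mean, age_sd = float(ages.mean()), float(ages.std(ddof=1))
print(f"age = {age_mean:.2f} +/- {age_sd:.2f} y")
```

Stratifying each input guarantees that the whole range of every distribution is covered even with a modest sample count, which is the property the abstract refers to when it says LHS gives effective statistics with few samples.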

Fatigue life prediction based on Bayesian approach to incorporate field data into probability model

  • An, Dawn;Choi, Joo-Ho;Kim, Nam H.;Pattabhiraman, Sriram
    • Structural Engineering and Mechanics / v.37 no.4 / pp.427-442 / 2011
  • In fatigue life design of mechanical components, uncertainties arising from materials and manufacturing processes should be taken into account to ensure reliability. A common practice is to apply a safety factor in conjunction with a physics model for evaluating the lifecycle, which most likely relies on the designer's experience. Because of this conservative design, predictions often disagree with field observations, which makes it difficult to schedule maintenance. In this paper, the Bayesian technique, which incorporates field failure data into prior knowledge, is used to obtain a more dependable prediction of fatigue life. The effects of prior knowledge, noise in data, and bias in measurements on the distribution of fatigue life are discussed in detail. By assuming a distribution type for fatigue life, its parameters are identified first, followed by estimation of the distribution of fatigue life, which represents the degree of belief in the fatigue life conditional on the observed data. As more data are provided, the values are updated and the credible interval narrows. The results can be used for various purposes such as risk analysis, reliability-based design optimization, maintenance scheduling, or validation of reliability analysis codes. To obtain the posterior distribution, the Markov Chain Monte Carlo technique is employed, a modern computational statistics method that effectively draws samples from a given distribution. Field data from turbine components, consisting of the number of failed blades in a turbine disk counted at regular inspections, are used to illustrate the approach.
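
The Bayesian update with Markov Chain Monte Carlo described in this abstract can be illustrated with a random-walk Metropolis sampler on a deliberately simplified model. Everything specific below is an assumption for the example rather than the paper's setup: a lognormal fatigue life with known log-scale spread, a normal prior on the log-median life, a single inspection, and hypothetical failure counts.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_posterior(mu, k, n, t_insp, sigma, prior_mean, prior_sd):
    """Unnormalized log posterior of the log-median life mu.

    Likelihood: k of n blades have failed by inspection time t_insp,
    with failure probability given by the lognormal life CDF.
    """
    p = normal_cdf((math.log(t_insp) - mu) / sigma)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep the logs finite
    log_lik = k * math.log(p) + (n - k) * math.log(1.0 - p)
    log_prior = -0.5 * ((mu - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior

def metropolis(log_post, start, step, n_iter, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, lp = start, log_post(start)
    chain = []
    for _ in range(n_iter):
        cand = x + rng.gauss(0.0, step)
        lp_cand = log_post(cand)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        chain.append(x)
    return chain

# Hypothetical field data: 3 of 50 blades failed at a 1000-cycle inspection.
target = lambda mu: log_posterior(mu, k=3, n=50, t_insp=1000.0,
                                  sigma=0.5, prior_mean=8.0, prior_sd=0.5)
chain = metropolis(target, start=8.0, step=0.2, n_iter=20000)
posterior = chain[2000:]                  # discard burn-in
post_mean = sum(posterior) / len(posterior)
print(f"posterior mean of log-median life: {post_mean:.2f}")
```

With these numbers the field data pull the estimate below the prior mean of 8.0, which mirrors the abstract's point that field observations correct a prior based on design assumptions.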

Prediction of BaP and Total PAH in Soil from Pyr Concentration using Regression Analysis (회귀분석을 통한 토양 내 Pyr 농도로부터 BaP와 총 PAH의 예측기법)

  • Lee, Woo-Bum;Kim, Jongo
    • Journal of Korean Society of Environmental Engineers / v.39 no.3 / pp.118-123 / 2017
  • This study investigated the feasibility of a statistical approach for predicting BaP and total PAHs from pyrogenic sources. The regression analysis showed excellent linear and multiple correlations ($r^2$ > 0.94) between BaP (or ${\Sigma}PAH$) and Pyr concentrations. When the developed prediction equation was applied to data from other investigations for validation and application, outstanding prediction results were obtained. The predictive model showed very good correlation between the measured and calculated ${\Sigma}PAH$. According to this equation, Pyr is an important hydrocarbon for the prediction of PAH. This model might provide a useful tool for calculating average BaP and ${\Sigma}PAH$ in a given region without additional tests.

Improvement of a Context-aware Recommender System through User's Emotional State Prediction (사용자 감정 예측을 통한 상황인지 추천시스템의 개선)

  • Ahn, Hyunchul
    • Journal of Information Technology Applications and Management / v.21 no.4 / pp.203-223 / 2014
  • This study proposes a novel context-aware recommender system, which is designed to recommend items according to the customer's responses to the previously recommended item. Specifically, the proposed system predicts the user's emotional state from his or her responses (such as facial expressions and movements) to the previously recommended item, and recommends items similar to the previous one when the emotional state is estimated as positive. If the customer's emotional state toward the previously recommended item is regarded as negative, the system recommends items whose characteristics are opposite to those of the previous item. The proposed system consists of two submodules: (1) an emotion prediction module and (2) a responsive recommendation module. The emotion prediction module contains the emotion prediction model, which predicts a customer's arousal level (a physiological and psychological state of being awake or reactive to stimuli) using the customer's reaction data, including facial expressions and body movements, which can be measured using Microsoft's Kinect Sensor. The responsive recommendation module generates a recommendation list using the results from the emotion prediction module. If a customer shows a high level of arousal toward the previously recommended item, the module recommends the items that are most similar to that item. Otherwise, it recommends the items that are most dissimilar to it. To validate the performance and usefulness of the proposed recommender system, we conducted an empirical validation in which 30 undergraduate students participated. We used 100 trailers of Korean movies released from 2009 to 2012 as the items for recommendation. For the experiment, we manually constructed a Korean movie trailer DB containing fields such as release date, genre, director, writer, and actors. To check whether recommendation using customers' responses outperforms recommendation using their demographic information, we compared the two. The performance of the recommendation was measured using two metrics: satisfaction and arousal levels. Experimental results showed that recommendation using customers' responses (i.e. our proposed system) outperformed recommendation using their demographic information with statistical significance.
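
The rule used by the responsive recommendation module (most similar items after a high-arousal response, most dissimilar items otherwise) can be sketched as below. The abstract does not say how similarity is measured, so cosine similarity over numeric item features is an assumption here, as are the function name and the toy feature vectors.

```python
import numpy as np

def recommend(prev_idx, item_features, arousal_is_high, top_n=3):
    """Return indices of the next items to recommend.

    High arousal toward the previous item -> items most similar to it;
    otherwise -> items most dissimilar to it.
    Feature vectors are assumed to be non-zero.
    """
    feats = np.asarray(item_features, dtype=float)
    norms = np.linalg.norm(feats, axis=1)
    # Cosine similarity of every item to the previously recommended item.
    sims = feats @ feats[prev_idx] / (norms * norms[prev_idx])
    candidates = [i for i in range(len(feats)) if i != prev_idx]
    candidates.sort(key=lambda i: sims[i], reverse=arousal_is_high)
    return candidates[:top_n]

# Toy item features, e.g. genre weights (hypothetical).
items = [
    [1.0, 0.0, 0.0],   # item 0: previously recommended
    [0.9, 0.1, 0.0],   # item 1: very close to item 0
    [0.0, 1.0, 0.0],   # item 2: unrelated
    [0.0, 0.0, 1.0],   # item 3: unrelated
    [0.5, 0.5, 0.0],   # item 4: partly related
]
print(recommend(0, items, arousal_is_high=True, top_n=2))
print(recommend(0, items, arousal_is_high=False, top_n=2))
```

In the described system the arousal flag would come from the emotion prediction module; here it is passed in directly so the ranking rule can be read on its own.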