• Title/Summary/Keyword: Statistical learning model

A fast approximate fitting for mixture of multivariate skew t-distribution via EM algorithm

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.2
    • /
    • pp.255-268
    • /
    • 2020
  • Mixtures of multivariate canonical fundamental skew t-distributions (CFUST) have attracted interest in various fields, notably in the unsupervised learning community. However, fitting the model via the EM algorithm requires significant processing time, mainly because many multivariate t-cdfs (cumulative distribution functions) must be calculated in the E-step. In this article, we provide an approximate but fast method for calculating them in a univariate fashion, as the product of successively conditional univariate t-cdfs with a first-order Taylor approximation. By replacing all multivariate t-cdfs in the E-step with the proposed approximate versions, we obtain admissible results of fitting the model, with an 85% reduction in processing time for the 5-dimensional skewness case of the Australian Institute of Sport data set. Rough properties, advantages, and limits of this approach are also discussed.

Application of sequence to sequence learning based LSTM model (LSTM-s2s) for forecasting dam inflow (Sequence to Sequence based LSTM (LSTM-s2s)모형을 이용한 댐유입량 예측에 대한 연구)

  • Han, Heechan;Choi, Changhyun;Jung, Jaewon;Kim, Hung Soo
    • Journal of Korea Water Resources Association
    • /
    • v.54 no.3
    • /
    • pp.157-166
    • /
    • 2021
  • Highly reliable dam inflow forecasting is required for efficient dam operation. In this study, a deep learning technique, which is a data-driven method used in many fields of research, was applied to predict dam inflow. The Long Short-Term Memory deep learning model with a Sequence-to-Sequence structure (LSTM-s2s), which performs well in predicting time-series data, was applied to forecast the inflow of the Soyang River dam. Several statistical evaluation metrics, including the correlation coefficient (CC), Nash-Sutcliffe efficiency coefficient (NSE), percent bias (PBIAS), and error in peak value (PE), were used to evaluate the predictive performance of the model. The results showed that the LSTM-s2s model predicted dam inflow with high accuracy and also performed well in event-based runoff prediction. The deep learning based approach could therefore be used for efficient dam operation in water resource management during wet and dry seasons.
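The four evaluation metrics named in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names and the sample inflow values are made up, and the sign conventions for PBIAS and PE are assumptions, since definitions vary between papers.

```python
import numpy as np

def correlation_coefficient(obs, sim):
    """Pearson correlation between observed and simulated series."""
    return float(np.corrcoef(obs, sim)[0, 1])

def nash_sutcliffe(obs, sim):
    """NSE = 1 - SSE / variance of observations; 1 is a perfect fit."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def percent_bias(obs, sim):
    """PBIAS in percent. Assumed convention: positive means underestimation."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(100.0 * np.sum(obs - sim) / np.sum(obs))

def peak_error(obs, sim):
    """Relative error in the peak value, in percent (assumed definition)."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(100.0 * (sim.max() - obs.max()) / obs.max())

# Hypothetical inflow series; the simulation underestimates by 10% everywhere.
observed = np.array([10.0, 20.0, 50.0, 30.0, 15.0])
simulated = 0.9 * observed

print("CC   :", correlation_coefficient(observed, simulated))
print("NSE  :", nash_sutcliffe(observed, simulated))
print("PBIAS:", percent_bias(observed, simulated))
print("PE   :", peak_error(observed, simulated))
```

A simulation that always returns the observed mean scores an NSE of 0, which is why NSE is often read as skill relative to that baseline.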

The Binomial Sensitivity Factor Hyper-Geometric Distribution Software Reliability Growth Model for Imperfect Debugging Environment (불완전 디버깅 환경에서의 이항 반응 계수 초기하분포 소프트웨어 신뢰성 성장 모델)

  • Kim, Seong-Hui;Park, Jung-Yang;Park, Jae-Heung
    • The Transactions of the Korea Information Processing Society
    • /
    • v.7 no.4
    • /
    • pp.1103-1111
    • /
    • 2000
  • The hyper-geometric distribution software reliability growth model (HGDM) usually assumes that all detected software faults are perfectly removed without introducing new faults. However, since new faults can be introduced during the test-and-debug phase, the perfect debugging assumption should be relaxed. In this context, Hou, Kuo and Chang [7] developed a modified HGDM for an imperfect debugging environment, assuming that the learning factor is constant. In this paper we extend the existing imperfect debugging HGDM in two respects: introduction of a random sensitivity factor and allowance of a variable learning factor. The statistical characteristics of the suggested model are then studied, and its application to two real data sets is demonstrated.

Identifying Critical Factors for Successful Games by Applying Topic Modeling

  • Kwak, Mookyung;Park, Ji Su;Shon, Jin Gon
    • Journal of Information Processing Systems
    • /
    • v.18 no.1
    • /
    • pp.130-145
    • /
    • 2022
  • Games are widely used in many fields, but not all games are successful. What, then, makes games successful? This question motivated this paper, which aims to identify critical factors for successful games using a topic modeling technique. We assume that game reviews written by experts contain abundant insights into how games succeed. To extract these insights, latent Dirichlet allocation, a topic modeling technique, was used. This statistical approach provided words that imply the topics behind them. Fifty topics were inferred from these words, and the topics were categorized using the stimulation-response-desire-goal (SRDG) model, which gives a streamlined flow of how players engage in video games. This approach can provide game designers with critical factors for successful games. Furthermore, based on these results, we plan to develop a model of immersive game experiences to explain why some games are more addictive than others and how successful gamification works.

Performance Improvement of a Korean Prosodic Phrase Boundary Prediction Model using Efficient Feature Selection (효율적인 기계학습 자질 선별을 통한 한국어 운율구 경계 예측 모델의 성능 향상)

  • Kim, Min-Ho;Kwon, Hyuk-Chul
    • Journal of KIISE:Software and Applications
    • /
    • v.37 no.11
    • /
    • pp.837-844
    • /
    • 2010
  • Prediction of the prosodic phrase boundary is one of the most important natural language processing tasks. For natural prediction of the Korean prosodic phrase boundary, we propose a statistical approach incorporating efficient learning features. These new features reflect the factors that affect generation of the prosodic phrase boundary better than existing learning features do. Moreover, learning features extracted according to a hand-crafted prosodic phrase boundary prediction rule yield higher accuracy. We developed a statistical model for Korean prosodic phrase boundaries based on the proposed features. The model achieved 86.63% accuracy for three levels (major break, minor break, no break) and 81.14% accuracy for six levels (major break with falling tone/rising tone, minor break with falling tone/rising tone/middle tone, no break).

Study on the Reconstruction of Pressure Field in Sloshing Simulation Using Super-Resolution Convolutional Neural Network (심층학습 기반 초해상화 기법을 이용한 슬로싱 압력장 복원에 관한 연구)

  • Kim, Hyo Ju;Yang, Donghun;Park, Jung Yoon;Hwang, Myunggwon;Lee, Sang Bong
    • Journal of the Society of Naval Architects of Korea
    • /
    • v.59 no.2
    • /
    • pp.72-79
    • /
    • 2022
  • Deep-learning-based super-resolution (SR) methods were evaluated for reconstructing high-resolution pressure fields from low-resolution images taken from a coarse-grid simulation. In addition to the canonical SRCNN (super-resolution convolutional neural network) model, two modified SRCNN models, each adding an activation function (ReLU or Sigmoid) to the output layer, were considered in the present study. The high-resolution images obtained by the three models were qualitatively more vivid and reliable than those from bicubic interpolation, a conventional super-resolution method. A quantitative comparison of statistical similarity showed that the SRCNN model with the Sigmoid function achieved the best performance, with less dependency on the original resolution of the input images.

Learning fair prediction models with an imputed sensitive variable: Empirical studies

  • Kim, Yongdai;Jeong, Hwichang
    • Communications for Statistical Applications and Methods
    • /
    • v.29 no.2
    • /
    • pp.251-261
    • /
    • 2022
  • As AI has a wide range of influence on human social life, issues of transparency and ethics of AI are emerging. In particular, it is widely known that, because data can carry historical bias that conflicts with ethics or regulatory frameworks for fairness, AI models trained on such biased data can also impose bias or unfairness against a certain sensitive group (e.g., non-white, women). Demographic disparities due to AI, which refer to socially unacceptable bias whereby an AI model favors certain groups (e.g., white, men) over other groups (e.g., black, women), have been observed frequently in many applications of AI, and many recent studies have developed AI algorithms that remove or alleviate such demographic disparities in trained AI models. In this paper, we consider the problem of using the information in the sensitive variable for fair prediction when using the sensitive variable as an input variable is prohibited by laws or regulations to avoid unfairness. To reflect the information in the sensitive variable in prediction, we consider a two-stage procedure. First, the sensitive variable is fully included in the learning phase to obtain a prediction model that depends on the sensitive variable; then an imputed sensitive variable is used in the prediction phase. The aim of this paper is to evaluate this procedure by analyzing several benchmark datasets. We illustrate that using an imputed sensitive variable helps improve prediction accuracy without hampering the degree of fairness much.

Development of The Irregular Radial Pulse Detection Algorithm Based on Statistical Learning Model (통계적 학습 모형에 기반한 불규칙 맥파 검출 알고리즘 개발)

  • Bae, Jang-Han;Jang, Jun-Su;Ku, Boncho
    • Journal of Biomedical Engineering Research
    • /
    • v.41 no.5
    • /
    • pp.185-194
    • /
    • 2020
  • Arrhythmia is usually diagnosed from the electrocardiogram (ECG) signal; however, ECG is difficult to measure and requires expert help to analyze. In contrast, the radial pulse can be measured easily in daily life, and could be a suitable bio-signal for the recent contactless ("untact") paradigm and an extensible signal for pulse-pattern-based diagnosis in Korean medicine. In this study, we developed an irregular radial pulse detection algorithm based on a learning model and considered its applicability to arrhythmia screening. A total of 1432 pulse waves, including irregular pulse data, were used in the experiment. Three data sets were prepared with minimal preprocessing to avoid heuristic feature extraction. Elastic net logistic regression, random forest, and extreme gradient boosting were applied as classification algorithms to each data set, and irregular pulse detection performance was estimated using the area under the receiver operating characteristic curve based on 10-fold cross-validation. The extreme gradient boosting method outperformed the others, with classification accuracy reaching 99.7%. The results confirmed that the proposed algorithm could be used for arrhythmia screening. To build a fusion technology integrating western and Korean medicine, future research will need to address arrhythmia subtype classification from the perspective of Korean medicine.
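The evaluation scheme described in this abstract, the area under the ROC curve estimated with 10-fold cross-validation, can be sketched with NumPy alone. This is a hedged illustration: the helper names, the synthetic data, and the simple centroid scorer are all hypothetical stand-ins (the paper used elastic net logistic regression, random forest, and extreme gradient boosting), and pooling the out-of-fold scores into a single AUC is an assumption about how the folds are combined.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size))

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def centroid_score(X_train, y_train, X_test):
    """Hypothetical stand-in classifier: projection onto the class-mean difference."""
    direction = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ direction

def cross_validated_auc(X, y, fit_predict, k=10, seed=0):
    """Collect out-of-fold scores for every sample, then compute one pooled AUC."""
    scores = np.empty(len(y), dtype=float)
    for test_idx in kfold_indices(len(y), k, seed):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        scores[test_idx] = fit_predict(X[train_mask], y[train_mask], X[test_idx])
    return roc_auc(y, scores)

# Synthetic, well-separated two-class data (not the pulse-wave data from the paper).
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 4)) + 3.0 * y[:, None]

cv_auc = cross_validated_auc(X, y, centroid_score, k=10)
print("10-fold cross-validated AUC:", cv_auc)
```

Any scorer with the same `(X_train, y_train, X_test)` signature can be passed in place of `centroid_score`.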

Application of Statistical and Machine Learning Techniques for Habitat Potential Mapping of Siberian Roe Deer in South Korea

  • Lee, Saro;Rezaie, Fatemeh
    • Proceedings of the National Institute of Ecology of the Republic of Korea
    • /
    • v.2 no.1
    • /
    • pp.1-14
    • /
    • 2021
  • This study aimed to prepare Siberian roe deer habitat potential maps for South Korea using three geographic information system-based models: frequency ratio (FR) as a bivariate statistical approach, and convolutional neural network (CNN) and long short-term memory (LSTM) as machine learning algorithms. Field observations identified 741 locations as preferred roe deer habitat. The dataset was divided in a 70:30 proportion for model construction and validation. Using the FR model, a total of 10 influential factors were selected for the modelling process: altitude, valley depth, slope height, topographic position index (TPI), topographic wetness index (TWI), normalized difference water index, drainage density, road density, radar intensity, and morphological feature. The variable importance analysis showed that TPI, TWI, altitude, and valley depth have a higher impact on prediction. Furthermore, the area under the receiver operating characteristic (ROC) curve was used to assess the prediction accuracy of the three models. The results showed that all the models performed similarly, but the LSTM model had relatively higher prediction ability than the FR and CNN models, with accuracies of 76% and 73% in training and validation, respectively. The map obtained from the LSTM model was categorized into five classes of potential (very low, low, moderate, high, and very high) with proportions of 19.70%, 19.81%, 19.31%, 19.86%, and 21.31%, respectively. The resulting potential maps may be valuable for monitoring and preserving Siberian roe deer habitats.
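The frequency ratio (FR) approach mentioned in this abstract is commonly computed, for each class of a factor, as the share of habitat observations falling in that class divided by the share of the study area occupied by that class. The sketch below follows that common definition as an assumption; the function name and the toy raster are hypothetical and do not come from the paper.

```python
import numpy as np

def frequency_ratio(class_map, presence_mask, n_classes):
    """FR per class = (share of presence cells in class) / (share of all cells in class)."""
    classes = np.asarray(class_map).ravel()
    presence = np.asarray(presence_mask, dtype=bool).ravel()
    area = np.bincount(classes, minlength=n_classes).astype(float)
    occurrences = np.bincount(classes[presence], minlength=n_classes).astype(float)
    area_share = area / area.sum()
    occurrence_share = occurrences / occurrences.sum()
    # Classes that cover no area get FR = 0 instead of a division by zero.
    return np.divide(occurrence_share, area_share,
                     out=np.zeros(n_classes), where=area_share > 0)

# Toy example: 10 cells of one factor (say, altitude) binned into 3 classes,
# with presence recorded in 3 of the cells.
altitude_class = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
deer_present = np.zeros(10, dtype=bool)
deer_present[[4, 5, 8]] = True

fr = frequency_ratio(altitude_class, deer_present, n_classes=3)
print(fr)
```

Values above 1 mark classes where observations are over-represented relative to the area they cover; values below 1 mark under-represented classes.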

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • Jun, Sung-Hae;Park, Jung-Eun;Oh, Kyung-Whan
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.14 no.6
    • /
    • pp.789-792
    • /
    • 2004
  • In various fields such as web mining, bioinformatics, and statistical data analysis, missing values occur in many forms. These values make the training data sparse. Typically, missing values are replaced by predicted values such as the mean or mode. More advanced imputation methods, such as the conditional mean, tree methods, and Markov chain Monte Carlo algorithms, can also be used. However, the predictive accuracy of general imputation models decreases as the ratio of missing values in the training data increases. Moreover, the number of available imputations is limited as the missing ratio increases. To address this problem, we propose a preprocessing method for missing values based on statistical learning theory, namely Vapnik's support vector regression. The proposed method can be applied to sparse training data. We verified the performance of our model using data sets from the UCI machine learning repository.