• Title/Summary/Keyword: feature subset selection

Optimization of Support Vector Machines for Financial Forecasting (재무예측을 위한 Support Vector Machine의 최적화)

  • Kim, Kyoung-Jae;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems / v.17 no.4 / pp.241-254 / 2011
  • Financial time-series forecasting is an important problem because it is essential to the risk management of financial institutions. Researchers have therefore tried to forecast financial time series using various data mining techniques such as regression, artificial neural networks, decision trees, and k-nearest neighbor. Recently, support vector machines (SVMs) have become popular in this research area because they do not require large amounts of training data and have a low risk of overfitting. However, a user must determine several design factors heuristically in order to use an SVM. For example, the choice of an appropriate kernel function and its parameters and proper feature subset selection are major design factors. Beyond these, proper selection of an instance subset may also improve the forecasting performance of an SVM by eliminating irrelevant and distorting training instances. Nonetheless, few studies have applied instance selection to SVMs, especially in the domain of stock market prediction. Instance selection tries to choose proper instance subsets from the original training data. It may be regarded as a method of knowledge refinement that maintains the instance base. This study proposes a novel instance selection algorithm for SVMs. The proposed technique uses a genetic algorithm (GA) to optimize instance selection and the kernel parameters simultaneously. We call the model ISVM (SVM with Instance selection). Experiments on stock market data are carried out using ISVM. In this study, the GA searches for optimal or near-optimal values of the kernel parameters and the relevant instances for the SVM, so each chromosome carries two sets of codes: one for the kernel parameters and one for instance selection.
For the GA search, the population size is set to 50 organisms, the crossover rate to 0.7, and the mutation rate to 0.1. As the stopping condition, 50 generations are permitted. The application data consist of technical indicators and the direction of change in the daily Korea stock price index (KOSPI). The total number of samples is 2218 trading days. We split the data into three subsets: training, test, and hold-out. The subsets contain 1056, 581, and 581 samples, respectively. This study compares ISVM with several comparative models: logistic regression (Logit), backpropagation neural networks (ANN), nearest neighbor (1-NN), conventional SVM (SVM), and SVM with optimized parameters (PSVM). In particular, PSVM uses kernel parameters optimized by the genetic algorithm. The experimental results show that ISVM outperforms 1-NN by 15.32%, ANN by 6.89%, Logit and SVM by 5.34%, and PSVM by 4.82% on the hold-out data. ISVM uses only 556 of the 1056 original training instances to produce this result. In addition, the two-sample test for proportions is used to examine whether ISVM significantly outperforms the other models. The results indicate that ISVM outperforms ANN and 1-NN at the 1% statistical significance level, and Logit, SVM, and PSVM at the 5% level.
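
The two-part chromosome described above can be sketched as follows. This is only an illustration, not the authors' implementation: the bit widths, the parameter names `C` and `gamma`, and their search ranges are assumptions, since the abstract does not state the kernel or the encoding. Only the training-set size (1056), the crossover rate (0.7), and the mutation rate (0.1) come from the abstract.

```python
import random

N_TRAIN = 1056   # training instances reported in the abstract
C_BITS = 8       # assumed bit width for the penalty parameter C
GAMMA_BITS = 8   # assumed bit width for a kernel parameter gamma


def bits_to_float(bits, low, high):
    """Map a list of 0/1 bits linearly onto the interval [low, high]."""
    value = int("".join(map(str, bits)), 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)


def decode(chromosome):
    """Split a chromosome into kernel parameters and selected instance indices."""
    c_bits = chromosome[:C_BITS]
    g_bits = chromosome[C_BITS:C_BITS + GAMMA_BITS]
    mask = chromosome[C_BITS + GAMMA_BITS:]
    params = {
        "C": bits_to_float(c_bits, 1.0, 100.0),      # assumed search range
        "gamma": bits_to_float(g_bits, 0.001, 1.0),  # assumed search range
    }
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    return params, selected


def crossover(a, b, rate=0.7, rng=random):
    """Single-point crossover, applied with the given probability."""
    if rng.random() >= rate:
        return a[:], b[:]
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate=0.1, rng=random):
    """Flip each bit independently with the given probability."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]
```

In a full GA, the fitness of a chromosome would be the accuracy of an SVM trained with the decoded parameters on the selected instances only; that step is omitted here because it depends on the classifier library.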

Automatic pronunciation assessment of English produced by Korean learners using articulatory features (조음자질을 이용한 한국인 학습자의 영어 발화 자동 발음 평가)

  • Ryu, Hyuksu;Chung, Minhwa
    • Phonetics and Speech Sciences / v.8 no.4 / pp.103-113 / 2016
  • This paper proposes articulatory features as novel predictors for automatic pronunciation assessment of English produced by Korean learners. Based on distinctive feature theory, in which phonemes are represented as sets of articulatory/phonetic properties, we propose articulatory Goodness-Of-Pronunciation (aGOP) features defined in terms of the corresponding articulatory attributes, such as nasal, sonorant, and anterior. An English speech corpus spoken by Korean learners is used for assessment modeling. In our system, learners' speech is force-aligned and recognized using acoustic and pronunciation models derived from the WSJ corpus (native North American speech) and the CMU pronouncing dictionary, respectively. To compute aGOP features, articulatory models are trained for the corresponding articulatory attributes. In addition to the proposed features, various features divided into four categories (RATE, SEGMENT, SILENCE, and GOP) are used as a baseline. To enhance assessment modeling performance and investigate the weights of the salient features, relevant features are extracted using Best Subset Selection (BSS). The results show that the proposed model using aGOP features outperforms the baseline. In addition, analysis of the relevant features extracted by BSS reveals that the selected aGOP features represent the salient variations of Korean learners of English. The results are also expected to be useful for automatic pronunciation error detection.
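
Best subset selection, as named in the abstract above, evaluates every candidate feature subset and keeps the one with the best score. The sketch below is a generic illustration, not the paper's setup: the scoring function, the feature names `f1` to `f4`, and the toy gains are hypothetical placeholders. In practice the score would be a model-quality criterion such as adjusted R² or cross-validated accuracy.

```python
from itertools import combinations


def best_subset(features, score):
    """Exhaustively score every non-empty subset and return the best one."""
    best, best_score = None, float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            current = score(subset)
            if current > best_score:
                best, best_score = subset, current
    return best, best_score


# Hypothetical toy score: each feature adds a made-up gain,
# and every included feature costs a fixed complexity penalty.
TOY_GAIN = {"f1": 0.5, "f2": 0.1, "f3": 0.3, "f4": 0.05}


def toy_score(subset):
    return sum(TOY_GAIN[name] for name in subset) - 0.2 * len(subset)


subset, value = best_subset(list(TOY_GAIN), toy_score)
```

The search visits 2^n - 1 subsets for n features, so it is only practical for a modest number of candidates.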

k-Nearest Neighbor-Based Approach for the Estimation of Mutual Information (상호정보 추정을 위한 k-최근접이웃 기반방법)

  • Cha, Woon-Ock;Huh, Moon-Yul
    • Communications for Statistical Applications and Methods / v.15 no.6 / pp.977-991 / 2008
  • This study examines a k-nearest neighbor-based approach to the estimation of mutual information when the type of target variable is categorical and continuous. The results of Monte Carlo simulations and experiments with real-world data show that k=1 is preferable. In practical applications with real-world data, our study shows that jittering and bootstrapping are needed.

Statistical Analysis for Feature Subset Selection Procedures.

  • Kim, In-Young;Lee, Sun-Ho;Kim, Sang-Cheol;Rha, Sun-Young;Chung, Hyun-Cheol;Kim, Byung-Soo
    • Proceedings of the Korean Society for Bioinformatics Conference / 2003.10a / pp.101-106 / 2003
  • In this paper, we propose using Hotelling's $T^2$ statistic to detect a set of differentially expressed (DE) genes in colorectal cancer, based on gene expression levels in tumor tissues compared with those in normal tissues, and to evaluate its predictivity, which lets us rank genes for the development of biomarkers for population screening of colorectal cancer. We compared the prediction rates based on the DE genes selected by Hotelling's $T^2$ statistic and by the univariate t statistic using various prediction methods, including regularized discriminant analysis and a support vector machine. The results show that the prediction rate based on $T^2$ is better than that based on the univariate t. This implies that it may not be sufficient to look at each gene in isolation and that evaluating combinations of genes reveals interesting information that would not be discovered otherwise.
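
For reference, the standard two-sample form of Hotelling's $T^2$ is given below. The abstract does not say whether the tumor and normal samples were treated as independent or paired, so the independent-samples form is an assumption here. The symbols are introduced for illustration: $n_1, n_2$ are the group sizes, $\bar{x}_1, \bar{x}_2$ the mean expression vectors over a set of $p$ genes, and $S_1, S_2$ the sample covariance matrices.

$$
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^{\top} S_p^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}
$$

Under multivariate normality with equal covariances,

$$
\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\; T^2 \sim F_{p,\; n_1 + n_2 - p - 1}.
$$

For $p = 1$ the statistic reduces to the square of the pooled two-sample t statistic, which is why $T^2$ can be read as the multivariate counterpart of the univariate t used as the comparison.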

The Generation of Control Rules for Data Mining (데이터 마이닝을 위한 제어규칙의 생성)

  • Park, In-Kyoo
    • Journal of Digital Convergence / v.11 no.11 / pp.343-349 / 2013
  • Rough set theory derives optimal rules in data mining by selecting effective features from large amounts of redundant information, using the concepts of equivalence relations and approximation spaces. Attribute reduction is one of the most important parts of its applications. This paper defines an information-theoretic measure, based on rough entropy, for determining the most important attribute within the association of attributes. The proposed method generates an effective reduct set and formulates the core of the attribute set by eliminating redundant attributes. Subsequently, control rules are generated from a feature subset that retains the accuracy of the original features after the reduction.
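
Attribute reduction of this kind can be illustrated with a small sketch. Note the substitution: the code uses the classical rough-set dependency degree (the fraction of rows in the positive region) rather than the paper's rough-entropy measure, which the abstract does not define. The decision table and the attribute names `a`, `b`, `c`, `d` are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes that agree on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, attrs, decision):
    """Fraction of rows whose equivalence class has a single decision value."""
    consistent = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)


def greedy_reduct(rows, attrs, decision):
    """Drop attributes one at a time while the dependency degree is preserved."""
    target = dependency(rows, attrs, decision)
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if trial and dependency(rows, trial, decision) == target:
            kept = trial
    return kept


# Hypothetical decision table: b duplicates a, and c is unrelated to d.
TABLE = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": 0, "c": 1, "d": 0},
    {"a": 1, "b": 1, "c": 0, "d": 1},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]

reduct = greedy_reduct(TABLE, ["a", "b", "c"], "d")
```

The greedy pass returns one reduct that depends on the order in which attributes are tried; here either `a` or `b` alone would be enough to determine `d`.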

Feature Selection to Predict Very Short-term Heavy Rainfall Based on Differential Evolution (미분진화 기반의 초단기 호우예측을 위한 특징 선택)

  • Seo, Jae-Hyun;Lee, Yong Hee;Kim, Yong-Hyuk
    • Journal of the Korean Institute of Intelligent Systems / v.22 no.6 / pp.706-714 / 2012
  • The Korea Meteorological Administration provided weather records for the most recent four years for our very short-term heavy rainfall prediction. We divided the dataset into three parts: training, validation, and test sets. Through feature selection, we select only the important features among 72 features to avoid the exponential growth of the solution space with dimensionality. We used a differential evolution algorithm, with two classifiers serving as the fitness function of the evolutionary computation, to select a more accurate feature subset. One of the classifiers is a support vector machine (SVM), which shows high performance, and the other is k-nearest neighbor (k-NN), which is generally fast. In our experiments, SVM gave better test results than k-NN. We also preprocessed the weather data using undersampling and normalization. The test results of our differential evolution algorithm were about five times better than those using all features and about 1.36 times better than those using a genetic algorithm, the best known method. Running times with the genetic algorithm were about twenty times longer than with the differential evolution algorithm.
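
One common way to apply differential evolution to feature selection is to evolve real-valued vectors and threshold them into feature masks. The sketch below uses the DE/rand/1/bin scheme under that assumption; the abstract does not specify the variant, the encoding, or the control parameters, so `pop_size`, `generations`, `f`, `cr`, and the 0.5 threshold are illustrative. The toy fitness stands in for classifier accuracy on a validation set.

```python
import random


def to_mask(vector):
    """Threshold a real-valued vector into a boolean feature mask."""
    return [v > 0.5 for v in vector]


def de_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin over [0, 1]^n; fitness takes a boolean mask (higher is better)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    scores = [fitness(to_mask(v)) for v in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(n_features)  # at least one mutated gene
            trial = []
            for d in range(n_features):
                if rng.random() < cr or d == forced:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))
                else:
                    trial.append(pop[i][d])
            trial_score = fitness(to_mask(trial))
            if trial_score >= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=scores.__getitem__)
    return to_mask(pop[best]), scores[best]


# Hypothetical fitness: count positions that agree with a hidden relevant set.
RELEVANT = {0, 3, 5}


def toy_fitness(mask):
    return sum(1 for i, m in enumerate(mask) if m == (i in RELEVANT))


mask, score = de_feature_selection(8, toy_fitness)
```

With a real classifier, `fitness` would train on the masked training features and return validation accuracy, which is where nearly all of the running time goes.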

Short-Term Prediction of Vehicle Speed on Main City Roads using the k-Nearest Neighbor Algorithm (k-Nearest Neighbor 알고리즘을 이용한 도심 내 주요 도로 구간의 교통속도 단기 예측 방법)

  • Rasyidi, Mohammad Arif;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems / v.20 no.1 / pp.121-131 / 2014
  • Traffic speed is an important measure in transportation. It can be employed for various purposes, including traffic congestion detection, travel time estimation, and road design. Consequently, accurate speed prediction is essential in the development of intelligent transportation systems. In this paper, we present an analysis and speed prediction for a road section in Busan, South Korea. Previous works used only historical data of the target link for prediction. Here, we extract features from real traffic data by also considering the neighboring links. After obtaining the candidate features, linear regression, model tree, and k-nearest neighbor (k-NN) are employed for both feature selection and speed prediction. The experimental results show that k-NN outperforms model tree and linear regression on the given dataset. Compared with the other predictors, k-NN significantly reduces the error measures that we use, including mean absolute percentage error (MAPE) and root mean square error (RMSE).
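
The two error measures named above follow their standard definitions, sketched here in plain Python. The function names and the sample speeds are illustrative and do not come from the paper.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs."""
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


# Illustrative speeds in km/h.
observed = [10.0, 20.0]
forecast = [13.0, 16.0]
errors = (mape(observed, forecast), rmse(observed, forecast))
```

MAPE is scale-free but undefined when an observed value is zero, while RMSE keeps the unit of the speed data and penalizes large errors more heavily.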

A Comparative Experiment on Dimensional Reduction Methods Applicable for Dissimilarity-Based Classifications (비유사도-기반 분류를 위한 차원 축소방법의 비교 실험)

  • Kim, Sang-Woon
    • Journal of the Institute of Electronics and Information Engineers / v.53 no.3 / pp.59-66 / 2016
  • This paper presents an empirical evaluation of dimensionality reduction strategies by which dissimilarity-based classifications (DBC) can be implemented efficiently. In DBC, classification is not based on feature measurements of individual objects (a set of attributes), but rather on a suitable dissimilarity measure among the individual objects (pair-wise object comparisons). One problem of DBC is the high dimensionality of the dissimilarity space when a large number of objects is treated. To address this issue, two kinds of solutions have been proposed in the literature: prototype selection (PS)-based methods and dimension reduction (DR)-based methods. In this paper, instead of utilizing the PS-based or DR-based methods, a way of performing DBC in Eigen spaces (ES) is considered and empirically compared. In ES-based DBC, classification is performed as follows: first, a set of principal eigenvectors is extracted from the training data set using principal component analysis; second, an Eigen space is spanned by a selected subset of the extracted eigenvectors; third, after measuring distances among the projected objects in the Eigen space using $l_p$-norms as the dissimilarity, classification is performed. The experimental results, obtained using the nearest neighbor rule on artificial and real-life benchmark data sets, demonstrate that when the dimensionality of the Eigen space is selected appropriately, ES-based DBC can improve classification accuracy compared with the PS-based and DR-based methods.
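
The third step above, measuring $l_p$ distances between projected objects and classifying with the nearest neighbor rule, can be sketched as follows. The sketch assumes the objects have already been projected into the Eigen space (the PCA step is omitted), and the function names and sample points are illustrative.

```python
def lp_distance(u, v, p=2.0):
    """l_p distance between two equal-length vectors (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def nearest_neighbor_label(query, train, labels, p=2.0):
    """Return the label of the training vector closest to query under l_p."""
    best = min(range(len(train)), key=lambda i: lp_distance(query, train[i], p))
    return labels[best]


# Illustrative projected objects in a two-dimensional Eigen space.
projected = [[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]]
classes = ["A", "A", "B"]
predicted = nearest_neighbor_label([9.0, 8.0], projected, classes, p=1.0)
```

Changing `p` changes which neighbors count as close, which is why the choice of norm is treated as part of the dissimilarity measure.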

A Study on the Intelligent Online Judging System Using User-Based Collaborative Filtering

  • Hyun Woo Kim;Hye Jin Yun;Kwihoon Kim
    • Journal of the Korea Society of Computer and Information / v.29 no.1 / pp.273-285 / 2024
  • With the active utilization of Online Judge (OJ) systems in the field of education, various studies utilizing learner data have emerged. This research proposes a problem recommendation based on a user-based collaborative filtering approach with learner data to support learners in their problem selection. Assistance in learners' problem selection within the OJ system is crucial for enhancing the effectiveness of education as it impacts the learning path. To achieve this, this system identifies learners with similar problem-solving tendencies and utilizes their problem-solving history. The proposed technique has been implemented on an OJ site in the fields of algorithms and programming, operated by the Chungbuk Education Research and Information Institute. The technique's service utility and usability were assessed through expert reviews using the Delphi technique. Additionally, it was piloted with site users, and an analysis of the ratio of correctness revealed approximately a 16% higher submission rate for recommended problems compared to the overall submissions. A survey targeting users who used the recommended problems yielded a 78% response rate, with the majority indicating that the feature was helpful. However, low selection rates of recommended problems and low response rates within the subset of users who used recommended problems highlight the need for future research focusing on improving accessibility, enhancing user feedback collection, and diversifying learner data analysis.
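
User-based collaborative filtering of this kind can be sketched in a few lines: find the learners most similar to the target, then recommend problems they solved that the target has not. The abstract does not state the similarity measure or the neighborhood size, so Jaccard similarity over solved-problem sets, `k`, and `top_n` are assumptions, and the learner names and problem ids are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, solved, k=2, top_n=3):
    """Recommend unsolved problems, weighted by the similarity of the k nearest learners.

    solved maps each learner to the set of problem ids they have solved.
    """
    mine = solved[target]
    others = [u for u in solved if u != target]
    neighbors = sorted(others, key=lambda u: jaccard(mine, solved[u]),
                       reverse=True)[:k]
    votes = {}
    for u in neighbors:
        weight = jaccard(mine, solved[u])
        for problem in solved[u] - mine:
            votes[problem] = votes.get(problem, 0.0) + weight
    # Highest vote first; ties broken by problem id for a stable order.
    ranked = sorted(votes, key=lambda q: (-votes[q], q))
    return ranked[:top_n]


# Hypothetical solving histories.
HISTORY = {
    "ann": {1, 2, 3},
    "bob": {1, 2, 3, 4},
    "cho": {2, 3, 5},
    "dee": {9},
}

suggestions = recommend("ann", HISTORY)
```

Here `bob` and `cho` are the two nearest learners to `ann`, so problems 4 and 5 are suggested in that order.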

Application of Decision Tree for the Classification of Antimicrobial Peptide

  • Lee, Su Yeon;Kim, Sunkyu;Kim, Sukwon S.;Cha, Seon Jeong;Kwon, Young Keun;Moon, Byung-Ro;Lee, Byeong Jae
    • Genomics & Informatics / v.2 no.3 / pp.121-125 / 2004
  • The purpose of this study was to investigate the use of decision trees for the classification of antimicrobial peptides. The classification was based on the activities of known antimicrobial peptides against common microbes, including Escherichia coli and Staphylococcus aureus. Feature selection was employed to select an effective subset of features from the available attribute sets. Sequential application of decision trees with 17 nodes and 9 leaves and with 13 nodes and 7 leaves provided classification rates of $76.74\%$ and $74.66\%$ against E. coli and S. aureus, respectively. The angle subtended by the positively charged face and the positive charge commonly gave higher accuracies in both the E. coli and S. aureus datasets. In this study, we describe a successful application of decision trees that provides an understanding of the effects of the physicochemical characteristics of peptides on the bacterial membrane.