• Title/Summary/Keyword: Subset selection


Feature Selection Using Submodular Approach for Financial Big Data

  • Attigeri, Girija;Manohara Pai, M.M.;Pai, Radhika M.
    • Journal of Information Processing Systems
    • /
    • v.15 no.6
    • /
    • pp.1306-1325
    • /
    • 2019
  • As the world moves towards digitization, data is generated from various sources at an ever faster rate. It is becoming humongous and is termed big data. The financial sector is one domain which needs to leverage this big data to identify financial risks, fraudulent activities, and so on. The design of predictive models for such financial big data is imperative for maintaining the health of the country's economy. Financial data has many features such as transaction history, repayment data, purchase data, investment data, and so on. The main problem in building a predictive algorithm is finding the right subset of representative features from which the predictive model can be constructed for a particular task. This paper proposes a correlation-based method using submodular optimization for selecting the optimum number of features and thereby reducing the dimensions of the data for faster and better prediction. The key proposition is that the optimal feature subset should contain features that correlate highly with the class label but do not correlate with each other within the subset. Experiments are conducted to understand the effect of the various subsets on different classification algorithms for loan data. The IBM Bluemix BigData platform is used for experimentation along with the Spark notebook. The results indicate that the proposed approach achieves considerable accuracy with optimal subsets in significantly less execution time. The algorithm is also compared with existing feature selection and extraction algorithms.
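
To make the correlation-based idea concrete, the following is a minimal sketch of greedy feature selection that rewards correlation with the class label and penalizes correlation with already-selected features. The scoring rule, function name, and use of Pearson correlation are illustrative assumptions, not the authors' implementation, which uses submodular optimization on Spark.

    import numpy as np

    def greedy_correlation_selection(X, y, k):
        """Greedily pick k features: high correlation with the label,
        low average correlation with features already selected (illustrative scoring)."""
        n_features = X.shape[1]
        # absolute correlation of each feature with the class label
        relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
        # pairwise absolute correlations between features
        redundancy = np.abs(np.corrcoef(X, rowvar=False))
        selected = []
        for _ in range(k):
            best_j, best_gain = None, -np.inf
            for j in range(n_features):
                if j in selected:
                    continue
                penalty = redundancy[j, selected].mean() if selected else 0.0
                gain = relevance[j] - penalty  # reward relevance, penalize redundancy
                if gain > best_gain:
                    best_j, best_gain = j, gain
            selected.append(best_j)
        return selected

    # Usage (hypothetical data): cols = greedy_correlation_selection(X_train, y_train, k=10)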

Support vector machines with optimal instance selection: An application to bankruptcy prediction

  • Ahn Hyun-Chul;Kim Kyoung-Jae;Han In-Goo
    • Proceedings of the Korea Intelligent Information System Society Conference
    • /
    • 2006.06a
    • /
    • pp.167-175
    • /
    • 2006
  • Building accurate corporate bankruptcy prediction models has been one of the most important research issues in finance. Recently, support vector machines (SVMs) have been popularly applied to bankruptcy prediction because of their many strong points. However, in order to use an SVM, a modeler should determine several factors by heuristics, which hinders obtaining accurate prediction results. As a result, some researchers have tried to optimize these factors, especially the feature subset and kernel parameters of the SVM. But there have been no studies that have attempted to determine an appropriate instance subset for the SVM, although doing so may improve performance by eliminating distorted cases. Thus, in this study, we propose the simultaneous optimization of instance selection and the parameters of the SVM's kernel function by using genetic algorithms (GAs). Experimental results show that our model outperforms not only the conventional SVM but also prior approaches for optimizing SVMs.
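
As a rough illustration of this kind of simultaneous optimization, the sketch below evolves a chromosome that encodes an instance mask plus log-scaled RBF-SVM parameters, scoring each candidate by cross-validation. The encoding, operators, and parameter ranges are assumptions for illustration; the paper's GA design may differ.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    def fitness(chrom, X, y):
        """Chromosome = binary instance mask (n bits) followed by log2(C) and log2(gamma)."""
        n = X.shape[0]
        mask = chrom[:n].astype(bool)
        if mask.sum() < 10:                      # guard against degenerate subsets
            return 0.0
        C, gamma = 2.0 ** chrom[n], 2.0 ** chrom[n + 1]
        try:
            return cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"),
                                   X[mask], y[mask], cv=3).mean()
        except ValueError:                       # e.g. a class vanished from the subset
            return 0.0

    def ga_optimize(X, y, pop_size=20, generations=30, mut_rate=0.02):
        n = X.shape[0]
        pop = np.hstack([rng.integers(0, 2, (pop_size, n)).astype(float),
                         rng.uniform(-5, 5, (pop_size, 2))])
        for _ in range(generations):
            scores = np.array([fitness(ind, X, y) for ind in pop])
            pop = pop[np.argsort(scores)[::-1]]          # best first
            elite = pop[: pop_size // 2]
            children = elite.copy()
            mates = elite[rng.permutation(len(elite))]
            swap = rng.random(children.shape) < 0.5      # uniform crossover
            children[swap] = mates[swap]
            flips = rng.random((len(children), n)) < mut_rate
            children[:, :n] = np.where(flips, 1 - children[:, :n], children[:, :n])
            children[:, n:] += rng.normal(0, 0.3, (len(children), 2))
            pop = np.vstack([elite, children])
        return pop[0]                                    # best evaluated chromosome

    # best = ga_optimize(X_train, y_train)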


A SELECTION PROCEDURE FOR GOOD LOGISTICS POPULATIONS

  • Singh, Parminder;Gill, A.N.
    • Journal of the Korean Statistical Society
    • /
    • v.32 no.3
    • /
    • pp.299-309
    • /
    • 2003
  • Let ${\pi}_1, \ldots, {\pi}_k$ ($k \geq 2$) be independent logistic populations such that the cumulative distribution function (cdf) of an observation from population ${\pi}_i$ is $$F_i(x) = \frac{1}{1 + \exp\{-\pi(x - \mu_i)/(\sigma\sqrt{3})\}}, \quad |x| < \infty,$$ where $\mu_i$ ($-\infty < \mu_i < \infty$) is the unknown location parameter and $\sigma^2$ is the known variance, $i = 1, \ldots, k$. Let $\mu_{[k]}$ be the largest of all the $\mu_i$'s; the population ${\pi}_i$ is defined to be 'good' if $\mu_i \geq \mu_{[k]} - \delta_1$, where $\delta_1 > 0$, $i = 1, \ldots, k$. A selection procedure based on the sample median is proposed to select a subset of the $k$ logistic populations which includes all the good populations with probability at least $P^{*}$ (a preassigned value). Simultaneous confidence intervals for the differences of the location parameters, which can be derived with the help of the proposed procedures, are discussed. If a population with location parameter $\mu_i < \mu_{[k]} - \delta_2$ ($\delta_2 > \delta_1$), $i = 1, \ldots, k$, is considered 'bad', a selection procedure is proposed so that the probability of either selecting a bad population or omitting a good population is at most $1 - P^{*}$.
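
For orientation, the median-based rule in such procedures has the general form below; the constant $d$ and its calibration to $P^{*}$ are given in the paper and are not reproduced here.

$$\text{Select } \pi_i \iff \tilde{X}_i \geq \max_{1 \leq j \leq k} \tilde{X}_j - d, \qquad d > 0,$$

where $\tilde{X}_i$ is the sample median from population $\pi_i$ and $d$ is chosen so that all good populations are included with probability at least $P^{*}$.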

Prediction model of hypercholesterolemia using body fat mass based on machine learning (머신러닝 기반 체지방 측정정보를 이용한 고콜레스테롤혈증 예측모델)

  • Lee, Bum Ju
    • The Journal of the Convergence on Culture Technology
    • /
    • v.5 no.4
    • /
    • pp.413-420
    • /
    • 2019
  • The purpose of the present study is to develop a model for predicting hypercholesterolemia using an integrated set of body fat mass variables based on machine learning techniques, going beyond the study of the association between body fat mass and hypercholesterolemia. For this study, a total of six models were created using two variable subset selection methods and machine learning algorithms based on the Korea National Health and Nutrition Examination Survey (KNHANES) data. Among the various body fat mass variables, we found that trunk fat mass was the best variable for predicting hypercholesterolemia. Furthermore, we obtained an area under the receiver operating characteristic curve of 0.739 and a Matthews correlation coefficient of 0.36 in the model using correlation-based feature subset selection and the naive Bayes algorithm. Our findings are expected to be used as important information in the field of disease prediction in large-scale screening and public health research.
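
A simplified sketch of such a pipeline is given below, using a plain univariate correlation filter as a stand-in for CFS (which scikit-learn does not implement) and Gaussian naive Bayes; the variable names and cutoff are illustrative, not the study's settings.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import roc_auc_score, matthews_corrcoef

    def correlation_filter(X, y, k=5):
        """Keep the k variables most correlated (in absolute value) with the binary label."""
        corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
        return np.argsort(corrs)[::-1][:k]

    # cols = correlation_filter(X_train, y_train, k=5)
    # model = GaussianNB().fit(X_train[:, cols], y_train)
    # proba = model.predict_proba(X_test[:, cols])[:, 1]
    # print(roc_auc_score(y_test, proba), matthews_corrcoef(y_test, (proba > 0.5).astype(int)))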

A study of methodology for identification models of cardiovascular diseases based on data mining (데이터마이닝을 이용한 심혈관질환 판별 모델 방법론 연구)

  • Lee, Bum Ju
    • The Journal of the Convergence on Culture Technology
    • /
    • v.8 no.4
    • /
    • pp.339-345
    • /
    • 2022
  • Cardiovascular disease is one of the leading causes of death in the world. The objectives of this study were to build various models using sociodemographic variables, based on three variable selection methods and seven machine learning algorithms, for the identification of hypertension and dyslipidemia, and to evaluate the predictive power of the models. In experiments based on the full variable set and the correlation-based feature subset selection method, our results showed that the performance of models using naive Bayes was better than that of models using other machine learning algorithms for both diseases. With the wrapper-based feature subset selection method, the performance of models using logistic regression was higher than that of models using other algorithms. Our findings may provide basic data for the public health and machine learning fields.
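
Wrapper-based selection of this kind can be sketched with scikit-learn's forward sequential selector scored by the classifier it wraps; the feature count, scoring metric, and pipeline layout below are illustrative choices, not the study's settings.

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Wrapper-style selection: forward search scored by the classifier itself.
    clf = LogisticRegression(max_iter=1000)
    selector = SequentialFeatureSelector(clf, n_features_to_select=5,
                                         direction="forward", scoring="roc_auc", cv=5)
    # pipeline = make_pipeline(StandardScaler(), selector, clf).fit(X_train, y_train)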

AutoFe-Sel: A Meta-learning based methodology for Recommending Feature Subset Selection Algorithms

  • Irfan Khan;Xianchao Zhang;Ramesh Kumar Ayyasam;Rahman Ali
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.7
    • /
    • pp.1773-1793
    • /
    • 2023
  • Automated machine learning, often referred to as "AutoML," is the process of automating the time-consuming and iterative procedures associated with building machine learning models. There have been significant contributions in this area across a number of different stages of accomplishing a data-mining task, including model selection, hyper-parameter optimization, and preprocessing method selection. Among them, preprocessing method selection is a relatively new and fast-growing research area. The current work is focused on the recommendation of preprocessing methods, i.e., feature subset selection (FSS) algorithms. One limitation of the existing studies on FSS algorithm recommendation is the use of a single learner for meta-modeling, which restricts the capabilities of the meta-model. Moreover, the meta-modeling in the existing studies is typically based on a single group of data characterization measures (DCMs). However, there are a number of complementary DCM groups, and combining them leverages their diversity, resulting in improved meta-modeling. This study aims to address these limitations by proposing an architecture for preprocessing method selection that uses ensemble learning for meta-modeling, namely AutoFE-Sel. To evaluate the proposed method, we performed an extensive experimental evaluation involving 8 FSS algorithms, 3 groups of DCMs, and 125 datasets. Results show that the proposed method achieves better performance compared to three baseline methods. The proposed architecture can also be easily extended to other preprocessing method selection tasks, e.g., noise-filter selection and imbalance handling method selection.
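
In outline, meta-learning based recommendation of this kind characterizes each historical dataset with DCMs, labels it with its best-performing FSS algorithm, and trains a meta-model to map the DCMs of a new dataset to a recommendation. The DCMs and the random forest stand-in for the ensemble meta-learner below are illustrative assumptions, not AutoFE-Sel's actual components.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def data_characterization(X, y):
        """A few simple data characterization measures; real systems use much richer DCM sets."""
        return np.array([X.shape[0], X.shape[1], len(np.unique(y)),
                         np.mean(np.std(X, axis=0)),
                         np.mean(np.abs(np.corrcoef(X, rowvar=False)))])

    # meta_X: one DCM vector per historical dataset
    # meta_y: index of the FSS algorithm that performed best on that dataset
    # meta_model = RandomForestClassifier(n_estimators=200).fit(meta_X, meta_y)
    # recommended = meta_model.predict(data_characterization(X_new, y_new).reshape(1, -1))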

Feature Selection for Classification of Mass Spectrometric Proteomic Data Using Random Forest (단백체 스펙트럼 데이터의 분류를 위한 랜덤 포리스트 기반 특성 선택 알고리즘)

  • Ohn, Syng-Yup;Chi, Seung-Do;Han, Mi-Young
    • Journal of the Korea Society for Simulation
    • /
    • v.22 no.4
    • /
    • pp.139-147
    • /
    • 2013
  • This paper proposes a novel method for feature selection for mass spectrometric proteomic data based on Random Forest. The method includes an effective preprocessing step to filter out a large number of redundant, highly correlated features, and applies a tournament strategy to obtain an optimal feature subset. Experiments on three public datasets, Ovarian 4-3-02, Ovarian 7-8-02, and Prostate, show that the new method achieves high performance compared with widely used methods, with a balanced rate of specificity and sensitivity.
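
A minimal sketch of the two stages (a correlation pre-filter followed by a tournament among candidate subsets scored with a random forest) is shown below; the threshold, subset sizes, and elimination rule are illustrative assumptions, not the paper's exact procedure.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    def prefilter(X, threshold=0.95):
        """Drop one of each pair of highly correlated (redundant) features."""
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(X.shape[1]):
            if all(corr[j, i] < threshold for i in keep):
                keep.append(j)
        return keep

    def tournament(X, y, candidates):
        """Repeatedly keep the better-scoring half of the candidate feature subsets."""
        while len(candidates) > 1:
            scores = [cross_val_score(RandomForestClassifier(n_estimators=100),
                                      X[:, list(s)], y, cv=3).mean() for s in candidates]
            order = np.argsort(scores)[::-1]
            candidates = [candidates[i] for i in order[: max(1, len(candidates) // 2)]]
        return candidates[0]

    # kept = prefilter(X_train)
    # pools = [rng.choice(kept, size=20, replace=False) for _ in range(8)]
    # best_subset = tournament(X_train, y_train, pools)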

Operating characteristics of a subset selection procedure for selecting the best normal population with common unknown variance (최고의 정규 모집단을 뽑기 위한 부분집합선택절차론의 운용특성에 관한 연구)

  • ;Shanti S. Gupta
    • The Korean Journal of Applied Statistics
    • /
    • v.3 no.1
    • /
    • pp.59-78
    • /
    • 1990
  • The subset selection approach introduced by Gupta plays an important role in multiple decision procedures. For the normal means problem with common unknown variance, some operating characteristics of the selection procedure have been investigated via Monte Carlo simulation. Some properties of the selection procedure, including its efficiency, are also examined when the data are contaminated.
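
For reference, Gupta's rule for this problem has the form below; the constant $d$ is tabulated so that the probability of including the best population is at least $P^{*}$, with the exact conditions as stated in the paper.

$$\text{Select } \pi_i \iff \bar{X}_i \geq \max_{1 \leq j \leq k} \bar{X}_j - d\,\frac{s}{\sqrt{n}},$$

where $\bar{X}_i$ is the sample mean of population $\pi_i$, $n$ is the common sample size, and $s^2$ is the pooled estimate of the common unknown variance.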


Selection of Color Samples based on Genetic Algorithm for Color Correction (유전알고리즘을 이용한 색 보정용 색 샘플 결정)

  • 이규헌;김춘우
    • Journal of the Korean Institute of Telematics and Electronics S
    • /
    • v.34S no.1
    • /
    • pp.94-104
    • /
    • 1997
  • Most color imaging devices often exhibit color distortions due to differences in realizable color gamuts and the nonlinear characteristics of their components. In order to minimize color differences, it is desirable to apply color correction techniques. The first step of color correction is to select a subset of the color coordinates representing the input color space. The selected subset serves as so-called color samples to model the color distortion of a given color imaging device. The effectiveness of color correction is determined by the color samples utilized in the modeling as well as by the applied color correction technique. This paper presents a new selection method for color samples based on a genetic algorithm. In the proposed method, the structure of the strings is designed so that the selected color samples fully represent the characteristics of the color imaging device and consist of distinct color coordinates. To evaluate the performance of the selected color samples, they are utilized in three different color correction experiments. The experimental results are compared with the corresponding results obtained with equally spaced color samples.
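
As a rough sketch of a GA of this kind, the chromosome below is a set of distinct palette indices, and the fitness fits a simple linear correction on the chosen samples and scores it over the whole palette. The linear-correction fitness and the operators are illustrative assumptions; the paper's string design and color correction models differ in detail.

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(sample_idx, device_response, palette):
        """Fit a 3x3 linear correction on the chosen samples, score it on the full palette."""
        A, _, _, _ = np.linalg.lstsq(device_response[sample_idx], palette[sample_idx], rcond=None)
        error = np.linalg.norm(device_response @ A - palette, axis=1)
        return -error.mean()                              # higher is better

    def ga_select(device_response, palette, n_samples=24, pop_size=30, generations=50):
        n = len(palette)
        pop = [rng.choice(n, size=n_samples, replace=False) for _ in range(pop_size)]
        for _ in range(generations):
            scores = [fitness(ind, device_response, palette) for ind in pop]
            order = np.argsort(scores)[::-1]
            pop = [pop[i] for i in order[: pop_size // 2]]    # keep the fitter half
            children = []
            while len(children) < pop_size // 2:
                a, b = pop[rng.integers(len(pop))], pop[rng.integers(len(pop))]
                union = np.unique(np.concatenate([a, b]))     # recombine, keep indices distinct
                children.append(rng.choice(union, size=n_samples, replace=False))
            pop = pop + children
        return pop[0]

    # best_samples = ga_select(device_rgb, reference_lab)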
