• Title/Summary/Keyword: data selection


Neural and MTS Algorithms for Feature Selection

  • Su, Chao-Ton;Li, Te-Sheng
    • International Journal of Quality Innovation
    • /
    • v.3 no.2
    • /
    • pp.113-131
    • /
    • 2002
  • The relationships among multi-dimensional data (such as medical examination data) with ambiguity and variation are difficult to explore. The traditional approach to building a data classification system requires the formulation of rules by which the input data can be analyzed, and formulating such rules is very difficult for large sets of input data. This paper first describes two classification approaches using a back-propagation (BP) neural network and a Mahalanobis distance (MD) classifier, and then proposes two approaches to multi-dimensional feature selection. The first is a feature selection procedure based on the trained BP neural network; its basic idea is to compare the products of the weights between the input and hidden layers and between the hidden and output layers, and, to simplify the structure, only weight products of large absolute value are used. The second approach is the Mahalanobis-Taguchi system (MTS) originally suggested by Dr. Taguchi. The MTS performs Taguchi's fractional factorial design with the Mahalanobis distance as the performance metric. We combine automatic thresholding with the MD classifier so that it can handle a reduced model, which is the focus of this paper. Two case studies are used as examples to compare and discuss the complete and reduced models employing the BP neural network and the MD classifier. The implementation results show that the proposed approaches are effective and powerful for classification.
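
As a rough illustration of the weight-product idea described above, the sketch below trains a small BP network with scikit-learn and scores each input feature by the summed absolute products of its input-to-hidden and hidden-to-output weights. The data set, network size, and cutoff of ten features are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Train a small back-propagation network, then score each input feature by the
# summed absolute products of its input-to-hidden and hidden-to-output weights.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
w_ih = net.coefs_[0]            # (n_features, n_hidden) input-to-hidden weights
w_ho = net.coefs_[1]            # (n_hidden, n_outputs) hidden-to-output weights

# Importance of feature i = sum over hidden units and outputs of |w_ih| * |w_ho|.
importance = (np.abs(w_ih) @ np.abs(w_ho)).sum(axis=1)

# Keep only the features with the largest weight products (cutoff of 10 is arbitrary).
selected = np.argsort(importance)[::-1][:10]
print("selected feature indices:", selected)
```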

A small review and further studies on the LASSO

  • Kwon, Sunghoon;Han, Sangmi;Lee, Sangin
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.1077-1088
    • /
    • 2013
  • High-dimensional data analysis arises in almost all scientific areas and has evolved with the development of computing technology, encouraging penalized estimation methods that play important roles in statistical learning. Over the past years, various penalized estimators have been developed, and the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996) has shown outstanding performance, earning a leading place in the development of penalized estimation. In this paper, we first introduce a number of recent advances in high-dimensional data analysis using the LASSO. The topics include statistical problems such as variable selection and grouped or structured variable selection under sparse high-dimensional linear regression models. Several unsupervised learning methods, including inverse covariance matrix estimation, are presented. In addition, we address further studies of new applications, which may provide a guideline on how to use the LASSO for the statistical challenges of high-dimensional data analysis.
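
For readers new to the method, here is a minimal sketch of LASSO-based variable selection on a sparse high-dimensional regression problem, using scikit-learn's LassoCV. The simulated data and cross-validation setup are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Sparse high-dimensional linear regression: p > n with few informative variables.
X, y = make_regression(n_samples=100, n_features=500, n_informative=10,
                       noise=5.0, random_state=0)

# LassoCV chooses the penalty level by cross-validation over a path of lambda
# values; coefficients shrunk exactly to zero are deselected.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"lambda = {lasso.alpha_:.4f}, selected {selected.size} of {X.shape[1]} variables")
```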

Evaluating Variable Selection Techniques for Multivariate Linear Regression (다중선형회귀모형에서의 변수선택기법 평가)

  • Ryu, Nahyeon;Kim, Hyungseok;Kang, Pilsung
    • Journal of Korean Institute of Industrial Engineers
    • /
    • v.42 no.5
    • /
    • pp.314-326
    • /
    • 2016
  • The purpose of variable selection techniques is to select a subset of relevant variables for a particular learning algorithm in order to improve the accuracy and efficiency of the prediction model. We conduct an empirical analysis to evaluate and compare seven well-known variable selection techniques for multiple linear regression, one of the most commonly used regression models in practice. The variable selection techniques we apply are forward selection, backward elimination, stepwise selection, the genetic algorithm (GA), ridge regression, the lasso (least absolute shrinkage and selection operator), and the elastic net. Based on experiments with 49 regression data sets, we find that the GA yields the lowest error rates, while the lasso reduces the number of variables most significantly. In terms of computational efficiency, forward selection, backward elimination, and the lasso require less time than the other techniques.
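
A minimal sketch of three of the compared techniques (forward selection, backward elimination, and the lasso) using scikit-learn; the data set, the fixed subset size of five, and the cross-validation settings are illustrative assumptions and do not reproduce the paper's 49-data-set experiment.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, LassoCV

X, y = load_diabetes(return_X_y=True)

# Forward selection wrapped around ordinary least squares.
forward = SequentialFeatureSelector(LinearRegression(), direction="forward",
                                    n_features_to_select=5, cv=5).fit(X, y)

# Backward elimination with the same base model.
backward = SequentialFeatureSelector(LinearRegression(), direction="backward",
                                     n_features_to_select=5, cv=5).fit(X, y)

# The lasso performs embedded selection by shrinking coefficients exactly to zero.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("forward :", np.flatnonzero(forward.get_support()))
print("backward:", np.flatnonzero(backward.get_support()))
print("lasso   :", np.flatnonzero(lasso.coef_))
```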

Modeling of Positive Selection for the Development of a Computer Immune System and a Self-Recognition Algorithm

  • Sim, Kwee-Bo;Lee, Dong-Wook
    • International Journal of Control, Automation, and Systems
    • /
    • v.1 no.4
    • /
    • pp.453-458
    • /
    • 2003
  • The anomaly-detection algorithm based on the negative selection of T cells is a representative self-recognition model and has been applied to computer immune systems in recent years. In immune systems, T cells are produced through both positive and negative selection. Positive selection is the process that determines an MHC receptor that recognizes self-molecules; negative selection is the process that determines an antigen receptor that recognizes an antigen, i.e., a nonself cell. In this paper, we propose a novel self-recognition algorithm based on the positive selection of T cells. We demonstrate the effectiveness of the proposed algorithm through change-detection simulations on infected data obtained from cell changes and string changes in the self-file, and we compare the self-recognition algorithm based on positive selection with the anomaly-detection algorithm.
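
A toy sketch of the positive-selection idea: detectors are windows drawn from the self-file, and any window of a monitored string that no detector recognizes is flagged as a change. The window length, example strings, and exact-match rule are illustrative assumptions, not the authors' encoding.

```python
# Change detection via positive selection: detectors are r-length windows taken
# from the "self" string, and a window of the monitored string matched by no
# self detector is flagged as a change.

def build_self_detectors(self_data: str, r: int) -> set[str]:
    """Collect every r-length substring of the self data as a detector."""
    return {self_data[i:i + r] for i in range(len(self_data) - r + 1)}

def detect_changes(monitored: str, detectors: set[str], r: int) -> list[int]:
    """Return positions whose r-length window is not recognized as self."""
    return [i for i in range(len(monitored) - r + 1)
            if monitored[i:i + r] not in detectors]

self_file = "ABCD" * 8                         # contents of the protected self-file
infected = "ABCD" * 3 + "XYCD" + "ABCD" * 4    # a few cells altered

detectors = build_self_detectors(self_file, r=4)
print("changed windows start at:", detect_changes(infected, detectors, r=4))
```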

Approaching the Negative Super-SBM Model to Partner Selection of Vietnamese Securities Companies

  • NGUYEN, Xuan Huynh;NGUYEN, Thi Kim Lien
    • The Journal of Asian Finance, Economics and Business
    • /
    • v.8 no.3
    • /
    • pp.527-538
    • /
    • 2021
  • The purpose of the study is to determine the efficiency, position, and partner selection of securities companies via the negative super-SBM model used in data envelopment analysis (DEA). The model utilizes a variety of inputs (current assets, non-current assets, fixed assets, liabilities, owner's equity, and charter capital) and outputs (net revenue, gross profit, operating profit, and net profit after tax) collected from the financial reports (Vietstock, 2020) of 32 securities companies operating from 2016 to 2019; negative data are included as well. The empirical results identify both efficient and inefficient terms and then determine the position of each securities firm in every term. The overall scores reveal large performance changes, with a maximum score reaching 20.791. In the next stage, alliances among inefficient companies were formed based on the 2019 scores to find optimal partners for them. The results indicate that AAS is the best partner selection, since its partners achieve good results after forming an alliance, as with FTS (11.04469). Partner selection is thus a helpful solution for inefficient securities companies seeking to improve their future efficiency scores.
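
For orientation, the sketch below computes plain input-oriented CCR efficiency scores with a linear program. This is a deliberately simpler model than the negative super-SBM used in the paper (it handles neither negative data nor super-efficiency), and the input/output figures are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical inputs (e.g., assets, equity) and outputs (e.g., revenue, profit)
# for five firms; one row per decision-making unit (DMU).
X = np.array([[4.0, 2.0], [6.0, 3.0], [5.0, 5.0], [8.0, 4.0], [7.0, 6.0]])  # inputs
Y = np.array([[3.0, 1.0], [5.0, 2.0], [4.0, 3.0], [6.0, 2.0], [6.0, 4.0]])  # outputs

def ccr_efficiency(X, Y, o):
    """Input-oriented CCR efficiency of DMU o: minimize theta such that a
    composite peer uses at most theta * inputs of DMU o while producing at
    least its outputs."""
    n = X.shape[0]
    c = np.r_[1.0, np.zeros(n)]                  # decision vars: [theta, lambda_1..lambda_n]
    A_in = np.c_[-X[o], X.T]                     # sum_j lambda_j * x_ij <= theta * x_io
    A_out = np.c_[np.zeros(Y.shape[1]), -Y.T]    # sum_j lambda_j * y_rj >= y_ro
    A_ub = np.r_[A_in, A_out]
    b_ub = np.r_[np.zeros(X.shape[1]), -Y[o]]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (n + 1))
    return res.fun

for o in range(X.shape[0]):
    print(f"DMU {o}: efficiency = {ccr_efficiency(X, Y, o):.3f}")
```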

Data Mining using Instance Selection in Artificial Neural Networks for Bankruptcy Prediction (기업부도예측을 위한 인공신경망 모형에서의 사례선택기법에 의한 데이터 마이닝)

  • Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems
    • /
    • v.10 no.1
    • /
    • pp.109-123
    • /
    • 2004
  • Corporate financial distress and bankruptcy prediction is one of the major application areas of artificial neural networks (ANNs) in finance and management. ANNs have shown high prediction performance in this area, but are sometimes confronted with inconsistent and unpredictable performance on noisy data. In addition, when the amount of data is very large, it may not be possible to train an ANN, or training cannot be carried out effectively, without data reduction, because training on a large data set requires much processing time and additional data-collection costs. Instance selection is one of the most popular methods for dimensionality reduction and is directly related to data reduction. Although some researchers have addressed the need for instance selection in instance-based learning algorithms, there is little research on instance selection for ANNs. This study proposes a genetic algorithm (GA) approach to instance selection in ANNs for bankruptcy prediction. We use an ANN supported by the GA to optimize the connection weights between layers and to select relevant instances. The globally evolved weights are expected to mitigate the well-known limitations of the gradient descent used in backpropagation, while the genetically selected instances shorten the learning time and enhance prediction performance. The proposed model is compared with other major data mining techniques, and the experimental results show that the GA approach is a promising method for instance selection in ANNs.
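
A rough sketch of GA-based instance selection: each chromosome is a binary mask over the training instances and fitness is the validation accuracy of a small network trained only on the selected instances. Unlike the paper's model, the network weights here are still fitted by backpropagation rather than evolved, and the population size, rates, and simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
n_inst = len(X_tr)

def fitness(mask):
    """Validation accuracy of a small network trained on the selected instances."""
    if mask.sum() < 10:                                  # guard against near-empty selections
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0)
    clf.fit(X_tr[mask.astype(bool)], y_tr[mask.astype(bool)])
    return clf.score(X_val, y_val)

pop = rng.integers(0, 2, size=(12, n_inst))              # random initial population of masks
for generation in range(8):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:6]]          # keep the fitter half
    cuts = rng.integers(1, n_inst, size=6)
    children = np.array([np.r_[parents[i % 6][:c], parents[(i + 1) % 6][c:]]
                         for i, c in enumerate(cuts)])   # one-point crossover
    flips = rng.random(children.shape) < 0.01            # 1% bit-flip mutation
    pop = np.vstack([parents, np.where(flips, 1 - children, children)])

best = max(pop, key=fitness)
print(f"selected {int(best.sum())} of {n_inst} training instances")
```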


An Efficient Search Space Generation Technique for Optimal Materialized Views Selection in Data Warehouse Environment (데이타 웨어하우스 환경에서 최적 실체뷰 구성을 위한 효율적인 탐색공간 생성 기법)

  • Lee Tae-Hee;Chang Jae-young;Lee Sang-goo
    • Journal of KIISE:Databases
    • /
    • v.31 no.6
    • /
    • pp.585-595
    • /
    • 2004
  • Query processing is a critical issue in data warehouse environments, since queries on data warehouses often involve hundreds of complex operations over large volumes of data. Data warehouses therefore build a large number of materialized views to increase system performance. Which views to materialize is an important factor in both the view maintenance cost and the query performance. The goal of the materialized view selection problem is to select an optimal set of views that minimizes the total query response time as well as the view maintenance cost. In this paper, we present an efficient solution to the materialized view selection problem. Although the optimal selection of materialized views is an NP-hard problem, we develop a feasible solution by utilizing the characteristics of relational operators such as join, selection, and grouping.
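
As a point of reference for the view selection problem, the sketch below runs a generic greedy benefit heuristic (in the spirit of the classic HRU algorithm) over a small data-cube lattice. It is not the search-space generation technique proposed in the paper, and the lattice, row counts, and number of views to materialize are invented.

```python
# Greedy materialized-view selection: repeatedly materialize the view whose
# estimated benefit (total reduction in query cost over the views it can
# answer) is largest.

lattice = {                      # view -> set of views it can answer
    "ABC": {"ABC", "AB", "AC", "BC", "A", "B", "C"},
    "AB": {"AB", "A", "B"}, "AC": {"AC", "A", "C"}, "BC": {"BC", "B", "C"},
    "A": {"A"}, "B": {"B"}, "C": {"C"},
}
rows = {"ABC": 6000, "AB": 600, "AC": 600, "BC": 2000, "A": 50, "B": 30, "C": 100}

def greedy_select(k):
    materialized = {"ABC"}                       # the base cuboid is always stored
    cost = {v: rows["ABC"] for v in lattice}     # initially every query scans the base view
    for _ in range(k):
        def benefit(v):
            return sum(max(cost[w] - rows[v], 0) for w in lattice[v])
        best = max((v for v in lattice if v not in materialized), key=benefit)
        gain = benefit(best)
        materialized.add(best)
        for w in lattice[best]:                  # answering w from best is now cheaper
            cost[w] = min(cost[w], rows[best])
        print(f"materialize {best:>3}  (benefit {gain})")
    return materialized

print("chosen views:", greedy_select(k=3))
```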

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management
    • /
    • v.47 no.3
    • /
    • pp.537-552
    • /
    • 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps. (1) The first step defines the goodness-to-pick measure; in this paper, a categorical variable is relevant if it has relationships with the other variables, and according to this definition the goodness-to-pick measure calculates the normalized conditional entropy with the other variables. (2) The second step finds the relevant feature subset from the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results show that the proposed feature selection method generally yields better classification performance than no feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values increases, the accuracy gap between the proposed method and the compared existing methods widens. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data, and it is therefore promising for effective use in high-dimensional settings.
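
A simplified sketch of an association-based relevance score for categorical variables using normalized conditional entropy, in the spirit of step (1) above. The exact goodness-to-pick measure, thresholds, and redundancy-elimination step of the paper are not reproduced, and the simulated data are illustrative.

```python
import numpy as np
import pandas as pd

# For each variable X, average the normalized conditional entropy H(X | Y) / H(X)
# over the other variables Y. Values near 1 mean the other variables say nothing
# about X, so X looks irrelevant; strongly associated variables score lower.

def entropy(s: pd.Series) -> float:
    p = s.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x: pd.Series, y: pd.Series) -> float:
    # H(X | Y) = sum over values v of Y of P(Y=v) * H(X | Y=v)
    return float(sum(p_v * entropy(x[y == v])
                     for v, p_v in y.value_counts(normalize=True).items()))

def relevance_scores(df: pd.DataFrame) -> pd.Series:
    scores = {}
    for col in df.columns:
        others = [c for c in df.columns if c != col]
        h = entropy(df[col]) or 1.0          # avoid division by zero for constant columns
        scores[col] = np.mean([conditional_entropy(df[col], df[c]) / h for c in others])
    return pd.Series(scores).sort_values()

rng = np.random.default_rng(0)
a = rng.choice(list("xyz"), 300)
df = pd.DataFrame({
    "A": a,
    "B": np.where(rng.random(300) < 0.8, a, rng.choice(list("xyz"), 300)),  # associated with A
    "noise": rng.choice(list("pq"), 300),                                   # unrelated
})
print(relevance_scores(df))   # associated variables get lower (more relevant) scores
```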

A study on the Predictors of criteria on Clothing Selection (의복선택기준 예측변인 연구)

  • Shin, Jeong-Won;Park, Eun-Joo
    • Journal of the Korean Society of Costume
    • /
    • v.13
    • /
    • pp.123-134
    • /
    • 1989
  • The purpose of this study was to identify variables that predict the criteria used in clothing selection. Relationships among clothing selection criteria, psychological variables, lifestyle variables, and demographic variables were tested with Pearson's correlation coefficients and one-way ANOVA. The predictors of clothing selection criteria were identified by regression analysis. Consumers were then classified into several benefit segments by their clothing selection criteria, and the characteristics of each segment were identified by multiple discriminant analysis. Data were obtained from 593 women living in Pusan through self-administered questionnaires. The results of the study were as follows. 1. Relationship between clothing selection criteria and related variables: 1) The variables most strongly related to clothing selection criteria were "down-to-earth vs. sophisticated", "traditional vs. modern", "conventional vs. different", "conscientious vs. expedient", need for exhibitionism, need for sex, and fashion/appearance. 2) The most important clothing selection criterion was comfort, and it differed significantly across age groups. 3) Respondents with higher socio-economic status made more appearance-oriented selections. 2. Predictors of clothing selection criteria: Several important predictors were found among the lifestyle, need, and self-image variables; in particular, fashion/appearance among the lifestyle variables was very important. 3. Segmentation by clothing selection criteria: Four groups were identified, namely a practicality-oriented group, an appearance-oriented group, a practicality-and-appearance-oriented group, and an indifferent group. The significant discriminating variables were the fashion/appearance factor, need for exhibitionism, and need for sex. The results of this study can be used by enterprises to analyze consumers and to build advertising strategies for clothing.


Advances in Data-Driven Bandwidth Selection

  • Park, Byeong U.
    • Journal of the Korean Statistical Society
    • /
    • v.20 no.1
    • /
    • pp.1-28
    • /
    • 1991
  • Considerable progress on the problem of data-driven bandwidth selection in kernel density estimation has been made recently. The goal of this paper is to provide an introduction to the methods currently available, with discussion at both a practical and a nontechnical theoretical level. The main setting considered here is global bandwidth kernel estimation, but some recent results on variable bandwidth kernel estimation are also included.
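
As a concrete example of one data-driven selector, the sketch below picks a global bandwidth by likelihood cross-validation over a grid, with Silverman's rule of thumb as a plug-in reference point. This is only one of the methods such a survey covers (least-squares and biased cross-validation are others), and the simulated data and grid are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Simulated bimodal sample for which a global bandwidth is to be chosen.
rng = np.random.default_rng(0)
x = np.r_[rng.normal(0, 1, 200), rng.normal(4, 0.5, 100)].reshape(-1, 1)

# Likelihood cross-validation: pick the bandwidth maximizing held-out log-likelihood.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 0.5, 30)}, cv=10)
grid.fit(x)
h_cv = grid.best_params_["bandwidth"]
print(f"cross-validated global bandwidth: h = {h_cv:.3f}")

# Silverman's rule of thumb for a Gaussian kernel, as a simple plug-in reference.
n, sigma = len(x), x.std(ddof=1)
print(f"Silverman rule of thumb:          h = {1.06 * sigma * n ** (-1 / 5):.3f}")
```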
