• Title/Summary/Keyword: data selection

Search Result 5,710

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

  • Pouramini, Jafar;Minaei-Bidgoli, Behrouze;Esmaeili, Mahdi
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 12, No. 8 / pp.3725-3748 / 2018
  • Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it degrades classifier performance. Many studies have addressed this problem, and the proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. Recent studies have also considered feature selection as a solution to the imbalance problem. This paper presents a novel one-sided feature selection method, probabilistic feature selection (PFS), for imbalanced text classification. PFS is a probabilistic method calculated from the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the proposed method, the feature selection methods Gini, MI, FAST and DFS were implemented for comparison, and the C4.5 decision tree and Naive Bayes classifiers were used. Test results on the Reuters-21578 and WebKB data sets, measured by F-measure, suggest that the proposed feature selection significantly improves classifier performance.
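The abstract above reports results by F-measure, which suits imbalanced data better than accuracy. The following is a minimal sketch of that metric computed from confusion counts for the minority class; the function name `f_measure` and the example counts are illustrative assumptions, not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from confusion counts for the positive (minority) class.

    tp, fp, fn are true positives, false positives and false negatives.
    Returns 0.0 when precision and recall are both zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: a classifier that labels every document as the
# majority class finds no minority documents, so its F-measure is 0
# even if its accuracy is high.
always_majority = f_measure(tp=0, fp=0, fn=50)

# Hypothetical counts for a classifier that does find minority documents.
some_recall = f_measure(tp=30, fp=10, fn=20)
```

With these made-up counts, the first classifier scores 0 on F-measure despite labeling most documents correctly, which is why the metric is a common choice for imbalanced text classification.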

A Study on Developing a CER Using Production Cost Data in Korean Maneuver Weapon System

  • 이두현;김각규
    • Journal of the Korean Operations Research and Management Science Society / Vol. 39, No. 3 / pp.51-61 / 2014
  • In this paper, we develop a cost estimating relationship (CER) for Korean maneuver weapon systems using historical production costs. To develop the CER, we collected historical production cost data for four tanks and five armored vehicles. We also analyzed the Required Operational Capability (ROC) of the weapon systems and chose cost drivers that allow their operational capabilities to be compared. We used forward selection, backward selection, stepwise regression and $R^2$ selection to identify the cost drivers with the greatest influence on the dependent variable. We then used principal component regression, robust regression and weighted regression to handle multicollinearity and outliers in the data and obtain a more appropriate CER. As a result, we developed a production cost CER for Korean maneuver weapon systems with the lowest cost errors. This research is meaningful in that it develops a CER from original Korean cost data without foreign data, and these methods will contribute to developing a Korean cost analysis program in the future.

Comparison of model selection criteria in graphical LASSO

  • 안형석;박창이
    • Journal of the Korean Data and Information Science Society / Vol. 25, No. 4 / pp.881-891 / 2014
  • Graphical models can represent conditional independence among random variables visually as a network, so they serve as an intuitive tool for complex stochastic systems in which many variables are interconnected, such as those in bioinformatics or social networks. Graphical LASSO (graphical least absolute shrinkage and selection operator) is a method known to be effective at preventing overfitting when estimating Gaussian graphical models for high-dimensional data. In this paper, we consider model selection, a very important problem in graphical LASSO estimation. In particular, we compare several model selection criteria through simulation and analyze real financial data.
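For orientation, the standard graphical LASSO estimate can be written as the penalized Gaussian log-likelihood problem below. The symbols are assumptions introduced here, not notation from the paper: $S$ is the sample covariance matrix, $\Theta$ the precision (inverse covariance) matrix, and $\lambda \ge 0$ the tuning parameter whose value the model selection criteria are meant to choose.

```latex
\hat{\Theta}_{\lambda}
  = \arg\max_{\Theta \succ 0}
    \Bigl\{ \log\det\Theta - \operatorname{tr}(S\Theta)
            - \lambda \lVert \Theta \rVert_{1} \Bigr\},
\qquad
\lVert \Theta \rVert_{1} = \sum_{j,k} \lvert \theta_{jk} \rvert .
```

A zero entry $\theta_{jk}$ in the estimate corresponds to a missing edge between variables $j$ and $k$, so a larger $\lambda$ yields a sparser graph.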

Subset selection in multiple linear regression: An improved Tabu search

  • Bae, Jaegug;Kim, Jung-Tae;Kim, Jae-Hwan
    • Journal of Advanced Marine Engineering and Technology / Vol. 40, No. 2 / pp.138-145 / 2016
  • This paper proposes an improved tabu search method for subset selection in multiple linear regression models. Variable selection is a vital combinatorial optimization problem in multivariate statistics. Selecting the optimal subset of variables is necessary to construct a reliable multiple linear regression model. Its applications range widely, from machine learning, time-series prediction, and multi-class classification to noise detection. Since this problem is NP-complete, the optimal solution becomes harder to find as the number of variables increases. Two typical metaheuristic methods have been developed to tackle the problem: the tabu search algorithm and a hybrid genetic and simulated annealing algorithm. However, both methods have shortcomings. The tabu search method requires a large amount of computing time, and the hybrid algorithm produces a less accurate solution. To overcome these shortcomings, we propose an improved tabu search algorithm that reduces the neighborhood moves and adopts an effective move search strategy. To evaluate the performance of the proposed method, comparative studies are performed on small data sets from the literature and on large simulated data sets. Computational results show that the proposed method outperforms the two metaheuristic methods in terms of computing time and solution quality.
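To illustrate the general idea, here is a minimal sketch of a plain tabu search over variable subsets. It is not the improved algorithm from the paper: the one-variable flip move, the fixed tabu tenure, the aspiration rule, and the names `tabu_subset_search` and `toy_score` are all assumptions made for illustration. In practice `score` would be a regression criterion to minimize, such as AIC or residual sum of squares for the chosen subset.

```python
from collections import deque
from typing import Callable, FrozenSet, Tuple


def tabu_subset_search(
    n_vars: int,
    score: Callable[[FrozenSet[int]], float],
    n_iter: int = 50,
    tenure: int = 3,
) -> Tuple[FrozenSet[int], float]:
    """Generic tabu search that minimizes score(subset) over variable subsets.

    A move flips one variable in or out of the current subset. A flipped
    variable stays tabu for `tenure` iterations unless the move would beat
    the best score seen so far (aspiration).
    """
    current: FrozenSet[int] = frozenset()
    best, best_score = current, score(current)
    tabu: deque = deque(maxlen=tenure)  # most recently flipped variables

    for _ in range(n_iter):
        candidates = []
        for v in range(n_vars):
            neighbor = current ^ {v}  # add v if absent, drop it if present
            s = score(neighbor)
            if v not in tabu or s < best_score:
                candidates.append((s, v, neighbor))
        if not candidates:
            break
        # Take the best admissible move, even if it worsens the score;
        # accepting worse moves is what lets the search leave local optima.
        s, v, current = min(candidates, key=lambda c: (c[0], c[1]))
        tabu.append(v)
        if s < best_score:
            best, best_score = current, s
    return best, best_score


# Toy objective (hypothetical): distance from a known "true" subset {1, 3}.
def toy_score(subset: FrozenSet[int]) -> float:
    return float(len(subset ^ {1, 3}))


best_subset, best_value = tabu_subset_search(n_vars=5, score=toy_score)
```

Each iteration of this sketch scores every neighbor, which is the kind of cost the abstract says the improved method reduces by restricting the neighborhood moves.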

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

  • Mehmood, Tahir;Rasheed, Zahid
    • Communications for Statistical Applications and Methods / Vol. 22, No. 6 / pp.575-587 / 2015
  • Advances in data collection techniques produce high-dimensional data sets, where discrimination is an important and commonly encountered problem that is crucial to resolve when the data are heterogeneous (classes do not share a common variance-covariance structure). An example is classifying microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high-dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high-dimensional heterogeneous data sets. It combines the recently introduced regularized elimination for variable selection in partial least squares (rePLS) with quadratic discriminant analysis (QDA), a classification procedure for heterogeneous data. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory, ranging from 89.1% to 100% on test data. Codon/bi-codon usages of interest, and the mutual interactions that influence each habitat preference, are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.

Bayesian Model Selection in the Gamma Populations

  • Kang, Sang-Gil;Kang, Doo-Young
    • Journal of the Korean Data and Information Science Society / Vol. 17, No. 4 / pp.1329-1341 / 2006
  • When X and Y have independent gamma distributions, we consider the problem of testing the equality of the two gamma means. We propose a solution to this problem based on a Bayesian model selection procedure that requires no subjective input. The reference prior is derived. Using the derived reference prior, we compute the fractional Bayes factor and the intrinsic Bayes factors. The posterior probability of each model is used as a model selection tool. A simulation study and a real data example are provided.

Cox proportional hazard model with L1 penalty

  • Hwang, Chang-Ha;Shim, Joo-Yong
    • Journal of the Korean Data and Information Science Society / Vol. 22, No. 3 / pp.613-618 / 2011
  • The proposed method is based on the penalized log partial likelihood of the Cox proportional hazard model with an L1 penalty. We use the iteratively reweighted least squares procedure to solve the L1-penalized log partial likelihood function of the Cox proportional hazard model. It provides efficient computation, including variable selection, and leads to a generalized cross validation function for model selection. Experimental results are then presented to indicate the performance of the proposed procedure.
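As a reference point, the L1-penalized Cox objective is usually written as below. The notation is an assumption introduced here rather than taken from the paper: $x_i$ is the covariate vector of subject $i$, $\delta_i$ the event indicator (1 if the event was observed, 0 if censored), $R(t_i)$ the set of subjects still at risk at time $t_i$, and $\lambda \ge 0$ the penalty parameter. Ties in event times are assumed absent.

```latex
\ell(\beta)
  = \sum_{i=1}^{n} \delta_i
    \Bigl[ x_i^{\top}\beta
           - \log \sum_{j \in R(t_i)} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr],
\qquad
\hat{\beta}
  = \arg\min_{\beta}
    \Bigl\{ -\ell(\beta) + \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert \Bigr\}.
```

Because the L1 penalty can shrink some coefficients $\beta_k$ exactly to zero, fitting the model also performs variable selection, and $\lambda$ is the quantity a criterion such as generalized cross validation would be used to choose.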

Bayesian Model Selection in Weibull Populations

  • Kang, Sang-Gil
    • Journal of the Korean Data and Information Science Society / Vol. 18, No. 4 / pp.1123-1134 / 2007
  • This article addresses the problem of testing whether the shape parameters in k independent Weibull populations are equal. We propose a Bayesian model selection procedure for equality of the shape parameters. The noninformative prior is usually improper, which creates a calibration problem: the Bayes factor is defined only up to a multiplicative constant. We therefore propose an objective Bayesian model selection procedure based on the fractional Bayes factor and the intrinsic Bayes factor under the reference prior. A simulation study and a real example are provided.

A Hybrid Efficient Feature Selection Model for High Dimensional Data Set based on KNHANES (2013~2015)

  • 권태일;이정곤;박현우;류광선;김의탁;박명호
    • Journal of Digital Contents Society / Vol. 19, No. 4 / pp.739-747 / 2018
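  • For high-dimensional data, feature selection has become a very important step among data mining techniques. However, traditional single feature selection methods may no longer be suitable as efficient feature selection techniques. In this paper, we propose a hybrid feature selection technique for efficient feature selection on high-dimensional data. Applying the proposed hybrid feature selection technique to KNHANES data for classification improved accuracy by more than 5% over models that use existing classification techniques.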

Improved Network Intrusion Detection Model through Hybrid Feature Selection and Data Balancing

  • 민병준;유지훈;신동규;신동일
    • KIPS Transactions on Software and Data Engineering / Vol. 10, No. 2 / pp.65-72 / 2021
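  • Attacks on network environments are rapidly becoming more sophisticated and intelligent, so the limitations of existing signature-based intrusion detection systems are becoming clear. To address this, research on machine-learning-based intrusion detection systems is actively under way. However, using machine learning for intrusion detection faces two problems. The first is selecting the important features relevant to learning for real-time detection. The second is imbalance in the training data, which is critical because machine learning algorithms depend on their data. To solve these problems, this paper proposes HFS-DNN, a deep neural network-based network intrusion detection model that uses Hybrid Feature Selection and Data Balancing. The model is trained on the NSL-KDD data set, and its performance is compared with existing classification models. We confirmed that the proposed Hybrid Feature Selection algorithm does not distort the performance of the learning model, and in experiments among learning models with the imbalance resolved, the proposed model showed the best performance.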
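The abstract does not say which data balancing technique HFS-DNN uses. As one common possibility, the sketch below shows plain random oversampling, which duplicates minority-class rows until every class matches the largest one. The function name `random_oversample` and the toy rows and labels are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    rows: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[Hashable]]:
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class. Original rows are all kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    if not by_class:
        return [], []

    target = max(len(members) for members in by_class.values())
    out_rows: List[Sequence[float]] = []
    out_labels: List[Hashable] = []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy traffic records (hypothetical): four "normal" rows, one "attack" row.
toy_rows = [[0.1], [0.2], [0.3], [0.4], [9.9]]
toy_labels = ["normal", "normal", "normal", "normal", "attack"]
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.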