Search | Korea Science

A Novel Feature Selection Method in the Categorization of Imbalanced Textual Data

Pouramini, Jafar;Minaei-Bidgoli, Behrouze;Esmaeili, Mahdi
- KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.8 / pp.3725-3748 / 2018
Text data distribution is often imbalanced. Imbalanced data is one of the challenges in text classification, as it degrades classifier performance. Many studies have addressed this problem. The proposed solutions fall into several general categories, including sampling-based and algorithm-based methods. In recent studies, feature selection has also been considered as a solution to the imbalance problem. In this paper, a novel one-sided feature selection method known as probabilistic feature selection (PFS) is presented for imbalanced text classification. PFS is a probabilistic method that is calculated using the feature distribution. Compared to similar methods, PFS has more parameters. To evaluate the performance of the proposed method, the feature selection methods Gini, MI, FAST and DFS were also implemented. To assess the proposed method, the C4.5 decision tree and Naive Bayes classifiers were used. The F-measure results of tests on the Reuters-21578 and WebKB data sets suggest that the proposed feature selection method significantly improves the performance of the classifiers.
https://doi.org/10.3837/tiis.2018.08.010
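
The abstract above reports results by F-measure rather than accuracy. A minimal sketch of why that matters on imbalanced data follows; the labels and the always-majority classifier are invented for illustration and are not taken from the paper.

```python
def f_measure(y_true, y_pred, positive, beta=1.0):
    """F-measure of one class, computed from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive prediction: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy data: 95 majority documents and 5 minority documents.
y_true = ["majority"] * 95 + ["minority"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["majority"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_f1 = f_measure(y_true, y_pred, positive="minority")
print(accuracy, minority_f1)
```

Accuracy stays high even though the minority class is never found, while the minority-class F-measure drops to zero.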

A Study on Developing a CER Using Production Cost Data in Korean Maneuver Weapon System (한국형 기동무기체계 양산비 비용추정관계식 개발에 관한 연구)

Lee, Doo-Hyun;Kim, Gak-Gyu
- Journal of the Korean Operations Research and Management Science Society / v.39 no.3 / pp.51-61 / 2014
In this paper, we develop a cost estimation relationship (CER) for Korean maneuverable weapons systems using historical production cost data. To develop the CER, we collected historical production cost data for four tanks and five armored vehicles. We also analyzed the Required Operational Capability (ROC) of the weapons systems and chose cost drivers that allow their operational capabilities to be compared. We used forward selection, backward selection, stepwise regression and $R^2$ selection to identify the cost drivers that have the greatest influence on the dependent variable. We then used principal component regression, robust regression and weighted regression to deal with multicollinearity and outliers in the data and to develop a more appropriate CER. As a result, we were able to develop a production cost CER for Korean maneuverable weapons systems that has the lowest cost errors. This research is meaningful in that it develops a CER based on original Korean cost data without foreign data, and these methods will contribute to developing a Korean cost analysis program in the future.
https://doi.org/10.7737/JKORMS.2014.39.3.051
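
Forward selection, one of the procedures named in the abstract above, can be sketched in a few lines. This is an illustrative sketch only: the feature names (`weight`, `power`, `noise`), the synthetic data, and the stopping rule (stop when the gain in $R^2$ is at most `min_gain`) are assumptions, not the paper's data or settings. It also assumes the candidate features are not collinear.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))


def forward_selection(features, y, min_gain=0.01):
    """Greedily add the feature that lowers RSS most; stop when the R^2 gain is small."""
    total = rss([], y)  # total sum of squares (intercept-only model)
    selected, current = [], total
    while len(selected) < len(features):
        chosen = [features[s] for s in selected]
        scores = {name: rss(chosen + [col], y)
                  for name, col in features.items() if name not in selected}
        best = min(scores, key=scores.get)
        if current - scores[best] <= min_gain * total:
            break
        selected.append(best)
        current = scores[best]
    return selected


# Synthetic data for nine hypothetical vehicles (not the paper's data).
features = {
    "weight": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "power": [5, 3, 8, 1, 9, 2, 7, 4, 6],
    "noise": [2, 7, 1, 8, 2, 8, 1, 8, 2],
}
y = [2 + 3 * w + p for w, p in zip(features["weight"], features["power"])]
selected = forward_selection(features, y)
print(selected)
```

Backward selection and stepwise regression follow the same pattern, removing features or alternating between adding and removing them.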

Comparison of model selection criteria in graphical LASSO (그래프 LASSO에서 모형선택기준의 비교)

Ahn, Hyeongseok;Park, Changyi
- Journal of the Korean Data and Information Science Society / v.25 no.4 / pp.881-891 / 2014
Graphical models can be used as an intuitive tool for modeling a complex stochastic system with a large number of variables related to each other, because the conditional independence between random variables can be visualized as a network. The graphical least absolute shrinkage and selection operator (LASSO) is considered to be effective in avoiding overfitting in the estimation of Gaussian graphical models for high-dimensional data. In this paper, we consider the model selection problem in graphical LASSO. In particular, we compare various model selection criteria via simulations and analyze a real financial data set.
https://doi.org/10.7465/jkdi.2014.25.4.881
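
For orientation, the standard graphical LASSO estimator is shown below, with a BIC-type criterion as one common way of choosing the tuning parameter. The abstract does not say which criteria the paper compares, so the BIC form is given only as an assumed example. Here $S$ is the sample covariance matrix, $\Theta$ the precision (inverse covariance) matrix, $n$ the sample size, and $\mathrm{df}(\lambda)$ the number of free non-zero parameters in $\hat{\Theta}_\lambda$.

```latex
\hat{\Theta}_\lambda
  = \arg\max_{\Theta \succ 0}
    \left\{ \log\det\Theta - \operatorname{tr}(S\Theta)
            - \lambda \lVert \Theta \rVert_1 \right\},
\qquad
\mathrm{BIC}(\lambda)
  = -n \left\{ \log\det\hat{\Theta}_\lambda
               - \operatorname{tr}\bigl(S\hat{\Theta}_\lambda\bigr) \right\}
    + \log(n)\,\mathrm{df}(\lambda).
```

A zero off-diagonal entry of $\hat{\Theta}_\lambda$ corresponds to a missing edge in the network, so choosing $\lambda$ amounts to choosing the graph.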

Subset selection in multiple linear regression: An improved Tabu search

Bae, Jaegug;Kim, Jung-Tae;Kim, Jae-Hwan
- Journal of Advanced Marine Engineering and Technology / v.40 no.2 / pp.138-145 / 2016
This paper proposes an improved tabu search method for subset selection in multiple linear regression models. Variable selection is a vital combinatorial optimization problem in multivariate statistics. Selecting the optimal subset of variables is necessary in order to reliably construct a multiple linear regression model. Its applications range widely from machine learning, time-series prediction, and multi-class classification to noise detection. Since this problem is NP-complete in nature, it becomes more difficult to find the optimal solution as the number of variables increases. Two typical metaheuristic methods have been developed to tackle the problem: the tabu search algorithm and the hybrid genetic and simulated annealing algorithm. However, these two methods have shortcomings. The tabu search method requires a large amount of computing time, and the hybrid algorithm produces a less accurate solution. To overcome these shortcomings, we propose an improved tabu search algorithm that reduces the moves of the neighborhood and adopts an effective move search strategy. To evaluate the performance of the proposed method, comparative studies are performed on small data sets from the literature and on large simulated data sets. Computational results show that the proposed method outperforms the two metaheuristic methods in terms of computing time and solution quality.
https://doi.org/10.5916/jkosme.2016.40.2.138
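
The basic mechanics of a tabu search over variable subsets can be sketched as follows. This is a plain textbook-style version with single-variable flip moves, a fixed tabu tenure, and an aspiration rule. It is not the paper's improved algorithm, whose neighborhood reduction and move strategy are not described in the abstract. The `toy_cost` function and `TARGET` set are hypothetical stand-ins for a regression fit criterion.

```python
def tabu_search(n_vars, cost, iterations=50, tenure=3):
    """Minimise cost(subset) over subsets of range(n_vars) using single-flip moves."""
    current = frozenset()
    best, best_cost = current, cost(current)
    tabu_until = {}  # variable index -> first iteration at which it may be flipped again
    for it in range(iterations):
        move, move_cost = None, None
        for v in range(n_vars):
            neighbour = current ^ {v}  # flip variable v into or out of the subset
            c = cost(neighbour)
            is_tabu = tabu_until.get(v, 0) > it
            # Aspiration rule: a tabu move is allowed only if it beats the best so far.
            if is_tabu and c >= best_cost:
                continue
            if move is None or c < move_cost:
                move, move_cost = v, c
        if move is None:
            break  # every move is tabu and none satisfies the aspiration rule
        current = current ^ {move}
        tabu_until[move] = it + 1 + tenure
        if move_cost < best_cost:
            best, best_cost = current, move_cost
    return best, best_cost


# Hypothetical cost: distance from an assumed "true" subset of variables.
TARGET = frozenset({0, 2, 5})


def toy_cost(subset):
    return len(subset ^ TARGET)


best, best_cost = tabu_search(8, toy_cost)
print(sorted(best), best_cost)
```

In a regression setting, `cost` would typically be a fit criterion computed from the model restricted to the chosen variables. The tabu list keeps the search from immediately undoing a move, which lets it leave local optima.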

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

Mehmood, Tahir;Rasheed, Zahid
- Communications for Statistical Applications and Methods / v.22 no.6 / pp.575-587 / 2015
Developments in data collection techniques result in high-dimensional data sets, where discrimination is an important and commonly encountered problem that is crucial to resolve when the high-dimensional data is heterogeneous (non-common variance-covariance structure for classes). An example of this is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important to study for evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (common variance-covariance structure for all classes), which is not guaranteed in most high-dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high-dimensional heterogeneous data sets, based on the recently introduced regularized elimination for variable selection in partial least squares (rePLS) and the heterogeneous classification procedure quadratic discriminant analysis (QDA). A comparison of the proposed and existing methods is conducted on a simulated data set; in addition, the proposed procedure is used to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages and their mutual interactions that influence the respective habitat preferences are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
https://doi.org/10.5351/CSAM.2015.22.6.575
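
The heterogeneity assumption mentioned in the abstract above is what separates QDA from linear discriminant analysis. The textbook QDA discriminant score for class $k$ is shown below, with class mean $\mu_k$, class covariance $\Sigma_k$ and prior probability $\pi_k$. This is the standard form, not a formula quoted from the paper.

```latex
\delta_k(x)
  = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
    - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
    + \log\pi_k,
\qquad
\hat{y}(x) = \arg\max_k \delta_k(x).
```

When all classes share one covariance matrix, the quadratic terms in $x$ cancel between classes and the rule becomes linear.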

Bayesian Model Selection in the Gamma Populations

Kang, Sang-Gil;Kang, Doo-Young
- Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1329-1341 / 2006
When X and Y have independent gamma distributions, we consider the problem of testing the equality of the two gamma means. We propose a solution to this problem based on a Bayesian model selection procedure in which no subjective input is considered. The reference prior is derived. Using the derived reference prior, we compute the fractional Bayes factor and the intrinsic Bayes factors. The posterior probability of each model is used as a model selection tool. A simulation study and a real data example are provided.
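
For reference, the fractional Bayes factor used in the entry above is usually defined as follows (O'Hagan's general form). Here $\pi_i^N$ is the noninformative prior and $f_i$ the likelihood under model $M_i$, $0 < b < 1$ is the fraction of the likelihood used to train the prior, and $p_i$ is the prior probability of model $M_i$. The paper's specific choices of prior and $b$ are not given in the abstract.

```latex
B^{F}_{21}(b) = \frac{q_2(b, \mathbf{x})}{q_1(b, \mathbf{x})},
\qquad
q_i(b, \mathbf{x})
  = \frac{\int \pi_i^{N}(\theta_i)\, f_i(\mathbf{x} \mid \theta_i)\, d\theta_i}
         {\int \pi_i^{N}(\theta_i)\, f_i(\mathbf{x} \mid \theta_i)^{b}\, d\theta_i},
\qquad
P(M_i \mid \mathbf{x})
  = \Bigl( \sum_{j} \frac{p_j}{p_i}\, B^{F}_{ji}(b) \Bigr)^{-1}.
```

The arbitrary constant of an improper prior appears in both the numerator and the denominator of $q_i$, so it cancels.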

Cox proportional hazard model with L1 penalty

Hwang, Chang-Ha;Shim, Joo-Yong
- Journal of the Korean Data and Information Science Society / v.22 no.3 / pp.613-618 / 2011
The proposed method is based on a penalized log partial likelihood of the Cox proportional hazard model with an L1 penalty. We use the iteratively reweighted least squares procedure to solve the L1-penalized log partial likelihood function of the Cox proportional hazard model. It provides efficient computation, including variable selection, and leads to a generalized cross validation function for model selection. Experimental results are then presented to indicate the performance of the proposed procedure.
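
In its standard form (no tied event times assumed), the L1-penalized log partial likelihood that such a method maximizes can be written as below. Here $\delta_i$ is the event indicator, $R(t_i)$ is the set of subjects still at risk at time $t_i$, and $\lambda \ge 0$ is the penalty parameter. The paper's exact formulation may differ.

```latex
\ell_\lambda(\beta)
  = \sum_{i=1}^{n} \delta_i
    \Bigl[ x_i^{\top}\beta
           - \log \sum_{j \in R(t_i)} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr]
    - \lambda \sum_{k=1}^{p} \lvert \beta_k \rvert .
```

The L1 term shrinks some coefficients exactly to zero, which is how estimation and variable selection happen in one step.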

Bayesian Model Selection in Weibull Populations

Kang, Sang-Gil
- Journal of the Korean Data and Information Science Society / v.18 no.4 / pp.1123-1134 / 2007
This article addresses the problem of testing whether the shape parameters in k independent Weibull populations are equal. We propose a Bayesian model selection procedure for equality of the shape parameters. The noninformative prior is usually improper, which yields a calibration problem: the Bayes factor is defined only up to a multiplicative constant. We therefore propose an objective Bayesian model selection procedure based on the fractional Bayes factor and the intrinsic Bayes factor under the reference prior. A simulation study and a real example are provided.

A Hybrid Efficient Feature Selection Model for High Dimensional Data Set based on KNHNAES (2013~2015) (KNHNAES (2013~2015) 에 기반한 대형 특징 공간 데이터집 혼합형 효율적인 특징 선택 모델)

Kwon, Tae il;Li, Dingkun;Park, Hyun Woo;Ryu, Kwang Sun;Kim, Eui Tak;Piao, Minghao
- Journal of Digital Contents Society / v.19 no.4 / pp.739-747 / 2018
With data in a large feature space, feature selection has become an extremely important procedure in the data mining process. However, traditional single-process feature selection methods may no longer be suited to this procedure. In this paper, we propose a hybrid efficient feature selection model for high-dimensional data. We applied our model to the KNHNAES data set, and the results show that it outperforms many existing methods in accuracy by at least 5%.
https://doi.org/10.9728/dcs.2018.19.4.739

Improved Network Intrusion Detection Model through Hybrid Feature Selection and Data Balancing (Hybrid Feature Selection과 Data Balancing을 통한 효율적인 네트워크 침입 탐지 모델)

Min, Byeongjun;Ryu, Jihun;Shin, Dongkyoo;Shin, Dongil
- KIPS Transactions on Software and Data Engineering / v.10 no.2 / pp.65-72 / 2021
Recently, attacks on network environments have been rapidly escalating and becoming more intelligent. As a result, the limitations of signature-based network intrusion detection systems are becoming clear. To solve these problems, research on machine learning-based intrusion detection systems is being conducted in many ways, but two problems arise when using machine learning for intrusion detection. The first is finding the important features for learning that enable real-time detection, and the second is the imbalance of the data used in learning. This problem is fatal because the performance of machine learning algorithms is data-dependent. In this paper, we propose HFS-DNN, a network intrusion detection model based on a deep neural network, to solve the problems presented above. The proposed HFS-DNN was trained on the NSL-KDD data set and compared in performance with existing classification models. Experiments confirmed that the proposed Hybrid Feature Selection algorithm does not degrade performance, and in an experiment between learning models that address the imbalance problem, the model proposed in this paper showed the best performance.
https://doi.org/10.3745/KTSDE.2021.10.2.65
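
The abstract above does not say which data balancing technique the paper uses. As one simple possibility, random oversampling duplicates minority-class records until all classes are the same size. The sketch below assumes that technique. The labels `normal` and `attack` and the integer "records" are hypothetical toy data, not NSL-KDD records.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing number of samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_samples.extend(xs + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Hypothetical toy data: 8 "normal" records and 2 "attack" records.
samples = list(range(10))
labels = ["normal"] * 8 + ["attack"] * 2

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated records do not leak into the evaluation data.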

