• Title/Summary/Keyword: data selection


Bayesian Parameter Estimation and Variable Selection in Random Effects Generalised Linear Models for Count Data

  • Oh, Man-Suk;Park, Tae-Sung
    • Journal of the Korean Statistical Society
    • /
    • v.31 no.1
    • /
    • pp.93-107
    • /
    • 2002
  • Random effects generalised linear models are useful for analysing clustered count data in which responses are usually correlated. We propose a Bayesian approach to parameter estimation and variable selection in random effects generalised linear models for count data. A simple Gibbs sampling algorithm for parameter estimation is presented, and variable selection is carried out simply and efficiently using the Gibbs output. An illustrative example is provided.

A Feature Selection Method Based on Fuzzy Cluster Analysis (퍼지 클러스터 분석 기반 특징 선택 방법)

  • Rhee, Hyun-Sook
    • The KIPS Transactions:PartB
    • /
    • v.14B no.2
    • /
    • pp.135-140
    • /
    • 2007
  • Feature selection is a preprocessing technique commonly used on high-dimensional data. It studies how to select a subset or list of attributes that are used to construct models describing the data. Feature selection methods attempt to explore the data's intrinsic properties by employing statistics or information theory. Recent developments have involved approaches such as correlation methods, dimensionality reduction, and mutual information techniques. Feature selection has become the focus of much research in application areas with massive and complex data sets. In this paper, we provide a feature selection method that considers data characteristics and generalization capability. It provides a computational approach to feature selection based on fuzzy cluster analysis of its attribute values and its performance measures. We apply it to a system for classifying computer viruses and compare it with a heuristic method that uses the contrast concept. Experimental results show that the proposed approach can give a feature ranking, select features, and improve system performance.

The Game Selection Model for the Payoff Strategy Optimization of Mobile CrowdSensing Task

  • Zhao, Guosheng;Liu, Dongmei;Wang, Jian
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.4
    • /
    • pp.1426-1447
    • /
    • 2021
  • The payoff game between task publishers and users in the mobile crowdsensing environment is a hot research topic. An optimal payoff selection model based on a stochastic evolutionary game is proposed. First, the process of optimal payoff selection is modeled as a stochastic evolutionary game between task publishers and users. Second, low-quality data is identified by a data quality evaluation algorithm, which improves how well perceptual tasks are matched to target users, so that task publishers and users can obtain the optimal payoff at the current moment. Finally, by solving for the stable strategy and analyzing the stability of the model, the optimal payoff strategy is obtained under different intensities of random interference and different initial states. The simulation results show that, for data quality evaluation, the proposed model improves the accuracy of anomalous data detection by 8.1% and 0.5% over the BP and SVM detection methods respectively, and improves the accuracy of data classification by 59.2% and 32.2% respectively. For optimal payoff strategy selection, it is verified that the proposed model can reasonably select the payoff strategy.

A Study on Classifications of Remote Sensed Multispectral Image Data using Soft Computing Technique - Stressed on Rough Sets - (소프트 컴퓨팅기술을 이용한 원격탐사 다중 분광 이미지 데이터의 분류에 관한 연구 -Rough 집합을 중심으로-)

  • Won Sung-Hyun
    • Management & Information Systems Review
    • /
    • v.3
    • /
    • pp.15-45
    • /
    • 1999
  • Computer-based techniques for processing remotely sensed image data are recognized as essential in many fields, such as environmental observation, land cultivation, resource investigation, military trend analysis, and agricultural yield estimation. In particular, accurate classification and analysis of remotely sensed image data are important factors in determining the reliability of remote sensing image processing systems, and much research has been carried out to improve this accuracy. Traditionally, such systems have processed 2 or 3 bands selected from multiple bands, with statistical separability or wavelength properties as the selection criteria. However, as sensing environments change from multispectral to hyperspectral, the need has arisen for band selection based on data distribution characteristics rather than on wavelength properties or statistical separability. In this paper, for efficient data classification in a multispectral band environment, a band feature extraction method using rough set theory is proposed. First, we build a lookup table from training data and analyze the properties of the experimental multispectral image data; then we select the efficient bands from the analysis results using the indiscernibility relation of rough set theory. The proposed method is applied to LANDSAT TM data from 2 June 1992. The results show clustering trends similar to those of traditional band selection by wavelength properties, which verifies that the proposed method, centered on data properties, can be used to select efficient bands even as the sensing environment changes to hyperspectral bands.

Improving an Ensemble Model Using Instance Selection Method (사례 선택 기법을 활용한 앙상블 모형의 성능 개선)

  • Min, Sung-Hwan
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.39 no.1
    • /
    • pp.105-115
    • /
    • 2016
  • Ensemble classification involves combining individually trained classifiers to yield more accurate predictions than individual models. Ensemble techniques are very useful for improving the generalization ability of classifiers. The random subspace ensemble technique is a simple but effective method for constructing ensemble classifiers; it involves randomly drawing a subset of the features for each classifier in the ensemble. The instance selection technique involves selecting critical instances while removing irrelevant and noisy instances from the original dataset. The instance selection and random subspace methods are both well known in the field of data mining and have proven to be very effective in many applications. However, few studies have focused on integrating the two. Therefore, this study proposes a new hybrid ensemble model that integrates instance selection and random subspace techniques using genetic algorithms (GAs) to improve the performance of a random subspace ensemble model. GAs are used to select optimal (or near-optimal) instances, which are used as input data for the random subspace ensemble model. The proposed model was applied to both Kaggle credit data and corporate credit data, and the results were compared with those of other models in terms of classification accuracy, levels of diversity, and average classification rates of base classifiers in the ensemble. The experimental results demonstrated that the proposed model outperformed other models, including the single model, the instance selection model, and the original random subspace ensemble model.
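
The random subspace idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the GA-based instance selection step is omitted, and the nearest-centroid base classifier, the function names, and the parameters `n_models`, `subspace_size`, and `seed` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def train_centroid(X, y, features):
    """Fit a nearest-centroid classifier using only the given feature indices."""
    sums = {}
    counts = Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return features, centroids


def predict_centroid(model, row):
    """Return the label whose centroid is closest in the model's feature subspace."""
    features, centroids = model

    def sq_dist(centroid):
        return sum((row[f] - centroid[j]) ** 2 for j, f in enumerate(features))

    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def random_subspace_fit(X, y, n_models=11, subspace_size=2, seed=0):
    """Train n_models base classifiers, each on a random subset of the features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [
        train_centroid(X, y, rng.sample(range(n_features), subspace_size))
        for _ in range(n_models)
    ]


def random_subspace_predict(models, row):
    """Combine the base classifiers by majority vote."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]
```

In a hybrid scheme like the one the abstract describes, the rows passed to `random_subspace_fit` would first be filtered by an instance selection step; here the full training set is used as-is.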

FAFS: A Fuzzy Association Feature Selection Method for Network Malicious Traffic Detection

  • Feng, Yongxin;Kang, Yingyun;Zhang, Hao;Zhang, Wenbo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.1
    • /
    • pp.240-259
    • /
    • 2020
  • Analyzing network traffic is the basis of dealing with network security issues. Most network security systems depend on feature selection over network traffic data, and the ability to detect malicious traffic can be improved by choosing the right feature selection method. In this paper, a Fuzzy Association Feature Selection (FAFS) method is proposed for malicious network traffic detection. Association rules, which reflect the relationships among different characteristic attributes of network traffic data, are mined by association analysis. The membership values of the association rules are obtained by fuzzy reasoning. The data features with the highest correlation intensity in network data sets are determined by comparing the membership values of the association rules. The FAFS method reduces the dimension of the data features and improves the detection ability of malicious traffic detection algorithms. To verify the effect of feature selection by FAFS, the method is used to select data features from different datasets. The K-Nearest Neighbor, C4.5 Decision Tree, and Naïve Bayes algorithms are then tested on these datasets. Moreover, FAFS is compared with classical feature selection methods. The analysis of experimental results shows that the FAFS method can significantly improve the precision and recall of malicious traffic detection, which provides a valuable reference for building network security systems.

Discretization Method Based on Quantiles for Variable Selection Using Mutual Information

  • Cha, Woon-Ock;Huh, Moon-Yul
    • Communications for Statistical Applications and Methods
    • /
    • v.12 no.3
    • /
    • pp.659-672
    • /
    • 2005
  • This paper evaluates discretization of continuous variables to select relevant variables for supervised learning using mutual information. Three discretization methods are considered: MDL, Histogram, and 4-Intervals. The process of discretization and variable subset selection is evaluated according to classification accuracy on 6 real data sets from the UCI databases. Results show that the 4-Intervals discretization method, which is based on quantiles, is robust and efficient for the variable selection process. We also visually evaluate the appropriateness of the selected subset of variables.
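
As a rough illustration of the approach this abstract describes, the sketch below discretizes each continuous variable into four quantile-based intervals and ranks variables by their empirical mutual information with the class label. The exact cut-point rule of the paper's 4-Intervals method is not given in the abstract, so the quartile rule used here, along with all function names, is an assumption.

```python
import math
from collections import Counter


def quartile_bins(values):
    """Map each value to one of 4 intervals using empirical quartile cut points."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed rule: cut at the order statistics nearest the 25%, 50%, 75% points.
    cuts = [ordered[(n * k) // 4] for k in (1, 2, 3)]
    return [sum(v >= c for c in cuts) for v in values]


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def rank_variables(columns, labels):
    """Rank named continuous variables by mutual information with the labels."""
    scores = {
        name: mutual_information(quartile_bins(col), labels)
        for name, col in columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Variables at the top of the ranking returned by `rank_variables` would be the candidates to keep; how many to retain is a separate choice that this sketch does not address.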

H-likelihood approach for variable selection in gamma frailty models

  • Ha, Il-Do;Cho, Geon-Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.23 no.1
    • /
    • pp.199-207
    • /
    • 2012
  • Recently, variable selection methods using penalized likelihood with a shrinkage penalty function have been widely studied in various statistical models, including generalized linear models and survival models. In particular, they select important variables and estimate the coefficients of covariates simultaneously. In this paper, we develop a penalized h-likelihood method for variable selection in gamma frailty models. For this we use the smoothly clipped absolute deviation (SCAD) penalty function, which has good properties for variable selection. The proposed method is illustrated using a simulation study and a practical data set.
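
For reference, the SCAD penalty mentioned in this abstract can be written as a short function. This is the standard piecewise form of the penalty with tuning parameters `lam` and `a` (with `a > 2`); the default `a = 3.7` is the value commonly used in the literature, not something stated in the abstract, and the function name is an assumption.

```python
def scad_penalty(theta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty for a single coefficient.

    Linear like LASSO near zero, quadratic in a transition zone, and
    constant for large |theta|, so large coefficients are not over-shrunk.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2
```

The three pieces agree at the boundaries `|theta| = lam` and `|theta| = a * lam`, so the penalty is continuous; it is the flat tail that distinguishes it from the LASSO penalty `lam * |theta|`.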

Variable selection in Poisson HGLMs using h-likelihood

  • Ha, Il Do;Cho, Geon-Ho
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.6
    • /
    • pp.1513-1521
    • /
    • 2015
  • Selecting relevant variables for a statistical model is very important in regression analysis. Recently, variable selection methods using a penalized likelihood have been widely studied in various regression models. The main advantage of these methods is that they select important variables and estimate the regression coefficients of the covariates simultaneously. In this paper, we propose a simple procedure based on a penalized h-likelihood (HL) for variable selection in Poisson hierarchical generalized linear models (HGLMs) for correlated count data. For this we consider three penalty functions (LASSO, SCAD and HL) and derive the corresponding variable selection procedures. The proposed method is illustrated using a practical example.

Genomic Selection for Adjacent Genetic Markers of Yorkshire Pigs Using Regularized Regression Approaches

  • Park, Minsu;Kim, Tae-Hun;Cho, Eun-Seok;Kim, Heebal;Oh, Hee-Seok
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.27 no.12
    • /
    • pp.1678-1683
    • /
    • 2014
  • This study considers the problem of genomic selection (GS) for adjacent genetic markers of Yorkshire pigs, which are typically correlated. GS has been widely used to efficiently estimate target variables such as molecular breeding values using markers across the entire genome. Recently, GS has been applied to animals as well as plants, especially to pigs. For efficient selection of variables with specific traits in pig breeding, a variable selection method should have the following properties: i) it produces a simple model by identifying insignificant variables; ii) it improves the accuracy of the prediction of future data; and iii) it can handle high-dimensional data in which the number of variables is larger than the number of observations. In this paper, we applied several variable selection methods, including the least absolute shrinkage and selection operator (LASSO), fused LASSO, and elastic net, to data with 47K single nucleotide polymorphisms and litter size for 519 observed sows. Based on experiments, we observed that the fused LASSO outperforms the other approaches.
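
To show how a LASSO-type method identifies insignificant variables, here is a minimal coordinate descent sketch for the plain LASSO objective (1/(2n))·||y − Xb||² + λ·||b||₁. It is not the authors' code and does not implement the fused LASSO or elastic net; the function names, the fixed number of sweeps, and the toy data are assumptions for illustration only.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return b


# Toy data (hypothetical): the first column drives y, the second barely matters.
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_demo = [2.0, 0.1, 2.0, 0.1]
coefficients = lasso_coordinate_descent(X_demo, y_demo, lam=0.1)
```

On the toy data the weak second coefficient is set exactly to zero, which is the behavior property i) in the abstract refers to. Real marker data with tens of thousands of columns would call for an optimized library implementation rather than this pure-Python loop.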