• Title/Summary/Keyword: data selection

Variable selection and prediction performance of penalized two-part regression with community-based crime data application

  • Seong-Tae Kim;Man Sik Park
    • Communications for Statistical Applications and Methods / v.31 no.4 / pp.441-457 / 2024
  • Semicontinuous data are characterized by a mixture of a point probability mass at zero and a continuous distribution of positive values. This type of data is often modeled using a two-part model, where the first part models the probability of a dichotomous outcome (zero or positive) and the second part models the distribution of the positive values. Despite the two-part model's popularity, variable selection in this model has not been fully addressed, especially for high-dimensional data. The objective of this study is to investigate the variable selection and prediction performance of penalized regression methods in two-part models. The performance of the selected techniques in the two-part model is evaluated via simulation studies. Our findings show that LASSO and ENET tend to select more predictors in the model than SCAD and MCP. Consequently, MCP and SCAD outperform LASSO and ENET for β-specificity, and LASSO and ENET perform better than MCP and SCAD with respect to the mean squared error. We find similar results when applying the penalized regression methods to the prediction of crime incidents using community-based data.
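
For orientation, the four penalties compared in this abstract can be written as small functions. This is only a sketch based on the standard textbook definitions of each penalty; the default values `a=3.7`, `gamma=3.0`, and the elastic-net mixing weight `alpha=0.5` are assumptions for illustration and are not taken from the paper.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: grows linearly with |beta| without bound."""
    return lam * abs(beta)


def enet_penalty(beta, lam, alpha=0.5):
    """Elastic net: mix of L1 and squared-L2 terms (alpha is an assumed weight)."""
    return lam * (alpha * abs(beta) + 0.5 * (1.0 - alpha) * beta ** 2)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (a > 2): constant once |beta| exceeds a * lam."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2.0 * a * lam * b - b ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return lam ** 2 * (a + 1.0) / 2.0


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty (gamma > 1): constant once |beta| exceeds gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b ** 2 / (2.0 * gamma)
    return gamma * lam ** 2 / 2.0
```

Near zero all four behave like an L1 penalty, while SCAD and MCP level off for large coefficients and LASSO and ENET keep growing.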

Extending the Scope of Automatic Time Series Model Selection: The Package autots for R

  • Jang, Dong-Ik;Oh, Hee-Seok;Kim, Dong-Hoh
    • Communications for Statistical Applications and Methods / v.18 no.3 / pp.319-331 / 2011
  • In this paper, we propose automatic procedures for the model selection of various univariate time series data. Automatic model selection is important, especially in data mining with a large number of time series, for example, the thousands of signals accessing a web server during a specific time period. Several methods have been proposed for automatic model selection of time series. However, most existing methods focus on linear time series models such as exponential smoothing and autoregressive integrated moving average (ARIMA) models. The key feature that distinguishes the proposed procedures from previous approaches is that they can be used for both linear time series models and nonlinear time series models, such as threshold autoregressive (TAR) models and autoregressive moving average-generalized autoregressive conditional heteroscedasticity (ARMA-GARCH) models. The proposed methods select a model from among the candidate models in the prediction error sense. We also provide an R package, autots, that implements the proposed automatic model selection procedures. We illustrate these algorithms with artificial and real data, and describe the implementation of the autots package for R.
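
The idea of selecting a model "in the prediction error sense" can be sketched as follows. This is not the autots package or its API; it is a hypothetical pure-Python illustration with three simple candidate forecasters (the function names, the smoothing weight `alpha=0.3`, and the toy series are all assumptions), choosing the one with the smallest one-step-ahead mean squared error on the last few observations.

```python
def naive_forecast(history):
    """Predict the next value as the last observed value."""
    return history[-1]


def mean_forecast(history):
    """Predict the next value as the mean of all observed values."""
    return sum(history) / len(history)


def ses_forecast(history, alpha=0.3):
    """Simple exponential smoothing; alpha is an assumed smoothing weight."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def rolling_mse(series, forecaster, n_test):
    """Mean squared one-step-ahead error over the last n_test points."""
    errors = []
    for t in range(len(series) - n_test, len(series)):
        prediction = forecaster(series[:t])  # only past values are used
        errors.append((series[t] - prediction) ** 2)
    return sum(errors) / len(errors)


def select_model(series, candidates, n_test):
    """Return the best candidate name and the score of every candidate."""
    scores = {name: rolling_mse(series, f, n_test) for name, f in candidates.items()}
    return min(scores, key=scores.get), scores


candidates = {"naive": naive_forecast, "mean": mean_forecast, "ses": ses_forecast}
trend = [float(t) for t in range(1, 21)]  # toy upward trend 1, 2, ..., 20
best, scores = select_model(trend, candidates, n_test=5)
```

On this toy trend the naive forecaster is off by exactly 1 at every step and is selected, because the other two candidates lag further behind the trend.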

Analyzing empirical performance of correlation based feature selection with company credit rank score dataset - Emphasis on KOSPI manufacturing companies -

  • Nam, Youn Chang;Lee, Kun Chang
    • Journal of the Korea Society of Computer and Information / v.21 no.4 / pp.63-71 / 2016
  • This paper applies an efficient data mining method that improves score calculation and the building of a credit rank score system. The main idea is to achieve these objectives by applying Correlation-based Feature Selection, which can also be used to quickly verify the appropriateness of existing rank scores. This study selected 2047 manufacturing companies on the KOSPI market during the period from 2009 to 2013 that have credit rank scores assigned by the NICE information service agency. A total of 80 relevant financial variables were collected from KIS-Value and DART (Data Analysis, Retrieval and Transfer System). If correlation-based feature selection can identify the more important variables, the required information and cost can be reduced significantly. The analysis shows that the proposed correlation-based feature selection method improves the selection and classification process of the credit rank system, so that accuracy and credibility increase while the cost of building the system decreases.
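
As a rough illustration of correlation-based feature selection, the sketch below scores a feature subset with the merit heuristic commonly associated with CFS, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-target correlation and r_ff the mean feature-feature correlation. Whether the paper uses exactly this form, and Pearson correlation in particular, is an assumption; the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cfs_merit(features, target):
    """Merit of a feature subset: high target correlation, low redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)
```

Under this heuristic, adding a feature that merely duplicates one already in the subset does not raise the merit, which is how the approach can keep the number of required variables down.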

Minimum Message Length and Classical Methods for Model Selection in Univariate Polynomial Regression

  • Viswanathan, Murlikrishna;Yang, Young-Kyu;WhangBo, Taeg-Keun
    • ETRI Journal / v.27 no.6 / pp.747-758 / 2005
  • The problem of selection among competing models has been a fundamental issue in statistical data analysis. Good fits to data can be misleading since they can result from properties of the model that have nothing to do with it being a close approximation to the source distribution of interest (for example, overfitting). In this study we focus on the preference among models from a family of polynomial regressors. Three decades of research have spawned a number of plausible techniques for the selection of models, namely, Akaike's Final Prediction Error (FPE) and Information Criterion (AIC), Schwarz's criterion (SCH), Generalized Cross Validation (GCV), Wallace's Minimum Message Length (MML), Minimum Description Length (MDL), and Vapnik's Structural Risk Minimization (SRM). The fundamental similarity between all these principles is their attempt to define an appropriate balance between the complexity of models and their ability to explain the data. This paper presents an empirical study of the above principles in the context of model selection, where the models under consideration are univariate polynomials. The paper includes a detailed empirical evaluation of the model selection methods on six target functions, with varying sample sizes and added Gaussian noise. The results from the study appear to provide strong evidence in support of the MML- and SRM-based methods over the other standard approaches (FPE, AIC, SCH and GCV).
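
The balance between fit and complexity that these criteria share can be illustrated with two of the classical ones. The sketch below assumes Gaussian errors and uses the usual forms AIC = n·ln(RSS/n) + 2k and Schwarz's criterion = n·ln(RSS/n) + k·ln(n), up to an additive constant. The residual sums of squares are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
from math import log


def aic(rss, n, k):
    """Akaike's Information Criterion for a Gaussian model with k parameters."""
    return n * log(rss / n) + 2 * k


def schwarz(rss, n, k):
    """Schwarz's criterion: heavier penalty per parameter when n is large."""
    return n * log(rss / n) + k * log(n)


def select_degree(rss_by_degree, n, criterion):
    """Return the polynomial degree with the smallest criterion value."""
    # A polynomial of degree d has d + 1 coefficients.
    return min(rss_by_degree, key=lambda d: criterion(rss_by_degree[d], n, d + 1))


# Illustrative, made-up residual sums of squares for degrees 1 to 4 (n = 30).
rss_by_degree = {1: 60.0, 2: 12.0, 3: 11.5, 4: 11.4}
best_aic = select_degree(rss_by_degree, 30, aic)
best_sch = select_degree(rss_by_degree, 30, schwarz)
```

With these numbers both criteria pick degree 2: the higher degrees reduce the residual error only slightly, which does not offset the extra penalty for their additional coefficients.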

Set Covering-based Feature Selection of Large-scale Omics Data (Set Covering 기반의 대용량 오믹스데이터 특징변수 추출기법)

  • Ma, Zhengyu;Yan, Kedong;Kim, Kwangsoo;Ryoo, Hong Seo
    • Journal of the Korean Operations Research and Management Science Society / v.39 no.4 / pp.75-84 / 2014
  • In this paper, we address the feature selection problem for large-scale, high-dimensional biological data such as omics data. For this problem, most previous approaches used a simple score function to reduce the number of original variables and then selected features from the small number of remaining variables. Methods that do not rely on filtering techniques either do not consider the interactions between variables or generate approximate solutions to a simplified problem. In contrast, by combining set covering and clustering techniques, we developed a new method that can handle the full set of variables and consider the combinatorial effects of variables when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results show that our method selects high-quality features that yield more accurate classifiers than other feature selection algorithms.
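
To give a flavor of how feature selection can be cast as set covering, the sketch below uses a generic greedy heuristic: each feature "covers" the pairs of samples from different classes that it distinguishes, and features are added until every coverable pair is covered. This is not the authors' method (which also involves clustering); discretized feature values and all function names are assumptions for illustration.

```python
def separating_pairs(column, labels):
    """Pairs (i, j) with different labels whose values differ in this feature."""
    n = len(labels)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] != labels[j] and column[i] != column[j]
    }


def greedy_cover_features(columns, labels):
    """Greedily pick features until every coverable cross-class pair is separated."""
    covers = {name: separating_pairs(col, labels) for name, col in columns.items()}
    uncovered = set().union(*covers.values())
    chosen = []
    while uncovered:
        # Take the feature that separates the most still-uncovered pairs.
        best = max(covers, key=lambda name: len(covers[name] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen
```

Because a pair of samples may need several features jointly to be told apart from the rest, a covering formulation of this kind looks at features in combination rather than scoring each one in isolation.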

On the Data Features for Neighbor Path Selection in Computer Network with Regional Failure

  • Yong-Jin Lee
    • International journal of advanced smart convergence / v.12 no.3 / pp.13-18 / 2023
  • This paper investigates data features for neighbor path selection (NPS) in computer networks with regional failures. When regional failures caused by earthquakes or forest fires occur simultaneously, an available alternate communication path must be found in advance. We describe previous general heuristics and a simulation heuristic for solving the NPS problem in a network with regional faults. The data features of the general heuristics, which use proximity and a sharing factor, and the data features of the simulation heuristic, which uses machine learning, are explained through examples. The simulation heuristic may be better than the general heuristics in terms of communication success. However, additional data features are necessary in order to apply the simulation heuristic to a real environment. We propose novel data features for NPS in computer networks with regional failures, together with Keras modeling for computing the communication success probability of a candidate neighbor path.

Effect of Eating-Out Consumption Propensity on Selection Attributes for Dessert Cafe (외식소비성향이 디저트 카페 선택속성에 미치는 영향)

  • Yoon, Jung-Suk
    • Culinary science and hospitality research / v.23 no.7 / pp.31-41 / 2017
  • The purpose of this study was to verify the relationship between eating-out consumption propensity and selection attributes among consumers who have used dessert cafes, and to provide useful data for dessert cafe managers to establish efficient marketing strategies. The survey was conducted from April 7 to April 22, 2017. A total of 250 questionnaires were distributed, of which 232 were used for analysis after excluding responses containing missing data. The results of this study are as follows. First, the eating-out enjoyment pursuit type and the health pursuit type had significant effects on the menu and service factor of selection attributes for dessert cafes. Second, only the eating-out enjoyment pursuit type had a significant effect on the visual factor. Third, only the health pursuit type had a significant effect on the health menu factor. Fourth, the economic value pursuit type and the atmosphere pursuit type had significant effects on the price factor. By examining the selection attributes of dessert cafes in relation to eating-out consumption propensity, this study provides useful results for dessert cafe marketers establishing efficient marketing strategies.

Study on Correlation-based Feature Selection in an Automatic Quality Inspection System using Support Vector Machine (SVM) (SVM 기반 자동 품질검사 시스템에서 상관분석 기반 데이터 선정 연구)

  • Song, Donghwan;Oh, Yeong Gwang;Kim, Namhun
    • Journal of Korean Institute of Industrial Engineers / v.42 no.6 / pp.370-376 / 2016
  • Manufacturing data analysis and its applications are gaining great popularity in various industries. Despite the fast advancement of big data analysis technology, however, the manufacturing quality data monitored by automated inspection systems are sometimes not reliable enough because of the complex patterns of product quality. In this study, we therefore aim to define the level of trust of an automated quality inspection system and to improve the reliability of the quality inspection data. Using correlation analysis and feature selection, this paper presents a method for improving inspection accuracy and efficiency in an SVM-based automatic product quality inspection system that uses thermal image data in an auto part manufacturing case. The proposed method is implemented in the sealer dispensing process of automobile manufacturing and verified by analyzing the optimal feature selection from the quality analysis results.

A Study on the Construction and Site Selection of the Cloud Data Center considering Disaster Information (재해정보를 고려한 클라우드 데이터센터 입지선정에 관한 연구)

  • Kim, Ki-Uk;Kim, Chang-Soo
    • Journal of the Korea Institute of Information and Communication Engineering / v.16 no.12 / pp.2575-2580 / 2012
  • The aim of this paper is to analyze factors for the site selection of a cloud data center and to develop a GIS-based spatial data model that considers disaster information. Historical areas of natural and human-made disasters are considered in analyzing the location of the cloud data center. The model is developed using the ArcGIS software tool. The model is applied to Busan city using storm and flood disaster data, and a small administrative district located in Kang-Seo-Gu is selected as the site of the cloud data center of Busan.

Active Selection of Label Data for Semi-Supervised Learning Algorithm (준감독 학습 알고리즘을 위한 능동적 레이블 데이터 선택)

  • Han, Ji-Ho;Park, Eun-Ae;Park, Dong-Chul;Lee, Yunsik;Min, Soo-Young
    • Journal of IKEEE / v.17 no.3 / pp.254-259 / 2013
  • The choice of labeled data in a semi-supervised learning algorithm can affect the performance of the resulting classifier. In order to select the labeled data required for training a semi-supervised learning algorithm, a VCNN (Vector Centroid Neural Network) is proposed in this paper. The proposed method for selecting labeled data is evaluated on a UCI dataset and a Caltech dataset. Experimental results show that the proposed selection method outperforms conventional methods in terms of classification accuracy and minimum error rate.