• Title/Summary/Keyword: Regression Study


Fast robust variable selection using VIF regression in large datasets (대형 데이터에서 VIF회귀를 이용한 신속 강건 변수선택법)

  • Seo, Han Son
    • The Korean Journal of Applied Statistics
    • /
    • v.31 no.4
    • /
    • pp.463-473
    • /
    • 2018
  • Variable selection algorithms for linear regression models of large data are considered. Many algorithms have been proposed that focus on speed and robustness. Among them, variance inflation factor (VIF) regression is fast and accurate because it uses a streamwise regression approach. However, VIF regression is susceptible to outliers because it estimates the model by the least-squares method. A robust criterion using a weighted estimator has been proposed to make the algorithm robust; a robust VIF regression has also been proposed for the same purpose. In this article, a fast and robust variable selection method is suggested that applies VIF regression after detecting and removing potential outliers. A simulation study and an analysis of a dataset are conducted to compare the suggested method with other methods.

A study on semi-supervised kernel ridge regression estimation (준지도 커널능형회귀모형에 관한 연구)

  • Seok, Kyungha
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.2
    • /
    • pp.341-353
    • /
    • 2013
  • In many practical machine learning and data mining applications, unlabeled data are inexpensive and easy to obtain. Semi-supervised learning tries to use such data to improve prediction performance. In this paper, a semi-supervised regression method, semi-supervised kernel ridge regression estimation, is proposed on the basis of the kernel ridge regression model. The proposed method does not require a pilot estimate of the labels of the unlabeled data. As a result, the proposed method has several advantages, including fewer parameters, easy computation, and good generalization ability. Experiments show that the proposed method can effectively utilize unlabeled data to improve regression estimation.

Development and Evaluation of Simple Regression Model and Multiple Regression Model for TOC Contentation Estimation in Stream Flow (하천수내 TOC 농도 추정을 위한 단순회귀모형과 다중회귀모형의 개발과 평가)

  • Jung, Jaewoon;Cho, Sohyun;Choi, Jinhee;Kim, Kapsoon;Jung, Soojung;Lim, Byungjin
    • Journal of Korean Society on Water Environment
    • /
    • v.29 no.5
    • /
    • pp.625-629
    • /
    • 2013
  • The objective of this study is to develop and evaluate simple and multiple regression models for estimating Total Organic Carbon (TOC) concentration in stream flow. The models were developed with 2012 water quality data and evaluated with 2011 water quality data, both from the downstream part of the Yeongsan river basin, and a correlation analysis between TOC and water quality parameters was conducted. TOC concentrations were positively correlated with Chemical Oxygen Demand (COD), Biochemical Oxygen Demand (BOD), Total Nitrogen (TN), Water Temperature (WT), and Electric Conductivity (EC). From these results, simple and multiple regression models for TOC estimation were developed as follows: $TOC = 0.5809 \times BOD + 3.1557$, $TOC = 0.4365 \times COD + 1.3731$. In the application evaluation of the developed regression models, the multiple regression model was found to estimate TOC better than the simple regression models.
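The two simple regression equations quoted in the abstract above can be evaluated directly. The following is a minimal sketch, not code from the study: the function names and the sample input values are hypothetical, and the abstract does not state the units, so inputs are assumed to be in the same units as the data used to fit the equations.

```python
def toc_from_bod(bod):
    """TOC estimated from BOD using the abstract's equation
    TOC = 0.5809 * BOD + 3.1557 (function name is hypothetical)."""
    return 0.5809 * bod + 3.1557


def toc_from_cod(cod):
    """TOC estimated from COD using the abstract's equation
    TOC = 0.4365 * COD + 1.3731 (function name is hypothetical)."""
    return 0.4365 * cod + 1.3731


if __name__ == "__main__":
    # Hypothetical sample inputs; these are not measurements from the study.
    sample_bod = 2.0
    sample_cod = 5.0
    print("TOC from BOD:", round(toc_from_bod(sample_bod), 4))
    print("TOC from COD:", round(toc_from_cod(sample_cod), 4))
```

The multiple regression model that the abstract reports as the better estimator is not given in the abstract, so it is not reproduced here.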

Symbolic regression based on parallel Genetic Programming (병렬 유전자 프로그래밍을 이용한 Symbolic Regression)

  • Kim, Chansoo;Han, Keunhee
    • Journal of Digital Convergence
    • /
    • v.18 no.12
    • /
    • pp.481-488
    • /
    • 2020
  • Symbolic regression is a regression analysis method that directly generates a function explaining the relationship between the dependent and independent variables for given data. Genetic programming is the leading technique in this field. Compared with other regression analysis algorithms, which optimize the parameters of a fixed model, it has the advantage of directly deriving an interpretable model. In this study, we propose a symbolic regression algorithm using parallel genetic programming based on a coarse-grained parallel model, and apply the proposed algorithm to PMLB data to analyze its effectiveness.

Development of Medical Cost Prediction Model Based on the Machine Learning Algorithm (머신러닝 알고리즘 기반의 의료비 예측 모델 개발)

  • Han Bi KIM;Dong Hoon HAN
    • Journal of Korea Artificial Intelligence Association
    • /
    • v.1 no.1
    • /
    • pp.11-16
    • /
    • 2023
  • Accurate hospital case modeling and prediction are crucial for efficient healthcare. In this study, we demonstrate the implementation of regression analysis methods in machine learning systems using mathematical statistics and machine learning techniques. The developed machine learning models include Bayesian linear, artificial neural network, decision tree, decision forest, and linear regression analysis models. Through the application of these algorithms, corresponding regression models were constructed and analyzed. The results suggest the potential of leveraging machine learning systems for medical research. The experiment aimed to create an Azure Machine Learning Studio tool for the speedy evaluation of multiple regression models. The tool facilitates the comparison of 5 types of regression models in a unified experiment and presents assessment results with performance metrics. Evaluation of the regression machine learning models highlighted the advantages of boosted decision tree regression and decision forest regression in hospital case prediction. These findings could lay the groundwork for the deliberate development of new directions in medical data processing and decision making. Furthermore, potential avenues for future research may include exploring methods such as clustering, classification, and anomaly detection in healthcare systems.

Regression Analysis Between Specific Sediments of Reservoirs and Physiographic Factors of Watersheds (유역의 지상적 요인과 저수지 비퇴사량과의 관계분석)

  • 서승덕;박흥익;천만복;윤경덕
    • Magazine of the Korean Society of Agricultural Engineers
    • /
    • v.30 no.4
    • /
    • pp.45-61
    • /
    • 1988
  • The purpose of this study is to develop regression equations between the annual specific sediment of reservoirs and the physiographic factors of watersheds. The analysis uses 122 irrigation reservoirs in Korea, excluding Cheju province, each with an irrigation area of 200 ha or more. Simple regression analyses between the annual specific sediment and each of the physical characteristic factors of the reservoirs are carried out first. Multiple regression analyses are then made between the annual specific sediment and the physical characteristic factors that had high correlation coefficients in the simple regression analyses. The results obtained from this study are as follows:
    1. The simple regression analyses show that, in each province, the watershed area, the length of the main stream, and the circumferential length of the watershed have high correlation coefficients (R=0.814-0.986), and that drainage density, reservoir capacity per watershed area, drainage frequency, and basin relief have low correlation coefficients (R=0.387-0.955).
    2. The proposed multiple regression equations between the annual specific sediment of reservoirs and three major characteristic factors of watersheds, namely the watershed area, the circumferential length of the watershed, and the length of the main stream, are given in Table 2.
    3. With respect to reservoir elevation, and excluding Jeonnam province, whose characteristics differ greatly from those of the other provinces, the simple regression analyses show that watershed area, main stream length, and circumferential length have high correlation coefficients (R=0.806-0.884) in low-elevation and intermediate-elevation reservoirs, but low correlation coefficients (R=0.639-0.739) in high-elevation reservoirs.
    4. With respect to reservoir elevation, the proposed multiple regression equations between the annual specific sediment of reservoirs and the three major characteristic factors of watersheds with high correlation coefficients are given in Table 5.


A Technique to Improve the Fit of Linear Regression Models for Successive Sets of Data

  • Park, Sung H.
    • Journal of the Korean Statistical Society
    • /
    • v.5 no.1
    • /
    • pp.19-28
    • /
    • 1976
  • In empirical studies that fit a multiple linear regression model to successive cross-sectional data observed on the same set of independent variables over several time periods, one often faces the problem of a poor $R^2$, the multiple coefficient of determination, which provides a standard measure of how well a specified regression line fits the sample data.
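For readers unfamiliar with the $R^2$ mentioned in the abstract above, it is conventionally computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of that standard definition, not the technique proposed in the paper; the function name `r_squared` and the sample numbers are hypothetical.

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot.

    y_obs: observed responses; y_fit: fitted values from any regression model.
    """
    n = len(y_obs)
    if n == 0 or n != len(y_fit):
        raise ValueError("y_obs and y_fit must be non-empty and of equal length")
    mean_y = sum(y_obs) / n
    # Residual sum of squares: variation the fitted model leaves unexplained.
    ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))
    # Total sum of squares: variation of the observations about their mean.
    ss_tot = sum((o - mean_y) ** 2 for o in y_obs)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    observed = [1.0, 2.0, 3.0, 4.0]
    fitted = [1.1, 1.9, 3.2, 3.8]
    print("R^2:", r_squared(observed, fitted))
```

A value near 1 means the fitted values track the observations closely; a value near 0 means the model explains little more than the sample mean does.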


A Study on Deduction of Reasonable Inspection Time in Educational Facilities Based on Regression Analysis (회귀분석을 이용한 합리적인 교육시설물 점검시간의 산정 기준 도출에 관한 연구)

  • Seok, Hyun-Su;Cho, Chang-Yeon;Kim, Jae-On;Son, Jae-Ho
    • Journal of the Korean Institute of Educational Facilities
    • /
    • v.14 no.4
    • /
    • pp.34-42
    • /
    • 2007
  • Investment in BTL (Build-Transfer-Lease) projects for educational facilities has increased since 2005. However, much trial and error and many problems have occurred because of the lack of inspection criteria for maintenance cost over the life cycle of the facilities. Thus, the purpose of this study is to derive regression equations for daily inspection and regular inspection through analysis of CAD data in order to set the inspection criteria. Using the inspection-time criteria obtained from the regression equations developed in this study, an economical maintenance cost can be calculated for BTL projects in educational facilities.

Understanding of Rett Syndrome (레트 증후군의 이해)

  • Ro, Hyo-Lyun
    • Journal of the Korean Society of Physical Medicine
    • /
    • v.2 no.1
    • /
    • pp.85-91
    • /
    • 2007
  • Purpose: The purpose of this study is to provide an understanding of Rett syndrome. Rett syndrome is a common developmental-neurologic disorder that has been reported almost exclusively in females. Recently, mutations in the gene encoding X-linked methyl-CpG binding protein 2 (MECP2) have been identified as the cause of Rett syndrome. Consistent with the diagnostic criteria, hand skills, verbal or non-verbal communication skills, and common motor skills were lost during regression. Regression most commonly occurred between 12 and 18 months of age. Methods: This is a literature study based on books, articles, and the web site of the international Rett syndrome association. Results: There is a continuing need to further elucidate the pre- and post-regression features of Rett syndrome. Patients with Rett syndrome need physical therapy, music therapy, special education, and medical interventions. Conclusion: There is not yet a therapeutic method that addresses the root cause of Rett syndrome, but our goals are relief of symptoms and the study of Rett syndrome by physical therapists.


Comparison Study for Data Fusion and Clustering Classification Performances (다구찌 디자인을 이용한 데이터 퓨전 및 군집분석 분류 성능 비교)

  • 신형원;손소영
    • Proceedings of the Korean Operations and Management Science Society Conference
    • /
    • 2000.04a
    • /
    • pp.601-604
    • /
    • 2000
  • In this paper, we compare the classification performance of data fusion and clustering algorithms (Data Bagging, Variable Selection Bagging, Parameter Combining, Clustering) with that of logistic regression, taking into account various characteristics of the input data. The four factors used to simulate the logistic model are (1) correlation among input variables, (2) variance of observations, (3) training data size, and (4) the input-output function. Since the relationship between input and output is typically not known, we use a Taguchi design and treat that relationship as a noise factor to improve the practicality of our results. The experimental results indicate the following. Clustering-based logistic regression provides the highest classification accuracy when input variables are weakly correlated and the variance of the data is high. When there is high correlation among input variables, variable bagging performs better than logistic regression. When there is strong correlation among input variables and high variance between observations, bagging appears to be marginally better than logistic regression, but the difference was not significant.
