• Title/Summary/Keyword: Linear Data Analysis

Data Visualization using Linear and Non-linear Dimensionality Reduction Methods

  • Kim, Junsuk;Youn, Joosang
    • Journal of the Korea Society of Computer and Information / v.23 no.12 / pp.21-26 / 2018
  • As large amounts of data can now be stored efficiently, methods for extracting meaningful features from big data have become important. In particular, techniques for converting high-dimensional data into low-dimensional representations are crucial for data visualization. In this study, principal component analysis (PCA; a linear dimensionality reduction technique) and Isomap (a non-linear dimensionality reduction technique) are introduced and applied to neural big data obtained by functional magnetic resonance imaging (fMRI). First, we investigate how well the physical properties of the stimuli are preserved after the dimensionality reduction process. We then compare the residual variance of the two methods to quantify the amount of information left unexplained. As a result, dimensionality reduction using Isomap retains more information than principal component analysis. Our results demonstrate that it is necessary to consider not only linear but also non-linear characteristics in big data analysis.
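
A minimal sketch of this kind of PCA-versus-Isomap comparison with scikit-learn is given below. A synthetic swiss-roll data set stands in for the fMRI features, and the residual-variance measure is a simplified Euclidean-distance variant, not necessarily the exact definition used in the paper.

```python
# Sketch: compare residual variance after PCA (linear) and Isomap (non-linear)
# dimensionality reduction on a synthetic non-linear manifold.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.stats import pearsonr
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.metrics import pairwise_distances

X, _ = make_swiss_roll(n_samples=800, random_state=0)  # stand-in for the fMRI features

def residual_variance(X_high, X_low):
    """1 - R^2 between pairwise distances before and after reduction
    (simplified variant using Euclidean distances in both spaces)."""
    d_high = squareform(pairwise_distances(X_high), checks=False)
    d_low = squareform(pairwise_distances(X_low), checks=False)
    r, _ = pearsonr(d_high, d_low)
    return 1.0 - r ** 2

for k in (1, 2):
    X_pca = PCA(n_components=k).fit_transform(X)
    X_iso = Isomap(n_components=k, n_neighbors=10).fit_transform(X)
    print(f"dims={k}: residual variance PCA={residual_variance(X, X_pca):.3f}, "
          f"Isomap={residual_variance(X, X_iso):.3f}")
```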

Complex Segregation Analysis of Categorical Traits in Farm Animals: Comparison of Linear and Threshold Models

  • Kadarmideen, Haja N.;Ilahi, H.
    • Asian-Australasian Journal of Animal Sciences / v.18 no.8 / pp.1088-1097 / 2005
  • The main objectives of this study were to investigate the accuracy, bias and power of linear and threshold model segregation analysis methods for detecting major genes affecting categorical traits in farm animals. Maximum Likelihood Linear Model (MLLM), Bayesian Linear Model (BALM) and Bayesian Threshold Model (BATM) methods were applied to simulated data on normal, categorical and binary scales, as well as to disease data in pigs. Simulated data on the underlying normally distributed liability (NDL) were used to create the categorical and binary data. The MLLM method was applied to data on all scales (normal, categorical and binary), and the BATM method was developed and applied only to binary data. The MLLM analyses underestimated parameters for binary as well as categorical traits compared to normal traits, with the bias being very severe for binary traits. The accuracy of major gene and polygene parameter estimates was also very low for binary data compared with categorical data; the latter gave results similar to normal data. When disease incidence (on the binary scale) is close to 50%, segregation analysis has higher accuracy and less bias than for diseases with rare incidence. NDL data were always better than categorical data. Under the MLLM method, the test statistics for categorical and binary data were consistently and unusually high (while the opposite is expected because of the loss of information in categorical data), indicating high false discovery rates of major genes if linear models are applied to categorical traits. With Bayesian segregation analysis, the 95% highest probability density regions of the major gene variances were checked for whether they included zero (a boundary parameter); by the nature of this difference between likelihood and Bayesian approaches, the Bayesian methods are likely to be more reliable for categorical data. The BATM segregation analysis of binary data also showed a significant advantage over MLLM in terms of higher accuracy. Based on these results, threshold models are recommended when trait distributions are discontinuous. Further, segregation analysis could be used in an initial scan of the data for evidence of major genes before embarking on molecular genome mapping.
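
The basic linear-versus-threshold contrast on binary data can be illustrated with a small simulation. The sketch below is only that: it fits an ordinary linear model and a probit (threshold) model to binary data generated from a normally distributed liability. It is not the paper's Bayesian segregation analysis with major-gene and polygenic components, and the covariate and effect size are invented.

```python
# Sketch: binary trait generated by thresholding an underlying normal liability;
# a linear model on the observed 0/1 scale versus a probit (threshold) model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                 # stand-in covariate (e.g., a gene-dosage score)
beta = 0.8                             # true effect on the liability scale
liability = beta * x + rng.normal(size=n)
y = (liability > 0).astype(float)      # binary trait, incidence near 50%

X = sm.add_constant(x)
linear = sm.OLS(y, X).fit()            # linear model applied to the binary scale
probit = sm.Probit(y, X).fit(disp=0)   # threshold model on the liability scale

print("true liability-scale effect:", beta)
print("linear-model estimate      :", round(linear.params[1], 3))
print("probit (threshold) estimate:", round(probit.params[1], 3))
```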

Datawise Discriminant Analysis For Feature Extraction (자료별 분류분석(DDA)에 의한 특징추출)

  • Park, Myoung-Soo;Choi, Jin-Young
    • Journal of the Korean Institute of Intelligent Systems / v.19 no.1 / pp.90-95 / 2009
  • This paper presents a new feature extraction algorithm that addresses the problems of linear discriminant analysis, which is widely used for linear dimensionality reduction. The scatter matrices used in linear discriminant analysis are defined by the distances between each datum and its class mean, and between the class means and the mean of the whole data set. Using these scatter matrices can cause computational problems and limits the number of extractable features. In addition, their definition assumes that the data distribution is unimodal and normal, so appropriate features are not obtained when this assumption is violated. In this paper we define a new scatter matrix based on differently weighted distances between individual data points, and present a feature extraction algorithm using this scatter matrix. With this new method, the aforementioned problems of linear discriminant analysis can be avoided, and features appropriate for discriminating the data can be obtained. The performance of the new method is demonstrated by experiments.
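
The abstract does not give the exact weighting scheme, so the sketch below only illustrates the general idea: build scatter matrices from pairwise differences between individual data points and extract features from a generalized eigenproblem. The equal weights and the synthetic data are assumptions, not the paper's algorithm.

```python
# Sketch: pairwise-difference scatter matrices and feature extraction via a
# generalized eigenproblem (illustrative weights, not the paper's DDA weights).
import numpy as np
from scipy.linalg import eigh

def pairwise_scatter(X, y):
    n, d = X.shape
    S_within = np.zeros((d, d))    # scatter built from same-class pairs
    S_between = np.zeros((d, d))   # scatter built from different-class pairs
    for i in range(n):
        for j in range(i + 1, n):
            diff = (X[i] - X[j])[:, None]
            if y[i] == y[j]:
                S_within += diff @ diff.T
            else:
                S_between += diff @ diff.T
    return S_within, S_between

def fit_features(X, y, n_features=2):
    Sw, Sb = pairwise_scatter(X, y)
    # Generalized eigenproblem Sb v = lambda Sw v; keep the top eigenvectors.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(X.shape[1]))
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_features]]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = fit_features(X, y)   # projection matrix for feature extraction
Z = X @ W                # extracted discriminative features
print(Z.shape)
```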

An Approach to Applying Multiple Linear Regression Models by Interlacing Data in Classifying Similar Software

  • Lim, Hyun-il
    • Journal of Information Processing Systems / v.18 no.2 / pp.268-281 / 2022
  • The development of information technology is bringing many changes to everyday life, and machine learning can be used as a technique to solve a wide range of real-world problems. Analysis and utilization of data are essential processes in applying machine learning to such problems. As a method of processing data in machine learning, we propose an approach based on applying multiple linear regression models, trained on interlaced data, to the task of classifying similar software. Linear regression is widely used in estimation problems to model the relationship between input and output data. In our approach, multiple linear regression models are generated by training on interlaced feature data, and a combination of these models is then used as the prediction model for classifying similar software. Experiments are performed to evaluate the proposed approach against conventional linear regression, and the results show that the proposed method classifies similar software more accurately than the conventional model. We anticipate that the proposed approach can be applied to various kinds of classification problems to improve the accuracy of conventional linear regression.
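
A hedged illustration of the general idea follows: several linear regression models are trained on interlaced subsets of the features, and their outputs are averaged and thresholded for a binary similarity decision. The interlacing by feature index, the combination rule, and the synthetic data are assumptions; the paper's actual scheme may differ.

```python
# Sketch: multiple linear regression models trained on interlaced feature
# subsets, combined by averaging for a binary classification decision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_models = 4
models, columns = [], []
for k in range(n_models):
    cols = np.arange(k, X.shape[1], n_models)   # interlaced feature indices
    models.append(LinearRegression().fit(X_tr[:, cols], y_tr))
    columns.append(cols)

# Combine the regressions by averaging their outputs, then threshold at 0.5.
scores = np.mean([m.predict(X_te[:, c]) for m, c in zip(models, columns)], axis=0)
y_hat = (scores >= 0.5).astype(int)
print("accuracy of combined interlaced models:", accuracy_score(y_te, y_hat))
```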

Efficiency of Aggregate Data in Non-linear Regression

  • Huh, Jib
    • Communications for Statistical Applications and Methods / v.8 no.2 / pp.327-336 / 2001
  • This work concerns estimating a non-linear regression function using aggregate data. In much empirical research, data are aggregated for various reasons before statistical analysis. In a traditional parametric approach, linear estimation of a non-linear function with aggregate data can result in unstable parameter estimates; a more serious consequence is bias in the estimation of the non-linear function. The approach we employ is kernel regression smoothing. We describe the conditions under which aggregate data can be used to estimate the regression function efficiently. Numerical examples illustrate our findings.
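
A small Nadaraya-Watson sketch of the setting follows: individual observations of a non-linear regression function are aggregated into group means before smoothing. The binning scheme, bandwidths, and data are invented for illustration; the paper's efficiency conditions are not reproduced here.

```python
# Sketch: kernel regression smoothing on raw data versus on aggregated
# (group-mean) data for a non-linear regression function.
import numpy as np

def nw_smoother(x_grid, x, y, h):
    """Nadaraya-Watson estimator with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # non-linear signal + noise

# Aggregate: group observations into 20 bins and keep only the group means.
bins = np.linspace(0, 1, 21)
idx = np.digitize(x, bins) - 1
x_agg = np.array([x[idx == k].mean() for k in range(20)])
y_agg = np.array([y[idx == k].mean() for k in range(20)])

grid = np.linspace(0.05, 0.95, 50)
fit_raw = nw_smoother(grid, x, y, h=0.05)
fit_agg = nw_smoother(grid, x_agg, y_agg, h=0.1)
print("max abs difference between the two fits:",
      round(float(np.max(np.abs(fit_raw - fit_agg))), 3))
```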

Objective Cloud Type Classification of Meteorological Satellite Data Using Linear Discriminant Analysis (선형판별법에 의한 GMS 영상의 객관적 운형분류)

  • 서애숙;김금란
    • Korean Journal of Remote Sensing / v.6 no.1 / pp.11-24 / 1990
  • This study concerns the objective classification of meteorological satellite cloud imagery. Linear discriminant analysis was applied for objective cloud classification: 27 cloud characteristic parameters were retrieved from GMS infrared image data, and a linear cloud classification model was developed from the major parameters and cloud-type coefficients. The model was applied to GMS IR imagery for operational weather forecasting, and the cloud images were classified into five types: Sc, Cu, CiT, CiM and Cb. The classification results compared reasonably well with the actual imagery.
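
A minimal sketch of the classification step with scikit-learn follows. Random feature vectors stand in for the 27 cloud characteristic parameters retrieved from GMS infrared imagery, so only the shape of the procedure is shown, not the paper's coefficients.

```python
# Sketch: linear discriminant analysis assigning one of five cloud types
# to 27-dimensional feature vectors (synthetic stand-ins for GMS parameters).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

cloud_types = ["Sc", "Cu", "CiT", "CiM", "Cb"]
rng = np.random.default_rng(0)

# Synthetic training set: 200 pixels per type, 27 features each, with
# type-specific means so the classes are separable.
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(200, 27)) for i in range(5)])
y = np.repeat(np.arange(5), 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
new_pixels = rng.normal(loc=3, scale=1.0, size=(4, 27))   # unseen pixels
print([cloud_types[k] for k in lda.predict(new_pixels)])
```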

Prediction of New Confirmed Cases of COVID-19 based on Multiple Linear Regression and Random Forest (다중 선형 회귀와 랜덤 포레스트 기반의 코로나19 신규 확진자 예측)

  • Kim, Jun Su;Choi, Byung-Jae
    • IEMEK Journal of Embedded Systems and Applications / v.17 no.4 / pp.249-255 / 2022
  • The COVID-19 virus appeared in 2019 and is extremely contagious, with a huge impact on people's mobility. In this paper, multiple linear regression and random forest models are used to predict the number of new COVID-19 cases using COVID-19 infection status data (open-source data provided by the Ministry of Health and Welfare) and Google Mobility data, which track mobility across various categories. The data were divided into two sets. The first dataset consists of the COVID-19 infection status data and all six Google Mobility variables. The second dataset consists of the COVID-19 infection status data and only two Google Mobility variables: (1) retail stores and leisure facilities and (2) grocery stores and pharmacies. Model performance was compared using the mean absolute error. We also performed a correlation analysis of the random forest and multiple linear regression models.
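
The modelling step can be sketched with scikit-learn as follows: fit multiple linear regression and a random forest to the same predictors and compare them by mean absolute error. The synthetic data below stand in for the Ministry of Health and Welfare case counts and the Google Mobility variables used in the paper.

```python
# Sketch: multiple linear regression versus random forest, compared by MAE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
mobility = rng.normal(size=(n, 6))                       # six mobility categories
cases = 100 + (mobility @ rng.normal(size=6)) * 20 + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(mobility, cases, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("MAE, multiple linear regression:",
      round(mean_absolute_error(y_te, lin.predict(X_te)), 2))
print("MAE, random forest             :",
      round(mean_absolute_error(y_te, rf.predict(X_te)), 2))
```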

Generalized linear models versus data transformation for the analysis of taguchi experiment (다구찌 실험분석에 있어서 일반화선형모형 대 자료변환)

  • 이영조
    • The Korean Journal of Applied Statistics / v.6 no.2 / pp.253-263 / 1993
  • Recent interest in Taguchi's methods has led to developments in the joint modelling of the mean and dispersion in generalized linear models. Since a single data transformation cannot produce all the conditions necessary for an analysis, the use of generalized linear models is preferred to the commonly used data transformation method for analysing Taguchi data. In this paper we illustrate this point and provide GLIM macros that implement the joint modelling of the mean and dispersion in generalized linear models.
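
The paper's implementation is in GLIM macros. As a rough illustration of joint mean-dispersion modelling in the GLM framework, the sketch below runs a simple "double GLM" iteration in Python, alternating a weighted fit for the mean with a gamma GLM for the dispersion. The simulated design and the iteration details are assumptions, not a port of the macros.

```python
# Sketch: alternate a weighted least-squares fit for the mean with a gamma
# GLM (log link) on the squared residuals for the dispersion.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))      # mean-model design
Z = X.copy()                                      # dispersion-model design
sigma2 = np.exp(Z @ np.array([-1.0, 0.5, 0.0]))   # true dispersion varies with Z
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n) * np.sqrt(sigma2)

weights = np.ones(n)
for _ in range(5):                                # alternate mean and dispersion fits
    mean_fit = sm.WLS(y, X, weights=weights).fit()
    d = mean_fit.resid ** 2                       # squared residuals as dispersion response
    disp_fit = sm.GLM(d, Z, family=sm.families.Gamma(sm.families.links.Log())).fit()
    weights = 1.0 / disp_fit.fittedvalues         # update weights for the mean model

print("mean model coefficients      :", np.round(mean_fit.params, 2))
print("dispersion model coefficients:", np.round(disp_fit.params, 2))
```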

Robustness of model averaging methods for the violation of standard linear regression assumptions

  • Lee, Yongsu;Song, Juwon
    • Communications for Statistical Applications and Methods / v.28 no.2 / pp.189-204 / 2021
  • In a regression analysis, a single best model is usually selected among several candidate models. However, it is often useful to combine several candidate models to achieve better performance, especially from a prediction viewpoint. Model combining methods such as stacking and Bayesian model averaging (BMA) have been suggested from the perspective of averaging candidate models. When the candidate models include the true model, BMA is expected to give better performance than stacking; when they do not, stacking is known to outperform BMA. Since stacking and BMA have different properties, it is difficult to determine which method is more appropriate in other situations. In particular, it is not easy to find research that compares stacking and BMA when regression model assumptions are violated. Therefore, in this paper, we compare the performance of model averaging methods and of a single best model in linear regression analysis when the standard linear regression assumptions are violated. Simulations were conducted to compare the model averaging methods with linear regression when the data do and do not include outliers, and when the errors come from a non-normal distribution. The model averaging methods were also applied to water pollution data, which have strong multicollinearity among the variables. The simulation studies showed that the stacking method tends to give better performance than BMA or standard linear regression analysis (including stepwise selection) in terms of risk (see (3.1)) or prediction error (see (3.2)) when typical linear regression assumptions are violated.
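
A minimal sketch of one of these comparisons follows: stacking several linear-type regressors versus a single linear regression on data contaminated with outliers, judged by prediction error. BMA, stepwise selection, non-normal errors, and the water pollution data are omitted, and the data and base learners are invented.

```python
# Sketch: stacking versus a single OLS fit when the data contain outliers.
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1, n)
outliers = rng.choice(n, size=25, replace=False)
y[outliers] += rng.normal(0, 15, outliers.size)          # inject outliers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = LinearRegression().fit(X_tr, y_tr)
stack = StackingRegressor(
    estimators=[("ols", LinearRegression()),
                ("ridge", Ridge(alpha=1.0)),
                ("lasso", Lasso(alpha=0.1))],
    final_estimator=LinearRegression(),
).fit(X_tr, y_tr)

print("prediction error, single OLS:",
      round(mean_squared_error(y_te, single.predict(X_te)), 2))
print("prediction error, stacking  :",
      round(mean_squared_error(y_te, stack.predict(X_te)), 2))
```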

Generalized Linear Models for the Analysis of Data from the Quality-Improvement Experiments (일반화 선형모형을 통한 품질개선실험 자료분석)

  • Lee, Youngjo;Lim, Yong Bin
    • Journal of Korean Society for Quality Management / v.24 no.2 / pp.128-141 / 1996
  • The advent of the quality-improvement movement caused a great expansion in the use of statistically designed experiments in industry. The regression method is often used for the analysis of data from such experiments. However, the data for a quality characteristic often take the form of counts or ratios of counts, e.g., the fraction of defectives. For such data, an analysis using generalized linear models is preferred to one using the simple regression model. In this paper we introduce the generalized linear model and show how it can be used to analyse non-normal data from quality-improvement experiments.
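
A brief sketch of the recommended analysis follows: a binomial GLM with the default logit link fitted to fraction-defective data from a designed experiment, using statsmodels. The two-factor layout and the counts are invented for illustration, not taken from the paper.

```python
# Sketch: binomial GLM for defect counts from a two-factor designed experiment.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "temp":    [-1, -1, 1, 1, -1, -1, 1, 1],    # coded factor levels
    "press":   [-1, 1, -1, 1, -1, 1, -1, 1],
    "defects": [12, 8, 30, 20, 10, 9, 28, 24],  # defective units per run
    "n":       [100] * 8,                       # units inspected per run
})
df["good"] = df["n"] - df["defects"]

# Binomial GLM: the response is the (defective, non-defective) count pair.
endog = df[["defects", "good"]]
exog = sm.add_constant(df[["temp", "press"]])
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.summary())
```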
