• Title/Summary/Keyword: Data sparsity

Comparison of Lasso Type Estimators for High-Dimensional Data

  • Kim, Jaehee
    • Communications for Statistical Applications and Methods
    • /
    • v.21 no.4
    • /
    • pp.349-361
    • /
    • 2014
  • This paper compares lasso type estimators in various high-dimensional data situations with sparse parameters. Lasso, adaptive lasso, fused lasso and elastic net, as lasso type estimators, and the ridge estimator are compared via simulation in linear models with correlated and uncorrelated covariates and in binary regression models with correlated covariates and discrete covariates. Each method is shown to have advantages under different penalty conditions, depending on the sparsity patterns of the regression parameters. We applied the lasso type methods to Arabidopsis microarray gene expression data to find the strongly significant genes that distinguish two groups.
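
As a rough illustration of why lasso type penalties yield sparse estimates while ridge does not, the sketch below uses the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the least-squares coefficients. It is not the simulation setup of the paper; the function names and the example coefficients are hypothetical.

```python
def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution when X'X = I: soft-threshold each OLS coefficient.

    Minimizes 0.5 * ||y - X b||^2 + lam * ||b||_1 under an orthonormal design.
    """
    out = []
    for b in ols_coefs:
        shrunk = max(abs(b) - lam, 0.0)
        sign = 1.0 if b > 0 else -1.0
        # Coefficients with |b| <= lam are set exactly to zero.
        out.append(sign * shrunk if shrunk > 0 else 0.0)
    return out


def ridge_orthonormal(ols_coefs, lam):
    """Ridge solution when X'X = I: scale every OLS coefficient by 1 / (1 + lam).

    Minimizes 0.5 * ||y - X b||^2 + 0.5 * lam * ||b||_2^2.
    """
    return [b / (1.0 + lam) for b in ols_coefs]


# Hypothetical OLS estimates: two strong signals and three near-zero ones.
ols = [3.0, -2.0, 0.3, -0.1, 0.05]
lasso = lasso_orthonormal(ols, lam=0.5)
ridge = ridge_orthonormal(ols, lam=0.5)

# Count exact zeros: lasso removes the small coefficients, ridge only shrinks them.
n_zero_lasso = sum(1 for b in lasso if b == 0.0)
n_zero_ridge = sum(1 for b in ridge if b == 0.0)
```

In this toy case the lasso keeps only the two large coefficients, whereas every ridge coefficient stays nonzero.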

Dense Sub-Cube Extraction Algorithm for a Multidimensional Large Sparse Data Cube (다차원 대용량 저밀도 데이타 큐브에 대한 고밀도 서브 큐브 추출 알고리즘)

  • Lee Seok-Lyong;Chun Seok-Ju;Chung Chin-Wan
    • Journal of KIISE:Databases
    • /
    • v.33 no.4
    • /
    • pp.353-362
    • /
    • 2006
  • A data warehouse is a data repository that enables users to store large volumes of data and to analyze them effectively. In this research, we investigate an algorithm to establish a multidimensional data cube, which is a powerful analysis tool for the contents of data warehouses and databases. A multidimensional data cube incurs an inevitable retrieval overhead due to the sparsity of the cube. In this paper, we propose a dense sub-cube extraction algorithm that identifies dense regions in a large sparse data cube and constructs sub-cubes based on the dense regions found. It reduces the retrieval overhead remarkably by retrieving those small dense sub-cubes instead of scanning a large sparse cube. The algorithm utilizes bitmap and histogram based techniques to extract dense sub-cubes from the data cube, and its effectiveness is demonstrated via an experiment.

Missing Data Modeling based on Matrix Factorization of Implicit Feedback Dataset (암시적 피드백 데이터의 행렬 분해 기반 누락 데이터 모델링)

  • Ji, JiaQi;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.23 no.5
    • /
    • pp.495-507
    • /
    • 2019
  • Data sparsity is one of the main challenges for recommender systems. A recommender system contains massive data, of which only a small part is observed and the rest is missing. Most studies assume that missing data are missing at random from the dataset. Therefore, they use only observed data to train the recommendation model and then recommend items to users. In practice, however, missing data are not lost at random. In our research, we treat these missing data as negative examples of users' interest. Three sampling methods are seamlessly integrated into the SVD++ algorithm, yielding the proposed SVD++_W, SVD++_R and SVD++_KNN algorithms. Experimental results show that the proposed sampling methods effectively improve the precision of Top-N recommendation over the baseline algorithms. Among the three improved algorithms, SVD++_KNN has the best performance, which shows that the KNN sampling method is a more effective way to extract negative examples of users' interest.
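
The idea of treating unobserved user-item pairs as negative examples can be sketched as follows. The abstract does not describe how the three sampling methods work, so this sketch uses plain uniform sampling as a stand-in; it is not a reproduction of any of the paper's methods. The function name `sample_negatives` and the toy data are hypothetical.

```python
import random


def sample_negatives(observed, all_items, n_per_user, seed=0):
    """Draw unobserved items for each user to serve as negative examples.

    Assumption: negatives are sampled uniformly from the items a user has
    not interacted with. This is a generic baseline, not the paper's method.
    """
    rng = random.Random(seed)
    negatives = {}
    for user, items in observed.items():
        # Candidate negatives are all items missing from the user's history.
        candidates = sorted(set(all_items) - set(items))
        k = min(n_per_user, len(candidates))
        negatives[user] = rng.sample(candidates, k)
    return negatives


# Toy implicit-feedback data: each user maps to the items they interacted with.
observed = {"u1": {"a", "b"}, "u2": {"c"}}
all_items = ["a", "b", "c", "d", "e"]
negatives = sample_negatives(observed, all_items, n_per_user=2)

# Training triples: label 1 for observed pairs, label 0 for sampled negatives.
training = [(u, i, 1) for u, items in observed.items() for i in sorted(items)]
training += [(u, i, 0) for u, items in negatives.items() for i in items]
```

The resulting triples could then be fed to any matrix factorization trainer in place of the observed-only data.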

MP-Lasso chart: a multi-level polar chart for visualizing group Lasso analysis of genomic data

  • Min Song;Minhyuk Lee;Taesung Park;Mira Park
    • Genomics & Informatics
    • /
    • v.20 no.4
    • /
    • pp.48.1-48.7
    • /
    • 2022
  • Penalized regression has been widely used in genome-wide association studies for joint analyses to find genetic associations. Among penalized regression models, the least absolute shrinkage and selection operator (Lasso) method effectively removes some coefficients from the model by shrinking them to zero. To handle group structures, such as genes and pathways, several modified Lasso penalties have been proposed, including group Lasso and sparse group Lasso. Group Lasso ensures sparsity at the level of pre-defined groups, eliminating unimportant groups. Sparse group Lasso performs group selection as in group Lasso, but also performs individual selection as in Lasso. While these sparse methods are useful in high-dimensional genetic studies, interpreting the results with many groups and coefficients is not straightforward. Lasso's results are often expressed as trace plots of regression coefficients. However, few studies have explored the systematic visualization of group information. In this study, we propose a multi-level polar Lasso (MP-Lasso) chart, which can effectively represent the results from group Lasso and sparse group Lasso analyses. An R package to draw MP-Lasso charts was developed. Through a real-world genetic data application, we demonstrated that our MP-Lasso chart package effectively visualizes the results of Lasso, group Lasso, and sparse group Lasso.
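
To illustrate how group Lasso eliminates whole groups at once, the sketch below applies the standard group soft-thresholding operator (the proximal operator of an unweighted Euclidean-norm penalty) to each group of coefficients. It is a minimal illustration under that assumption, unrelated to the R package mentioned above; the group names and coefficient values are hypothetical.

```python
import math


def group_soft_threshold(group_coefs, lam):
    """Shrink a whole coefficient group toward zero based on its Euclidean norm.

    Groups whose norm is at most lam are removed entirely; larger groups are
    scaled down but every member stays nonzero.
    """
    norm = math.sqrt(sum(b * b for b in group_coefs))
    if norm <= lam:
        return [0.0 for _ in group_coefs]
    scale = 1.0 - lam / norm
    return [scale * b for b in group_coefs]


# Hypothetical coefficient groups, e.g. variants grouped by gene.
groups = {"gene_A": [3.0, 4.0], "gene_B": [0.3, -0.4]}
shrunk = {name: group_soft_threshold(coefs, lam=1.0) for name, coefs in groups.items()}

# Groups that survive the penalty (at least one nonzero coefficient).
kept = [name for name, coefs in shrunk.items() if any(b != 0.0 for b in coefs)]
```

Here the weak group is dropped as a unit while the strong group is kept and shrunk, which is the group-level sparsity described in the abstract.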

Temporal Interval Refinement for Point-of-Interest Recommendation (장소 추천을 위한 방문 간격 보정)

  • Kim, Minseok;Lee, Jae-Gil
    • Database Research
    • /
    • v.34 no.3
    • /
    • pp.86-98
    • /
    • 2018
  • Point-of-interest (POI) recommendation systems suggest the most interesting POIs to users considering the current location and time. With the rapid development of smartphones, the internet of things, and location-based social networks, it has become feasible to accumulate huge amounts of user POI visits. Therefore, instant recommendation of interesting POIs at a given time is being widely recognized as important. To increase the performance of POI recommendation systems, several studies have proposed extracting users' sequential POI preference from POI check-in data, which is a form of implicit feedback. However, a model built on sequential preference risks data distortion because the number of observed check-ins is low, a consequence of intensified data sparsity. This paper suggests a refinement of temporal intervals based on data confidence. When building a POI recommendation system that uses temporal intervals to model the sequential POI preference of users, our methodology reduces potential data distortion in the dataset and thus increases the performance of the recommendation system. We verify our model's effectiveness through an evaluation on the Foursquare and Gowalla datasets.

Image Denoising for Metal MRI Exploiting Sparsity and Low Rank Priors

  • Choi, Sangcheon;Park, Jun-Sik;Kim, Hahnsung;Park, Jaeseok
    • Investigative Magnetic Resonance Imaging
    • /
    • v.20 no.4
    • /
    • pp.215-223
    • /
    • 2016
  • Purpose: The management of metal-induced field inhomogeneities is one of the major concerns in obtaining distortion-free magnetic resonance images near metallic implants. The recently proposed method called "Slice Encoding for Metal Artifact Correction (SEMAC)" is an effective spin echo pulse sequence for magnetic resonance imaging (MRI) near metallic implants. However, because SEMAC uses noisy resolved data elements, improving the signal-to-noise ratio (SNR) of SEMAC images without compromising the correction of metal artifacts is a major problem. To address that issue, this paper presents a novel reconstruction technique that improves the SNR of SEMAC images without sacrificing the correction of metal artifacts. Materials and Methods: Low-rank approximation of each coil image is first performed to suppress the noise in the slice direction, because the signal is highly correlated between SEMAC-encoded slices. Secondly, SEMAC images are reconstructed by the best linear unbiased estimator (BLUE), also known as Gauss-Markov or weighted least squares. Noise levels and correlation in the receiver channels are considered for the sake of SNR optimization. To this end, since distorted excitation profiles are sparse, $l_1$ minimization performs well in recovering the sparse distorted excitation profiles, and the sparse modeling of our approach offers excellent correction of metal-induced distortions. Results: Three images were compared: those reconstructed using SEMAC, SEMAC with the conventional two-step noise reduction, and the proposed image denoising algorithm for metal MRI exploiting sparsity and low-rank approximation. The proposed algorithm outperformed the other two methods, producing 119% better SNR than SEMAC and 89% better SNR than SEMAC with the conventional two-step noise reduction. Conclusion: We successfully demonstrated that the proposed novel algorithm for SEMAC, compared with conventional de-noising methods, substantially improves SNR and reduces artifacts.

Stacked Sparse Autoencoder-DeepCNN Model Trained on CICIDS2017 Dataset for Network Intrusion Detection (네트워크 침입 탐지를 위해 CICIDS2017 데이터셋으로 학습한 Stacked Sparse Autoencoder-DeepCNN 모델)

  • Lee, Jong-Hwa;Kim, Jong-Wouk;Choi, Mi-Jung
    • KNOM Review
    • /
    • v.24 no.2
    • /
    • pp.24-34
    • /
    • 2021
  • Service providers using edge computing provide a high level of service. As a result, devices store important information in internal storage and have become a target of the latest cyberattacks, which are more difficult to detect. Although experts use security systems such as intrusion detection systems, the existing intrusion detection systems have low detection accuracy. Therefore, in this paper, we propose a machine learning model for more accurate intrusion detection for devices in edge computing. The proposed model is a hybrid model that combines a stacked sparse autoencoder (SSAE) and a convolutional neural network (CNN) to extract important feature vectors from the input data using sparsity constraints. To find the optimal model, we compared and analyzed the performance while adjusting the sparsity coefficient of the SSAE. As a result, the model showed its highest accuracy of 96.9% when using the sparsity constraints. Therefore, the model performs best when it is trained on only the important features.
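
The abstract does not state which sparsity constraint the SSAE uses. One common formulation for sparse autoencoders penalizes the Kullback-Leibler divergence between a target activation level and the mean activation of each hidden unit; the sketch below shows that penalty as an assumption, not as the paper's model. The names `kl_sparsity_penalty`, `rho`, and the example activations are hypothetical.

```python
import math


def kl_sparsity_penalty(mean_activations, rho):
    """Sum of KL(rho || rho_hat_j) over hidden units.

    mean_activations: average activation of each hidden unit over a batch,
    each strictly between 0 and 1 (e.g. sigmoid outputs).
    rho: target average activation, typically a small value such as 0.05.
    """
    penalty = 0.0
    for rho_hat in mean_activations:
        penalty += rho * math.log(rho / rho_hat)
        penalty += (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat))
    return penalty


# Units that already match the target incur no penalty.
on_target = kl_sparsity_penalty([0.05, 0.05, 0.05], rho=0.05)

# Units that are active far more often than the target are penalized.
too_dense = kl_sparsity_penalty([0.5, 0.6, 0.4], rho=0.05)
```

In training, such a penalty would typically be multiplied by a sparsity coefficient and added to the reconstruction loss, so that tuning the coefficient trades reconstruction quality against sparsity of the hidden code.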

Sparse kernel classification using IRWLS procedure

  • Kim, Dae-Hak
    • Journal of the Korean Data and Information Science Society
    • /
    • v.20 no.4
    • /
    • pp.749-755
    • /
    • 2009
  • Support vector classification (SVC) provides a more complete description of the linear and nonlinear relationships between input vectors and classifiers. In this paper, we propose a sparse kernel classifier that solves the optimization problem of classification with a modified hinge loss function and an absolute loss function, which provides efficient computation and sparsity. We also introduce a generalized cross validation function to select the hyper-parameters that affect the classification performance of the proposed method. Experimental results are then presented which illustrate the performance of the proposed procedure for classification.

On the Fitting ANOVA Models to Unbalanced Data

  • Jong-Tae Park;Jae-Heon Lee;Byung-Chun Kim
    • Communications for Statistical Applications and Methods
    • /
    • v.2 no.1
    • /
    • pp.48-54
    • /
    • 1995
  • A direct method for fitting analysis-of-variance models to unbalanced data is presented. This method exploits the sparsity and rank deficiency of the model matrix and is based on Gram-Schmidt orthogonalization of a set of sparse columns of the model matrix. The computational algorithm for the sums of squares used in testing estimable hypotheses is given.

Shifted Nadaraya Watson Estimator

  • Chung, Sung-S.
    • Communications for Statistical Applications and Methods
    • /
    • v.4 no.3
    • /
    • pp.881-890
    • /
    • 1997
  • The local linear estimator usually has more attractive properties than the Nadaraya-Watson estimator. However, the local linear estimator performs poorly where data are sparse. Muller and Song proposed the Shifted Nadaraya Watson estimator, which handles data sparsity well. We show through a simulation study that the Shifted Nadaraya Watson estimator performs well not only in the sparse region but also in the dense region. We also suggest a boundary treatment for the Shifted Nadaraya Watson estimator.
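
For reference, the ordinary Nadaraya-Watson estimator is a kernel-weighted average of the responses. The sketch below implements that standard form with a Gaussian kernel; it does not implement the shifted variant, whose details are not given in the abstract. The function names, bandwidth, and data points are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Estimate the regression function at x0 as a kernel-weighted mean of ys."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Can happen numerically when x0 is far from every observation.
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy data: the estimate at the midpoint of two symmetric points is their mean.
xs = [0.0, 2.0]
ys = [0.0, 4.0]
midpoint_estimate = nadaraya_watson(1.0, xs, ys, bandwidth=1.0)

# A constant response is reproduced at any evaluation point.
constant_estimate = nadaraya_watson(0.7, [0.0, 1.0, 2.0, 3.0], [1.0] * 4, bandwidth=0.5)
```

Because the estimate is a ratio of kernel sums, regions with few nearby observations give small, unstable denominators, which is the sparse-data difficulty the abstract refers to.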
