• Title/Summary/Keyword: dimension reduction based methods


Variational Autoencoder Based Dimension Reduction and Clustering for Single-Cell RNA-seq Gene Expression (단일세포 RNA-SEQ의 유전자 발현 군집화를 위한 변이 자동인코더 기반의 차원감소와 군집화)

  • Chi, Sang-Mun
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.11
    • /
    • pp.1512-1518
    • /
    • 2021
  • Since single-cell RNA sequencing provides the expression profiles of individual cells, it offers higher cellular resolution than traditional bulk RNA sequencing. Using these single-cell RNA sequencing data, clustering analysis is generally conducted to find cell types and to understand high-level biological processes. To process the high-dimensional single-cell RNA sequencing data effectively for clustering analysis, this paper uses a variational autoencoder to transform the high-dimensional data space into a lower-dimensional latent space, with the expectation that this latent space will yield more accurate clustering results. By clustering the features in the transformed latent space, we compare the performance of various classical clustering methods on single-cell RNA sequencing data. Experimental results demonstrate that the proposed framework outperforms many state-of-the-art methods under various clustering performance metrics.
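
A minimal sketch of the kind of pipeline the abstract describes, assuming PyTorch and scikit-learn: a small variational autoencoder maps expression profiles into a low-dimensional latent space, and k-means then clusters the latent means. The layer sizes, latent dimension, and number of clusters are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class VAE(nn.Module):
    def __init__(self, n_genes, latent_dim=10, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_genes))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
        return self.dec(z), mu, log_var

def vae_cluster(X, n_clusters=8, epochs=50):
    """X: (cells x genes) float tensor of normalized expression values."""
    model = VAE(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        recon, mu, log_var = model(X)
        recon_loss = ((recon - X) ** 2).sum(dim=1).mean()
        kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1).mean()
        loss = recon_loss + kl
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        latent = model.mu(model.enc(X)).numpy()   # latent means as low-dimensional features
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(latent)

# Placeholder data: 500 cells x 2000 genes
labels = vae_cluster(torch.randn(500, 2000))
```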

Penalized logistic regression using functional connectivity as covariates with an application to mild cognitive impairment

  • Jung, Jae-Hwan;Ji, Seong-Jin;Zhu, Hongtu;Ibrahim, Joseph G.;Fan, Yong;Lee, Eunjee
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.6
    • /
    • pp.603-624
    • /
    • 2020
  • There is an emerging interest in brain functional connectivity (FC) based on functional Magnetic Resonance Imaging in Alzheimer's disease (AD) studies. The complex and high-dimensional structure of FC makes it challenging to explore the association between altered connectivity and AD susceptibility. We develop a pipeline to refine FC into proper covariates for a penalized logistic regression model and classify normal and AD-susceptible groups. Three different quantification methods are proposed for FC refinement. One of the methods is dimension reduction based on common component analysis (CCA), which is employed to address the limitations of the other methods. We applied the proposed pipeline to the Alzheimer's Disease Neuroimaging Initiative (ADNI) data and deduced pathogenic FC biomarkers associated with AD susceptibility. The refined FC biomarkers were related to brain regions for cognition, stimuli processing, and sensorimotor skills. We also demonstrated that the model using CCA performed better than the others in terms of classification performance and goodness-of-fit.
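
A minimal sketch of the penalized-logistic-regression step, assuming scikit-learn; the paper's CCA-based FC refinement is not reproduced here, and the upper-triangle vectorization of each subject's connectivity matrix is only a simple stand-in for FC quantification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fc_to_features(fc_matrices):
    """Vectorize the upper triangle of each subject's R x R FC matrix."""
    r = fc_matrices[0].shape[0]
    iu = np.triu_indices(r, k=1)
    return np.vstack([fc[iu] for fc in fc_matrices])

# Placeholder data: 60 subjects, 90 regions; y: 0 = normal, 1 = AD-susceptible
rng = np.random.default_rng(0)
fc_list = [np.corrcoef(rng.normal(size=(90, 120))) for _ in range(60)]
y = rng.integers(0, 2, size=60)

X = fc_to_features(fc_list)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # L1 penalty selects FC edges
print(cross_val_score(clf, X, y, cv=5).mean())
```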

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.105-122
    • /
    • 2019
  • Dimensionality reduction is one way to handle big data in text mining. When reducing dimensionality, the density of the data should be considered, because it has a significant influence on the performance of sentence classification. Higher-dimensional data require more computation, which eventually leads to high computational cost and overfitting in the model; a dimension reduction step is therefore necessary to improve model performance. Diverse methods have been proposed, ranging from simply lessening noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. In addition, the representation and selection of text features affect the performance of the classifier in sentence classification, one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words and can capture semantic and syntactic information from the data, are also used. To improve performance, recent studies have suggested modifying the word dictionary according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm identifies words that are unimportant, we assume that words similar to them also have little impact on sentence classification. This study proposes two ways to achieve more accurate classification, which eliminate words selectively under specific rules and construct word embeddings based on Word2Vec. To select words of low importance from the text, we use information gain to measure importance and cosine similarity to search for similar words. First, we eliminate words with comparatively low information gain values from the raw text and build the word embedding. Second, we additionally remove words that are similar to the low-information-gain words and build the word embedding. Finally, the filtered text and word embeddings are fed to the deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews on Kindle from Amazon.com, IMDB, and Yelp as datasets, and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes, with a helpful-vote ratio over 70%, were labeled as helpful reviews; since Yelp shows only the number of helpful votes, we extracted 100,000 reviews with more than five helpful votes by random sampling from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared them against Word2Vec and GloVe embeddings built on all the words, and showed that one of the proposed methods outperforms the embeddings that use all the words. Removing unimportant words improves performance; however, removing too many words lowers it.
For future research, diverse preprocessing schemes and an in-depth analysis of word co-occurrence for measuring similarity between words should be considered. In addition, we applied the proposed method only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed elimination methods, and the possible combinations of word embedding and elimination methods can be investigated.
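
A minimal sketch of the two filtering ideas, assuming scikit-learn and gensim: words with low information gain (approximated here by mutual information between word presence and the class label) are dropped, and words whose Word2Vec vectors are very similar to those low-IG words are dropped as well. The thresholds and the toy corpus are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from gensim.models import Word2Vec

docs = ["great book easy to read", "boring plot and bad writing",
        "helpful and detailed review", "terrible shipping experience"]   # placeholder corpus
labels = [1, 0, 1, 0]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)
ig = mutual_info_classif(X, labels, discrete_features=True)   # importance per word
vocab = np.array(vec.get_feature_names_out())
low_ig = set(vocab[ig < np.quantile(ig, 0.3)])                # lowest 30% of importance

w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1, seed=0)
to_drop = set(low_ig)
for w in low_ig:                                              # expand with similar words
    to_drop.update(s for s, sim in w2v.wv.most_similar(w, topn=3) if sim > 0.8)

filtered = [" ".join(t for t in d.split() if t not in to_drop) for d in docs]
```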

Eigen Palmprint Identification Algorithm using PCA(Principal Components Analysis) (주성분 분석법을 이용한 고유장문 인식 알고리즘)

  • Noh Jin-Soo;Rhee Kang-Hyeon
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.43 no.3 s.309
    • /
    • pp.82-89
    • /
    • 2006
  • Palmprint-based personal identification, a new member of the biometrics family, has become an active research topic in recent years. Although many methods have been proposed, how to represent the palmprint for effective classification is still an open research problem. In this paper, a palmprint classification and recognition method based on PCA (Principal Components Analysis) with dimension reduction of the singular vectors is proposed. A 135 dpi palmprint image obtained by the palmprint acquisition device is used to build an effective palmprint recognition system. The proposed system consists of the palmprint acquisition device, a DB generation algorithm, and the palmprint recognition algorithm. The palmprint recognition step is limited to two attempts. As a result, the GAR and FAR are 98.5% and 0.036%, respectively.
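
A minimal sketch of eigen-palmprint matching with PCA, assuming scikit-learn and that palmprint images are already cropped, aligned, and flattened; the number of components and the 1-nearest-neighbor matcher are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Placeholder gallery: 100 flattened 64x64 palmprints, 20 persons x 5 samples
rng = np.random.default_rng(0)
X_train = rng.random((100, 64 * 64))
y_train = np.repeat(np.arange(20), 5)

pca = PCA(n_components=40).fit(X_train)            # "eigenpalms" from the enrollment set
proj_train = pca.transform(X_train)
matcher = KNeighborsClassifier(n_neighbors=1).fit(proj_train, y_train)

def identify(palm_image_vec):
    """Project a probe palmprint onto the eigenpalm space and match it to an ID."""
    return matcher.predict(pca.transform(palm_image_vec.reshape(1, -1)))[0]

print(identify(X_train[7]))   # should recover the enrolled person ID
```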

Sensor placement for structural health monitoring of Canton Tower

  • Yi, Ting-Hua;Li, Hong-Nan;Gu, Ming
    • Smart Structures and Systems
    • /
    • v.10 no.4_5
    • /
    • pp.313-329
    • /
    • 2012
  • A challenging issue in the design and implementation of an effective structural health monitoring (SHM) system is to determine where a limited number of sensors should be installed. In this paper, research on optimal sensor placement (OSP) is carried out on the 610 m high Canton Tower (formerly named Guangzhou New Television Tower). To avoid the computationally demanding problem caused by the tens of thousands of degrees of freedom (DOFs) involved in the dynamic analysis, the three-dimensional finite element (FE) model of the Canton Tower is first simplified to a system with fewer DOFs. Considering that sensors can be physically placed only on the translational DOFs of the structure, and not on the rotational DOFs, a new method is proposed that takes the horizontal DOFs as master DOFs and the rotational DOFs as slave DOFs, and eliminates the slave DOFs by model reduction. The reduced model is obtained by the IIRS method and compared with models reduced by the Guyan, Kuhar, and IRS methods. Finally, the OSP of the Canton Tower is obtained by a dual-structure-coding-based generalized genetic algorithm (GGA).
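
A small NumPy sketch of Guyan (static) condensation, the simplest of the model reduction schemes compared in the paper: the slave DOFs (e.g., rotations) are condensed onto the master DOFs (e.g., horizontal translations). The random matrices below stand in for the FE stiffness and mass matrices.

```python
import numpy as np

def guyan_reduce(K, M, master):
    """Reduce stiffness K and mass M to the master DOFs listed in `master`."""
    n = K.shape[0]
    slave = np.setdiff1d(np.arange(n), master)
    Kss = K[np.ix_(slave, slave)]
    Ksm = K[np.ix_(slave, master)]
    # Transformation T maps master DOFs to all DOFs: x = T @ x_m
    T = np.zeros((n, len(master)))
    T[master, np.arange(len(master))] = 1.0
    T[np.ix_(slave, np.arange(len(master)))] = -np.linalg.solve(Kss, Ksm)
    return T.T @ K @ T, T.T @ M @ T

# Symmetric positive-definite stand-ins for K and M
rng = np.random.default_rng(0)
A = rng.random((10, 10)); K = A @ A.T + 10 * np.eye(10)
B = rng.random((10, 10)); M = B @ B.T + 10 * np.eye(10)

K_r, M_r = guyan_reduce(K, M, master=np.array([0, 2, 4, 6]))
print(K_r.shape)   # (4, 4) reduced system
```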

Probabilistic penalized principal component analysis

  • Park, Chongsun;Wang, Morgan C.;Mo, Eun Bi
    • Communications for Statistical Applications and Methods
    • /
    • v.24 no.2
    • /
    • pp.143-154
    • /
    • 2017
  • A variable selection method based on probabilistic principal component analysis (PCA) using a penalized likelihood method is proposed. The proposed method is a two-step variable reduction method. The first step uses the probabilistic principal component idea to identify principal components; the penalty function is then used to identify important variables in each component. We then build a model on the original data space instead of on the rotated data space of latent variables (principal components), because the proposed method achieves dimension reduction by identifying important observed variables. Consequently, the proposed method is of more practical use. The proposed estimators perform like the oracle procedure and are root-n consistent with a proper choice of regularization parameters. The proposed method can be successfully applied to high-dimensional PCA problems in which a relatively large portion of irrelevant variables is included in the data set. It is straightforward to extend our likelihood method to handle problems with missing observations using EM algorithms; further, it can be effectively applied in cases where some data vectors exhibit one or more values missing at random.
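
An illustrative sketch only: the paper's probabilistic penalized PCA is not a standard library routine, so scikit-learn's SparsePCA (a related penalized-loading method) stands in to show the idea of keeping variables with nonzero sparse-component loadings and then modeling on the original variables rather than on the rotated scores.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p_relevant, p_noise = 200, 5, 45
signal = rng.normal(size=(n, p_relevant))
X = np.hstack([signal, rng.normal(scale=0.1, size=(n, p_noise))])   # many irrelevant columns
y = signal @ rng.normal(size=p_relevant) + rng.normal(scale=0.1, size=n)

spca = SparsePCA(n_components=3, alpha=2.0, random_state=0).fit(X)
keep = np.where(np.abs(spca.components_).sum(axis=0) > 0)[0]   # variables with nonzero loadings
print("selected variables:", keep)

# Build the final model on the selected original variables, not on the component scores
model = LinearRegression().fit(X[:, keep], y)
```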

Clustering Algorithm for Time Series with Similar Shapes

  • Ahn, Jungyu;Lee, Ju-Hong
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.7
    • /
    • pp.3112-3127
    • /
    • 2018
  • Since time series clustering is performed without prior information, it is used for exploratory data analysis. In particular, clusters of time series with similar shapes can be used in various fields, such as business, medicine, finance, and communications. However, existing time series clustering algorithms have a problem in that time series with different shapes end up in the same clusters. The reason is that the existing algorithms neither constrain the size of the generated clusters nor avoid dimension reduction methods with large information loss. In this paper, we propose a method to alleviate the disadvantages of existing methods and to find better-quality clusters containing similarly shaped time series. In the data preprocessing step, we normalize the time series using z-transformation. Then, we use piecewise aggregate approximation (PAA) to reduce the dimension of the time series. In the clustering step, we use density-based spatial clustering of applications with noise (DBSCAN) to create preclusters. We then use a modified K-means algorithm to refine the preclusters containing differently shaped time series into subclusters containing only similarly shaped time series. In our experiments, our method showed better results than the existing methods.
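
A minimal sketch of the pipeline described above, assuming scikit-learn: z-normalize each series, reduce its dimension with piecewise aggregate approximation (PAA), form preclusters with DBSCAN, then split each precluster with k-means. The eps, min_samples, and per-precluster k values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def z_normalize(ts):
    return (ts - ts.mean()) / (ts.std() + 1e-8)

def paa(ts, segments):
    """Average the series over equal-length segments (length must divide evenly)."""
    return ts.reshape(segments, -1).mean(axis=1)

# Placeholder data: 60 series of length 128
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
reduced = np.vstack([paa(z_normalize(ts), segments=16) for ts in X])

pre = DBSCAN(eps=1.5, min_samples=3).fit_predict(reduced)   # preclusters (-1 = noise)

labels = np.full(len(X), -1)
next_id = 0
for c in set(pre) - {-1}:
    idx = np.where(pre == c)[0]
    k = max(1, len(idx) // 10)                 # refine each precluster into subclusters
    sub = KMeans(n_clusters=k, n_init=10).fit_predict(reduced[idx])
    labels[idx] = sub + next_id
    next_id += k
```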

Facial Feature Extraction Using Energy Probability in Frequency Domain (주파수 영역에서 에너지 확률을 이용한 얼굴 특징 추출)

  • Choi Jean;Chung Yns-Su;Kim Ki-Hyun;Yoo Jang-Hee
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.43 no.4 s.310
    • /
    • pp.87-95
    • /
    • 2006
  • In this paper, we propose a novel feature extraction method for face recognition based on the Discrete Cosine Transform (DCT), Energy Probability (EP), and Linear Discriminant Analysis (LDA). We define an energy probability as the magnitude of effective information and use it to create a frequency mask in the DCT domain. The feature extraction method consists of three steps: i) the spatial domain of the face images is transformed into the frequency (DCT) domain; ii) the energy probability is applied to the DCT coefficients obtained from each face image to reduce the dimension of the data and retain the valid information; iii) to obtain the most significant and invariant features of the face images, LDA is applied to the data extracted with the frequency mask. In experiments, the recognition rate is 96.8% on the ETRI database and 100% on the ORL database. The proposed method shows improvements in the dimension reduction of the feature space and in face recognition over previously proposed methods.
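
A minimal sketch of the DCT-plus-LDA pipeline, assuming SciPy, scikit-learn, and aligned grayscale face images; keeping only a low-frequency block of DCT coefficients is a simple stand-in for the paper's energy-probability mask, whose exact construction is not reproduced here.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dct_features(img, block=8):
    """2-D DCT, then keep the top-left (low-frequency) block of coefficients."""
    coeffs = dctn(img, norm="ortho")
    return coeffs[:block, :block].ravel()

# Placeholder data: 80 aligned 64x64 faces, 10 subjects x 8 images
rng = np.random.default_rng(0)
images = rng.random((80, 64, 64))
y = np.repeat(np.arange(10), 8)

X = np.vstack([dct_features(im) for im in images])
lda = LinearDiscriminantAnalysis().fit(X, y)    # discriminant features for recognition
print(lda.transform(X).shape)                   # (80, n_classes - 1)
```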

A personalized exercise recommendation system using dimension reduction algorithms

  • Lee, Ha-Young;Jeong, Ok-Ran
    • Journal of the Korea Society of Computer and Information
    • /
    • v.26 no.6
    • /
    • pp.19-28
    • /
    • 2021
  • Nowadays, interest in health care is increasing due to Coronavirus (COVID-19), and many people are doing home training because of the difficulties in using fitness centers and other shared public facilities. In this paper, we propose a personalized exercise recommendation algorithm that uses personal propensity information to provide more accurate and meaningful exercise recommendations to home training users. We classify the data according to obesity criteria with a k-nearest neighbor algorithm, using personal information that can characterize individuals, such as eating habits and physical condition, and we differentiate the exercise dataset by the level of exercise activity. Based on the neighborhood information of each dataset, we provide personalized exercise recommendations through a dimensionality reduction algorithm (SVD), a model-based collaborative filtering method. In this way, we address the data sparsity and scalability problems of memory-based collaborative filtering techniques, and we verify the accuracy and performance of the proposed algorithms.
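
A minimal sketch of the two stages described above, assuming scikit-learn and SciPy: k-NN groups users by obesity-related personal attributes, and truncated SVD factorizes a sparse user-by-exercise rating matrix to produce recommendation scores. The shapes, k, and number of latent factors are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(0)

# Stage 1: k-NN on personal features (e.g., BMI, eating-habit scores)
profiles = rng.random((200, 5))
obesity_class = (profiles[:, 0] > 0.6).astype(int)         # placeholder labels
knn = KNeighborsClassifier(n_neighbors=5).fit(profiles, obesity_class)
group = knn.predict(profiles[:1])[0]                        # group of the query user

# Stage 2: model-based CF via SVD on the sparse user x exercise matrix
ratings = sparse_random(200, 50, density=0.05, random_state=0, format="csr")
svd = TruncatedSVD(n_components=10, random_state=0)
user_factors = svd.fit_transform(ratings)                   # users in latent factor space
scores = user_factors @ svd.components_                     # reconstructed preference scores
top_items = np.argsort(-scores[0])[:5]                      # top-5 exercises for user 0
```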

Automatic Generation of Mid-Surfaces of Solid Models by Maximal Volume Decomposition (최대볼륨분해 방법을 이용한 중립면 모델의 자동생성)

  • Woo, Yoon-Hwan;Choo, Chang-Upp
    • Korean Journal of Computational Design and Engineering
    • /
    • v.14 no.5
    • /
    • pp.297-305
    • /
    • 2009
  • Automatic generation of the mid-surfaces of a CAD model is becoming a useful function in that it can increase the efficiency of engineering analysis as long as it does not seriously affect the result. Several methods have been proposed to automatically generate mid-surfaces, but they often fail for complex CAD models. Due to the inherent difficulty of mid-surface generation, it may not be possible to come up with a complete and general method for this problem: a method that handles one specific case may not work for others, so developing case-specific methods ends up solving only a fraction of the problem. In this paper, therefore, we propose a method to generate mid-surfaces based on a divide-and-conquer paradigm. The method first decomposes a complex CAD model into simple volumes; the mid-surfaces of the simple volumes are then generated automatically by existing methods and finally converted into the mid-surfaces of the original CAD model.