• Title/Summary/Keyword: Unsupervised feature selection

Feature Selection via Embedded Learning Based on Tangent Space Alignment for Microarray Data

  • Ye, Xiucai;Sakurai, Tetsuya
    • Journal of Computing Science and Engineering, v.11 no.4, pp.121-129, 2017
  • Feature selection is widely established as an efficient technique for microarray data analysis. It aims to find the most important feature (gene) subset of a given dataset according to its relevance to the current target. Unsupervised feature selection is considered challenging because label information is lacking. In this paper, we propose a novel method for unsupervised feature selection that incorporates embedded learning and $\ell_{2,1}$-norm sparse regression into a single framework to select genes in microarray data analysis. Local tangent space alignment is applied during embedded learning to preserve the local data structure. The $\ell_{2,1}$-norm sparse regression acts as a constraint that helps learn the gene weights jointly, so that the proposed method selects the informative genes that better capture the natural classes of samples. We provide an effective algorithm to solve the optimization problem in our method. Finally, to validate the efficacy of the proposed method, we evaluate it on real microarray gene expression datasets. The experimental results demonstrate that the proposed method obtains promising performance.
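
As a rough illustration of the $\ell_{2,1}$-norm mentioned in the abstract above: it is the sum of the Euclidean norms of the rows of a weight matrix, so penalizing it tends to drive whole rows (here, genes) to zero. The sketch below only computes the norm and ranks rows by their norm. The matrix `W` and the function names are hypothetical, and this is not the authors' optimization algorithm.

```python
import math

def l21_norm(W):
    """l2,1-norm: sum of the Euclidean (l2) norms of the rows of W."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def rank_rows_by_norm(W):
    """Row indices sorted by decreasing l2 norm (largest weight first)."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in W]
    return sorted(range(len(W)), key=lambda i: norms[i], reverse=True)

# Hypothetical weight matrix: 4 genes (rows) x 2 embedding dimensions.
W = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

print(l21_norm(W))           # row norms are 5, 0, 1 and 2
print(rank_rows_by_norm(W))  # genes ordered from largest to smallest row norm
```

Rows with a zero norm, such as the second gene here, contribute nothing to any embedding dimension, which is why row norms are a common selection criterion in this family of methods.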

Unsupervised Feature Selection Method Based on Principal Component Loading Vectors (주성분 분석 로딩 벡터 기반 비지도 변수 선택 기법)

  • Park, Young Joon;Kim, Seoung Bum
    • Journal of Korean Institute of Industrial Engineers, v.40 no.3, pp.275-282, 2014
  • One of the most widely used methods for dimensionality reduction is principal component analysis (PCA). However, the reduced dimensions from PCA do not provide a clear interpretation with respect to the original features because they are linear combinations of a large number of original features. This interpretation problem can be overcome by feature selection approaches that identify the best subset of the given features. In this study, we propose an unsupervised feature selection method based on the geometrical information of PCA loading vectors. Experimental results from a simulation study demonstrate the efficiency and usefulness of the proposed method.
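
The abstract above does not say which geometrical information of the loading vectors is used, so the following is only a simple stand-in: it ranks the original features by the absolute value of their loading on the first principal component. The function names and the toy data are assumptions made for illustration, and the leading eigenvector is found by plain power iteration to keep the sketch dependency-free.

```python
import math

def first_loading_vector(X, iters=200):
    """Leading eigenvector of the sample covariance of X (rows = samples)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector and that the covariance is non-zero.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def rank_features_by_loading(X):
    """Feature indices sorted by decreasing absolute first-PC loading."""
    v = first_loading_vector(X)
    return sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)

# Hypothetical data: features 0 and 1 vary together, feature 2 barely moves.
X = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.5], [3.0, 5.9, 0.6],
     [4.0, 8.0, 0.5], [5.0, 10.1, 0.4]]
print(rank_features_by_loading(X))
```

Because the loadings depend on feature scale, real data would normally be standardized first; that step is left out here for brevity.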

Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

  • Bsoul, Qusay;Abdul Salam, Rosalina;Atwan, Jaffar;Jawarneh, Malik
    • Journal of Information Science Theory and Practice, v.9 no.4, pp.15-34, 2021
  • Text clustering is one of the most commonly used methods for detecting themes or types of documents. It is used in many fields, but it is still not effective enough for understanding Arabic text, especially with respect to term extraction, unsupervised feature selection, and clustering algorithms. In most cases, term extraction focuses on nouns. Clustering simplifies the understanding of an Arabic text such as the Quran, which is important not only for Muslims but for all people who want to know more about Islam. This paper discusses the complexity and limitations of theme-based clustering of Arabic text in the Quran. Unsupervised feature selection does not consider the relationships between the selected features. One weakness of clustering algorithms is that the selection of the optimal initial centroid still depends on chance and manual settings. Consequently, this paper reviews the literature on the three major stages of Arabic clustering: term extraction, unsupervised feature selection, and clustering. Six experiments were conducted to demonstrate previously undiscussed problems related to the metrics used for feature selection and clustering. Suggestions to improve theme-based clustering of the Quran are presented and discussed.

Unsupervised feature selection using orthogonal decomposition and low-rank approximation

  • Lim, Hyunki
    • Journal of the Korea Society of Computer and Information, v.27 no.5, pp.77-84, 2022
  • In this paper, we propose a novel unsupervised feature selection method. Conventional unsupervised feature selection methods define virtual labels and use a regression analysis that projects the given data onto these labels. However, since virtual labels are generated from the data, they can be formed similarly in the space. Thus, in the conventional methods, features can be selected only in a restricted space. To solve this problem, features are selected using orthogonal projections and low-rank approximations: a virtual label is projected onto an orthogonal space, and the given data set is also projected onto this space. Through this process, effective features can be selected. In addition, the projection matrix is restricted to be low-rank so that more effective features can be selected in a low-dimensional space. To achieve these objectives, a cost function is designed and an efficient optimization method is proposed. Experimental results on six data sets demonstrate that the proposed method outperforms existing unsupervised feature selection methods in most cases.

Unsupervised learning with hierarchical feature selection for DDoS mitigation within the ISP domain

  • Ko, Ili;Chambers, Desmond;Barrett, Enda
    • ETRI Journal, v.41 no.5, pp.574-584, 2019
  • A recently discovered Mirai variant is equipped with a dynamic update ability, which makes DDoS mitigation more difficult. The continuous development of 5G technology and the increasing number of Internet of Things (IoT) devices connected to the network pose serious threats to cyber security. Therefore, researchers have tried to develop better DDoS mitigation systems. However, the majority of the existing models provide centralized solutions by deploying the system with additional servers at the host site, on the cloud, or at third-party locations, which may cause latency. Since Internet service providers (ISPs) are the links between the Internet and users, deploying the defense system within the ISP domain is the panacea for delivering an efficient solution. To cope with the dynamic nature of the new DDoS attacks, we utilized an unsupervised artificial neural network to develop a hierarchical two-layered self-organizing map equipped with a twofold feature selection for DDoS mitigation within the ISP domain.

Feature Impact Evaluation Based Pattern Classification System

  • Rhee, Hyun-Sook
    • Journal of the Korea Society of Computer and Information, v.23 no.11, pp.25-30, 2018
  • A pattern classification system is often an important component of intelligent systems. In this paper, we present a pattern classification system consisting of a feature selection module, a knowledge base construction module, and a decision module. We introduce a feature impact evaluation selection method based on fuzzy cluster analysis that considers the computational approach and the generalization capability of the given data characteristics. A fuzzy neural network, OFUN-NET, based on an unsupervised learning data mining technique, produces the knowledge base for representative clusters. A total of 240 blemish pattern images are prepared and applied to the proposed system. Experimental results show the feasibility of the proposed classification system as an automated defect inspection tool.

Information-based Supervised and Unsupervised Feature Selection Methods (정보이론에 기반한 Supervised, Unsupervised 피처 선택 방법론)

  • 이상근;장병탁
    • Proceedings of the Korean Information Science Society Conference, 2004.04b, pp.637-639, 2004
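  • Feature selection techniques are being actively studied as a way to improve predictive performance when machine learning methods are applied to large-scale data containing many variables, or features. However, simple ranking-based feature selection methods, which can be useful for preliminary data analysis ahead of other studies, often overlook important characteristics of the features, so an improvement in predictive performance is hard to expect from them. In this study, we present a supervised feature selection method based on information theory and an unsupervised feature selection method that can complement it. Experiments on five datasets with different characteristics confirmed that the proposed methods achieve better predictive performance than existing methods. We also confirmed that combining the features obtained by the two methods yields better machine learning performance than using the features extracted by only one of them.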

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management, v.47 no.3, pp.537-552, 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method to select the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships with other variables. According to this definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with respect to the other variables. (2) The second step finds the relevant feature subset from the original variable set. This step decides whether a variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than using no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. In addition, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods being compared also increases. Conclusion: According to the experimental results, we confirmed that the proposed method consistently produces high classification accuracy rates in high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional situations.
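
The normalized conditional entropy named in the abstract above can be sketched as H(X | Y) / H(X) for two categorical variables: a value near 0 means Y almost determines X, and a value near 1 means Y says little about X. How the paper combines these pairwise values into its goodness-to-pick measure is not given here, so the functions below (whose names are hypothetical) only compute the pairwise quantity.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)

def normalized_conditional_entropy(x, y):
    """H(X | Y) / H(X); defined as 0.0 when X is constant."""
    hx = entropy(x)
    return conditional_entropy(x, y) / hx if hx > 0 else 0.0

# Hypothetical columns: b determines a exactly, c is unrelated to a.
a = ["r", "r", "g", "g"]
b = ["s", "s", "t", "t"]
c = ["u", "v", "u", "v"]

print(normalized_conditional_entropy(a, b))  # low: b is informative about a
print(normalized_conditional_entropy(a, c))  # high: c is not
```

Under the abstract's definition, a variable such as `a` would count as relevant because at least one other variable (`b`) is strongly related to it.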

Structural Quality Defect Discrimination Enhancement using Vertical Energy-based Wavelet Feature Generation (구조물의 품질 결함 변별력 증대를 위한 수직 에너지 기반의 웨이블릿 Feature 생성)

  • Kim, Joon-Seok;Jung, Uk
    • Journal of Korean Society for Quality Management, v.36 no.2, pp.36-44, 2008
  • In this paper, a novel feature extraction and selection procedure is carried out to improve the ability to discriminate between healthy and damaged structures using vibration signals. Although many feature extraction and selection algorithms have been proposed for vibration signals, most of them do not consider the discriminating ability of the features because they are usually unsupervised. We propose a novel feature extraction and selection algorithm that selects a few wavelet coefficients with higher class-discriminating capability for damage detection and class visualization. We apply three class separability measures to evaluate the features: the t-test statistic, divergence, and the Bhattacharyya distance. Experiments with vibration signals from a truss structure demonstrate that class separability is significantly enhanced by the proposed algorithm compared with two other algorithms that use the original time-based features and Fourier-based features.
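
Two of the separability measures named in the abstract above can be sketched for a single feature measured in two classes. The version below assumes each class is roughly Gaussian and uses the closed-form Bhattacharyya distance for univariate normals together with Welch's t statistic. The abstract does not state which estimators the authors used, so both the Gaussian assumption and the function names are illustrative.

```python
import math
import statistics

def bhattacharyya_gaussian(a, b):
    """Bhattacharyya distance between Gaussians fitted to samples a and b."""
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term

def welch_t(a, b):
    """Welch's t statistic for the difference in means of a and b."""
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    return (m1 - m2) / math.sqrt(v1 / len(a) + v2 / len(b))

# Hypothetical values of one wavelet coefficient in each class.
healthy = [1.0, 1.2, 0.8, 1.1, 0.9]
damaged = [2.0, 2.3, 1.7, 2.1, 1.9]

print(bhattacharyya_gaussian(healthy, damaged))
print(welch_t(healthy, damaged))
```

Scoring every candidate coefficient this way and keeping the highest-scoring ones is one simple way to turn a separability measure into a selection rule.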

Feature selection for text data via topic modeling (토픽 모형을 이용한 텍스트 데이터의 단어 선택)

  • Woosol, Jang;Ye Eun, Kim;Won, Son
    • The Korean Journal of Applied Statistics, v.35 no.6, pp.739-754, 2022
  • Text data usually consist of many variables, some of which are closely correlated. Such multicollinearity often results in inefficient or inaccurate statistical analysis. For supervised learning, one can select features by examining the relationship between target variables and explanatory variables. For unsupervised learning, however, target variables are absent, so the feature selection procedures used in supervised learning cannot be applied. In this study, we propose a word selection procedure that employs topic models to find latent topics. We substitute topics for the target variables and select terms that show high relevance to each topic. Applying the procedure to real data, we found that the proposed word selection procedure can give a clear topic interpretation by removing high-frequency words prevalent in various topics. In addition, we observed that, when the selected variables are used in classifiers such as naïve Bayes classifiers and support vector machines, the proposed feature selection procedure gives results comparable to those obtained by using class label information.