• 제목/요약/키워드: Unsupervised feature selection

검색결과 22건 처리시간 0.027초

Feature Selection via Embedded Learning Based on Tangent Space Alignment for Microarray Data

  • Ye, Xiucai;Sakurai, Tetsuya
    • Journal of Computing Science and Engineering
    • /
    • 제11권4호
    • /
    • pp.121-129
    • /
    • 2017
  • Feature selection has been widely established as an efficient technique for microarray data analysis. Feature selection aims to search for the most important feature/gene subset of a given dataset according to its relevance to the current target. Unsupervised feature selection is considered to be challenging due to the lack of label information. In this paper, we propose a novel method for unsupervised feature selection, which incorporates embedded learning and $l_{2,1}-norm$ sparse regression into a framework to select genes in microarray data analysis. Local tangent space alignment is applied during embedded learning to preserve the local data structure. The $l_{2,1}-norm$ sparse regression acts as a constraint to aid in learning the gene weights correlatively, by which the proposed method optimizes for selecting the informative genes which better capture the interesting natural classes of samples. We provide an effective algorithm to solve the optimization problem in our method. Finally, to validate the efficacy of the proposed method, we evaluate the proposed method on real microarray gene expression datasets. The experimental results demonstrate that the proposed method obtains quite promising performance.

주성분 분석 로딩 벡터 기반 비지도 변수 선택 기법 (Unsupervised Feature Selection Method Based on Principal Component Loading Vectors)

  • 박영준;김성범
    • 대한산업공학회지
    • /
    • 제40권3호
    • /
    • pp.275-282
    • /
    • 2014
  • One of the most widely used methods for dimensionality reduction is principal component analysis (PCA). However, the reduced dimensions from PCA do not provide a clear interpretation with respect to the original features because they are linear combinations of a large number of original features. This interpretation problem can be overcome by feature selection approaches that identifying the best subset of given features. In this study, we propose an unsupervised feature selection method based on the geometrical information of PCA loading vectors. Experimental results from a simulation study demonstrated the efficiency and usefulness of the proposed method.

Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

  • Bsoul, Qusay;Abdul Salam, Rosalina;Atwan, Jaffar;Jawarneh, Malik
    • Journal of Information Science Theory and Practice
    • /
    • 제9권4호
    • /
    • pp.15-34
    • /
    • 2021
  • Text clustering is one of the most commonly used methods for detecting themes or types of documents. Text clustering is used in many fields, but its effectiveness is still not sufficient to be used for the understanding of Arabic text, especially with respect to terms extraction, unsupervised feature selection, and clustering algorithms. In most cases, terms extraction focuses on nouns. Clustering simplifies the understanding of an Arabic text like the text of the Quran; it is important not only for Muslims but for all people who want to know more about Islam. This paper discusses the complexity and limitations of Arabic text clustering in the Quran based on their themes. Unsupervised feature selection does not consider the relationships between the selected features. One weakness of clustering algorithms is that the selection of the optimal initial centroid still depends on chances and manual settings. Consequently, this paper reviews literature about the three major stages of Arabic clustering: terms extraction, unsupervised feature selection, and clustering. Six experiments were conducted to demonstrate previously un-discussed problems related to the metrics used for feature selection and clustering. Suggestions to improve clustering of the Quran based on themes are presented and discussed.

Unsupervised feature selection using orthogonal decomposition and low-rank approximation

  • Lim, Hyunki
    • 한국컴퓨터정보학회논문지
    • /
    • 제27권5호
    • /
    • pp.77-84
    • /
    • 2022
  • 본 논문에서는 새로운 비지도 특징 선별 기법을 제안한다. 기존 비지도 방식의 특징 선별 기법들은 특징을 선별하기 위해 가상의 레이블 데이터를 정하고 주어진 데이터를 이 레이블 데이터에 사영하는 회귀 분석 방식으로 특징을 선별하였다. 하지만 가상의 레이블은 데이터로부터 생성되기 때문에 사영된 공간이 비슷하게 형성될 수 있다. 따라서 기존의 방법들에서는 제한된 공간에서만 특징이 선택될 수 있었다. 이를 해소하기 위해 본 논문에서는 직교 사영과 저랭크 근사를 이용하여 특징을 선별한다. 이 문제를 해소하기 위해 가상의 레이블을 직교 사영하고 이 공간에 데이터를 사영할 수 있도록 한다. 이를 통해 더 주요한 특징 선별을 기대할 수 있다. 그리고 사영을 위한 변환 행렬에 저랭크 제한을 두어 더 효과적으로 저차원 공간의 특징을 선별할 수 있도록 한다. 이 목표를 달성하기 위해 본 논문에서는 비용 함수를 설계하고 효율적인 최적화 방법을 제안한다. 여섯 개의 데이터에 대한 실험 결과는 제안된 방법이 대부분의 경우 기존의 비지도 특징 선별 기법보다 좋은 성능을 보여주었다.

Unsupervised learning with hierarchical feature selection for DDoS mitigation within the ISP domain

  • Ko, Ili;Chambers, Desmond;Barrett, Enda
    • ETRI Journal
    • /
    • 제41권5호
    • /
    • pp.574-584
    • /
    • 2019
  • A new Mirai variant found recently was equipped with a dynamic update ability, which increases the level of difficulty for DDoS mitigation. Continuous development of 5G technology and an increasing number of Internet of Things (IoT) devices connected to the network pose serious threats to cyber security. Therefore, researchers have tried to develop better DDoS mitigation systems. However, the majority of the existing models provide centralized solutions either by deploying the system with additional servers at the host site, on the cloud, or at third party locations, which may cause latency. Since Internet service providers (ISP) are links between the internet and users, deploying the defense system within the ISP domain is the panacea for delivering an efficient solution. To cope with the dynamic nature of the new DDoS attacks, we utilized an unsupervised artificial neural network to develop a hierarchical two-layered self-organizing map equipped with a twofold feature selection for DDoS mitigation within the ISP domain.

Feature Impact Evaluation Based Pattern Classification System

  • Rhee, Hyun-Sook
    • 한국컴퓨터정보학회논문지
    • /
    • 제23권11호
    • /
    • pp.25-30
    • /
    • 2018
  • Pattern classification system is often an important component of intelligent systems. In this paper, we present a pattern classification system consisted of the feature selection module, knowledge base construction module and decision module. We introduce a feature impact evaluation selection method based on fuzzy cluster analysis considering computational approach and generalization capability of given data characteristics. A fuzzy neural network, OFUN-NET based on unsupervised learning data mining technique produces knowledge base for representative clusters. 240 blemish pattern images are prepared and applied to the proposed system. Experimental results show the feasibility of the proposed classification system as an automating defect inspection tool.

정보이론에 기반한 Supervised, Unsupervised 피처 선택 방법론 (Information-based Supervised and Unsupervised Feature Selection Methods)

  • 이상근;장병탁
    • 한국정보과학회:학술대회논문집
    • /
    • 한국정보과학회 2004년도 봄 학술발표논문집 Vol.31 No.1 (B)
    • /
    • pp.637-639
    • /
    • 2004
  • 많은 변수(variable)라 피처(feature)를 포함하는 대규모 데이터에 기계학습 방법론을 적용하는데 있어 그 예측 성능을 향상시키기 위한 방법으로 피처 선택(feature selection)기법이 활발히 연구되고 있다. 그러나 다른 연구를 위한 사전 데이터 분석 작업에 유용하게 사용될 수 있는 단순한 순위기반 피처 선택 방법론은 피처의 중요한 특성을 간과하는 경우가 많으며, 따라서 예측 성능의 향상을 기대하기 어렵다. 본 연구에서는 정보 이론에 기반한 supervised 피처 선택 방법과 이것을 보완할 수 있는 unsupervised 피처 선택 방법을 제시했다. 서로 다른 특성을 가진 다섯 개의 데이터셋에 대해 실험한 결과. 제시된 방법이 기존 방법보다 나은 예측 성능을 보임을 확인했다. 또한 두 방법에서 얻어진 피처들을 결합해 사용할 경우 한가지 방법만으로 추출된 피처를 사용할 경우보다 나은 기계 학습 성능을 보임을 확인했다.

  • PDF

고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법 (Association-based Unsupervised Feature Selection for High-dimensional Categorical Data)

  • 이창기;정욱
    • 품질경영학회지
    • /
    • 제47권3호
    • /
    • pp.537-552
    • /
    • 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method to select the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships among other variables. According to the above definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with other variables. (2) The second step finds the relevant feature subset from the original variables set. This step decides whether a variable is relevant or not. (3) The third step eliminates redundancy variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than without feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increase. Besides, as the number of irrelevant categorical variables that have imbalanced categorical values is increasing, the difference in accuracy between the proposed method and the existing methods being compared increases. Conclusion: According to experimental results, we confirmed that the proposed method makes it possible to consistently produce high classification accuracy rates in high-dimensional categorical data. Therefore, the proposed method is promising to be used effectively in high-dimensional situation.

구조물의 품질 결함 변별력 증대를 위한 수직 에너지 기반의 웨이블릿 Feature 생성 (Structural Quality Defect Discrimination Enhancement using Vertical Energy-based Wavelet Feature Generation)

  • 김준석;정욱
    • 품질경영학회지
    • /
    • 제36권2호
    • /
    • pp.36-44
    • /
    • 2008
  • In this paper a novel feature extraction and selection is carried out in order to improve the discriminating capability between healthy and damaged structure using vibration signals. Although many feature extraction and selection algorithms have been proposed for vibration signals, most proposed approaches don't consider the discriminating ability of features since they are usually in unsupervised manner. We proposed a novel feature extraction and selection algorithm selecting few wavelet coefficients with higher class discriminating capability for damage detection and class visualization. We applied three class separability measures to evaluate the features, i.e. T test statistics, divergence, and Bhattacharyya distance. Experiments with vibration signals from truss structure demonstrate that class separabilities are significantly enhanced using our proposed algorithm compared to other two algorithms with original time-based features and Fourier-based ones.

토픽 모형을 이용한 텍스트 데이터의 단어 선택 (Feature selection for text data via topic modeling)

  • 장우솔;김예은;손원
    • 응용통계연구
    • /
    • 제35권6호
    • /
    • pp.739-754
    • /
    • 2022
  • 텍스트 데이터는 일반적으로 많은 변수를 포함하고 있으며 변수들 사이의 연관성도 높아 통계 분석의 정확성, 효율성 등에서 문제가 생길 수 있다. 이러한 문제점에 대처하기 위해 목표 변수가 주어진 지도 학습에서는 목표 변수를 잘 설명할 수 있는 단어들을 선택하여 이 단어들만 통계 분석에 이용하기도 한다. 반면, 비지도 학습에서는 목표 변수가 주어지지 않으므로 지도 학습에서와 같은 단어 선택 절차를 활용하기 어렵다. 이 연구에서는 토픽 모형을 이용하여 지도 학습에서의 목표 변수를 대신할 수 있는 토픽을 생성하고 각 토픽별로 연관성이 높은 단어들을 선택하는 단어 선택 절차를 제안한다. 제안된 절차를 실제 텍스트 데이터에 적용한 결과, 단어 선택 절차를 이용하면 많은 토픽에서 공통적으로 자주 등장하는 단어들을 제거함으로써 토픽을 더 명확하게 식별할 수 있었다. 또한, 군집 분석에 적용한 결과, 군집과 범주 사이에 높은 연관성을 가지는 군집 분석 결과를 얻을 수 있는 것으로 나타났다. 목표 변수에 대한 정보없이 토픽 모형을 이용하여 선택한 단어들을 분류 분석에 적용하였을 때 목표 변수를 이용하여 단어들을 선택한 경우와 비슷한 분류 정확성을 얻을 수 있음도 확인하였다.