• Title/Summary/Keyword: Feature Set Selection

A Pre-processing Study to Solve the Problem of Rare Class Classification of Network Traffic Data (네트워크 트래픽 데이터의 희소 클래스 분류 문제 해결을 위한 전처리 연구)

  • Ryu, Kyung Joon;Shin, DongIl;Shin, DongKyoo;Park, JeongChan;Kim, JinGoog
    • KIPS Transactions on Software and Data Engineering / v.9 no.12 / pp.411-418 / 2020
  • In the field of information security, an IDS (Intrusion Detection System) is normally classified into two categories: signature-based IDS and anomaly-based IDS. Many studies on anomaly-based IDS have analyzed network traffic data generated in cyberspace with machine learning algorithms. In this paper, we study pre-processing methods to overcome the performance degradation caused by rare classes. We evaluated the classification performance of a machine learning algorithm by reconstructing the data set around rare and semi-rare classes. After reconstructing the data into three different sets, wrapper and filter feature selection methods are applied in sequence, and each data set is regularized with a quantile scaler. A deep neural network model is used for learning and validation. The evaluation results are compared in terms of true positive and false negative values. We obtained improved classification performance on all three data sets.
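
A minimal sketch of the filter-then-wrapper preprocessing described above, assuming scikit-learn; the chosen estimators, feature counts, scaler placement, and synthetic data are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import SelectKBest, RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for one reconstructed (rare / semi-rare class) traffic set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

preprocess = Pipeline([
    # Regularize each feature with a quantile scaler, as in the abstract.
    ("scale", QuantileTransformer(n_quantiles=100, output_distribution="uniform")),
    # Filter step: keep the 20 features ranked highest by mutual information.
    ("filter", SelectKBest(mutual_info_classif, k=20)),
    # Wrapper step: recursively eliminate features with a simple classifier.
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
])

X_dnn = preprocess.fit_transform(X, y)   # features handed to the deep neural network
print(X_dnn.shape)                       # (200, 10)
```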

A Chi-Square-Based Decision for Real-Time Malware Detection Using PE-File Features

  • Belaoued, Mohamed;Mazouzi, Smaine
    • Journal of Information Processing Systems / v.12 no.4 / pp.644-660 / 2016
  • The real-time detection of malware remains an open issue, since most existing approaches to malware categorization focus on improving accuracy rather than detection time. Therefore, finding a proper balance between these two characteristics is very important, especially for such sensitive systems. In this paper, we present a fast portable executable (PE) malware detection system based on the analysis of the set of Application Programming Interfaces (APIs) called by a program and some technical PE features (TPFs). We use an efficient feature selection method that first selects the most relevant APIs and TPFs using the chi-square ($\chi^2$) measure and then uses the Phi ($\varphi$) coefficient to group the features into subsets according to their relevance. We evaluated our method using different classifiers trained on different combinations of feature subsets and obtained very satisfying results, with more than 98% accuracy. Our system is adequate for real-time detection since it is able to categorize a file (malware or benign) in 0.09 seconds.
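
A compact illustration of chi-square ranking of binary API-call features, assuming scikit-learn and NumPy; the toy matrix and the phi computation are illustrative, not the paper's data or full two-stage method.

```python
# Rank binary API-call / PE features by their chi-square relevance to the
# malware/benign label; for 2x2 contingency tables, phi = sqrt(chi2 / n).
import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[1, 0, 1],      # rows: PE files, columns: API / TPF indicators
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])    # 1 = malware, 0 = benign

scores, p_values = chi2(X, y)           # chi-square score per feature
phi = np.sqrt(scores / len(y))          # phi coefficient used to group features
ranked = scores.argsort()[::-1]         # most relevant APIs / TPFs first
print(ranked, scores[ranked], phi[ranked])
```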

An Algorithm for Automatic Guided Vehicle Scheduling Problems (자동유도운반차 (Automatic Guided Vehicle) 스케쥴링 해법)

  • Park, Yang-Byeong;Jeon, Deok-Bin
    • Journal of Korean Institute of Industrial Engineers / v.13 no.1 / pp.11-24 / 1987
  • Automatic Guided Vehicle systems feature battery-powered driverless vehicles with programming capabilities for path selection and positioning. Vehicles serve the machines in the shop, following a guide path system installed on the shop floor. The basic problem in such a system is to determine a fixed set of vehicle routes of minimal total distance (time) while satisfying capacity and distance (time) constraints. In this paper, a heuristic algorithm is presented for scheduling the automatic guided vehicles. The algorithm routes the machines based on their distances and polar coordinate angles, taking into account the structural features of the system. Computational experiments are performed on several test problems in order to evaluate the proposed algorithm. Finally, a framework for dealing with the case where supplies from the machines are probabilistic is described.
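
The routing idea (ordering machines by polar-coordinate angle about the depot, then cutting routes at the capacity limit) resembles a sweep-style heuristic; the sketch below illustrates that idea with hypothetical coordinates and demands, and is not the paper's exact algorithm.

```python
import math

machines = {"M1": (2.0, 1.0, 4), "M2": (-1.0, 2.0, 3),
            "M3": (0.5, -2.0, 5), "M4": (3.0, 3.0, 2)}   # (x, y, demand)
capacity = 8

# Order machines by polar-coordinate angle about the depot at the origin.
ordered = sorted(machines, key=lambda m: math.atan2(machines[m][1], machines[m][0]))

routes, current, load = [], [], 0
for m in ordered:
    demand = machines[m][2]
    if load + demand > capacity:   # start a new vehicle route when the current one is full
        routes.append(current)
        current, load = [], 0
    current.append(m)
    load += demand
routes.append(current)
print(routes)   # e.g. [['M3'], ['M1', 'M4'], ['M2']]
```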

A Study on Classifications of Remote Sensed Multispectral Image Data using Soft Computing Technique - Stressed on Rough Sets - (소프트 컴퓨팅기술을 이용한 원격탐사 다중 분광 이미지 데이터의 분류에 관한 연구 -Rough 집합을 중심으로-)

  • Won Sung-Hyun
    • Management & Information Systems Review / v.3 / pp.15-45 / 1999
  • Computer-based processing of remote sensed image data has become an essential technique in many fields, such as environmental observation, land cultivation, resource investigation, military trend analysis, and agricultural product estimation. In particular, accurate classification and analysis of remote sensed image data are key elements that determine the reliability of remote sensed image data processing systems, and much research has been conducted to improve their accuracy. Traditionally, remote sensed image data processing systems have worked with two or three bands selected from the available multiple bands, with selection criteria based on statistical separability or wavelength properties. However, as sensing environments shift from multispectral to hyperspectral, the need has arisen for a band selection method driven by data distribution characteristics rather than by wavelength properties or statistical separability. In this paper, a band feature extraction method using rough set theory is proposed for efficient data classification in a multispectral band environment. First, we build a look-up table from training data and analyze the properties of the experimental multispectral image data; we then select the effective bands from the analysis results using the indiscernibility relation of rough set theory. The proposed method is applied to LANDSAT TM data acquired on 2 June 1992. The results show clustering trends similar to traditional band selection based on wavelength properties, verifying that the proposed data-driven method can be used to select effective bands even as sensing environments move to hyperspectral bands.
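
A toy illustration of the indiscernibility test behind the proposed band selection: a band subset is adequate if pixels that agree on those bands never disagree on their class. The quantized look-up table below is hypothetical, not the LANDSAT TM experiment.

```python
from collections import defaultdict

# Quantized training pixels: (band values, class label).
table = [((3, 1, 2), "water"), ((3, 1, 5), "forest"),
         ((2, 1, 2), "water"), ((2, 1, 5), "forest")]

def consistent(band_idx):
    """True if the chosen bands discern every pair of differently-labelled pixels."""
    groups = defaultdict(set)
    for bands, label in table:
        groups[tuple(bands[i] for i in band_idx)].add(label)
    return all(len(labels) == 1 for labels in groups.values())

print(consistent((0,)))   # band 0 alone mixes the classes -> False
print(consistent((2,)))   # band 2 alone already separates them -> True
```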

Integration rough set theory and case-base reasoning for the corporate credit evaluation (러프집합이론과 사례기반추론을 결합한 기업신용평가 모형)

  • Roh, Tae-Hyup;Yoo Myung-Hwan;Han In-Goo
    • The Journal of Information Systems / v.14 no.1 / pp.41-65 / 2005
  • Credit rating is a significant area of financial management that is of major interest to practitioners and to financial and credit analysts. The components of credit rating are identified, and decision models are developed to assess credit ratings and the corresponding creditworthiness of firms as accurately as possible. Although many early studies apply individual techniques, it is difficult to know a priori which of these techniques will be most effective for a specific classification problem. Recently, a number of studies have demonstrated that hybrid models integrating artificial intelligence approaches with feature selection algorithms can be alternative methodologies for business classification problems. In this article, we propose a hybrid approach using rough set theory as an alternative methodology for selecting appropriate attributes for case-based reasoning. The model uses rough set theory to extract knowledge that can guide effective retrieval of useful cases. Our specific interest lies in the stable combination of rough set theory and case-based reasoning for the problem of corporate credit rating. In addition, we summarize the background of applying the integrated model in the field of corporate credit rating, with a brief description of various credit rating methodologies.
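
A rough sketch of the hybrid idea under stated assumptions: a rough-set-style reduct fixes the attributes used for case retrieval, and the nearest stored case supplies the rating. The attributes, cases, and distance measure are illustrative only, not the paper's model.

```python
cases = [
    {"debt_ratio": 0.8, "roa": 0.01, "size": 2, "rating": "B"},
    {"debt_ratio": 0.3, "roa": 0.07, "size": 3, "rating": "A"},
    {"debt_ratio": 0.6, "roa": 0.03, "size": 1, "rating": "BB"},
]
reduct = ["debt_ratio", "roa"]          # attributes kept after rough-set reduction

def retrieve(query):
    """Return the rating of the closest case on the reduced attribute set."""
    def dist(case):
        return sum((case[a] - query[a]) ** 2 for a in reduct)
    return min(cases, key=dist)["rating"]

print(retrieve({"debt_ratio": 0.35, "roa": 0.06}))   # -> "A"
```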

A Semantic-Based Feature Expansion Approach for Improving the Effectiveness of Text Categorization by Using WordNet (문서범주화 성능 향상을 위한 의미기반 자질확장에 관한 연구)

  • Chung, Eun-Kyung
    • Journal of the Korean Society for Information Management / v.26 no.3 / pp.261-278 / 2009
  • Identifying optimal feature sets in text categorization (TC) is crucial for improving its effectiveness. In this study, experiments on feature expansion were conducted using author-provided keyword sets and article titles from typical scientific journal articles. The tool used for expanding feature sets is WordNet, a lexical database for English words. Given this data set and lexical tool, the study shows that feature expansion based on synonym relationships is significantly effective in improving TC results. The experimental results indicate that when feature sets are expanded with synonyms, the effectiveness of TC improves considerably regardless of word sense disambiguation.
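
A minimal sketch of synonym-based feature expansion with WordNet via NLTK (assumes the wordnet corpus has been downloaded with nltk.download('wordnet')); the sample terms are hypothetical, not the study's journal data.

```python
from nltk.corpus import wordnet as wn

def expand(features):
    """Add WordNet synonyms of each feature term (no word sense disambiguation)."""
    expanded = set(features)
    for term in features:
        for synset in wn.synsets(term):
            expanded.update(l.name().replace("_", " ") for l in synset.lemmas())
    return expanded

print(expand({"classification", "retrieval"}))
```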

Document Classification of Small Size Documents Using Extended Relief-F Algorithm (확장된 Relief-F 알고리즘을 이용한 소규모 크기 문서의 자동분류)

  • Park, Heum
    • The KIPS Transactions: Part B / v.16B no.3 / pp.233-238 / 2009
  • This paper presents an approach to the classification of small documents using the instance-based feature filtering algorithm Relief-F. In document classification, performance is often poor for small documents that contain only a few features: the total number of features in the document set is large while the feature count of each document is relatively small, so similarities between documents are very low when general similarity measures and classifiers are used. In particular, classification performance suffers for web documents in directory services and for sectors that cannot be linked to their original files after hard-disk recovery. We therefore propose the Extended Relief-F (ERelief-F) algorithm, which applies the instance-based feature filtering of Relief-F as a classification pre-processing step and addresses the problems of Relief-F. For performance comparison, we tested information gain, odds ratio, and Relief-F for feature filtering and used kNN and SVM classifiers. In the experiments, the ERelief-F algorithm performed best on all of the data sets and removed many irrelevant features from the document sets.
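
For orientation, a compact sketch of basic binary-class Relief weighting, the instance-based filtering that ERelief-F extends; it is not the paper's extended algorithm, and the toy data are assumptions.

```python
import numpy as np

def relief(X, y, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))
        same, other = X[y == y[i]], X[y != y[i]]
        hit = same[np.argsort(((same - X[i]) ** 2).sum(axis=1))[1]]    # nearest hit, skipping the sample itself
        miss = other[np.argsort(((other - X[i]) ** 2).sum(axis=1))[0]]  # nearest miss
        w += np.abs(X[i] - miss) - np.abs(X[i] - hit)   # reward features that separate the classes
    return w / n_iter

X = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.1], [0.8, 0.2]])
y = np.array([0, 0, 1, 1])
print(relief(X, y))   # larger weight -> more relevant feature
```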

Compression efficiency improvement on JPEG2000 still image coding using improved Set Partitioning Sorting Algorithm (분할 정렬 알고리즘의 개선을 통한 JPEG2000 정지영상 부호화에서의 압축 효율 개선)

  • Ju Dong-hyun;Kim Doo-young
    • Journal of the Korea Institute of Information and Communication Engineering / v.9 no.5 / pp.1025-1030 / 2005
  • With the increasing use of multimedia technologies, image compression requires higher performance as well as new functionality. In the specific area of still image encoding, a new standard, JPEG2000, was developed. This paper proposes an improved set partitioning sorting algorithm that optimizes the selection of the threshold based on the characteristics of the wavelet transform coefficients and removes the sign bit in the LL area of JPEG2000. Experimental results show that the proposed algorithm achieves an improved bit rate.
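
As context, set-partitioning coders (e.g. SPIHT) typically derive the initial significance threshold from the largest wavelet coefficient magnitude; the sketch below shows only that threshold-selection step with toy coefficients, not the paper's modified JPEG2000 coder.

```python
import numpy as np

coeffs = np.array([[34.0, -5.2], [12.1, 0.7]])       # toy wavelet subband coefficients

n = int(np.floor(np.log2(np.max(np.abs(coeffs)))))   # largest significant bit-plane
threshold = 2 ** n                                   # initial threshold T = 2^n
significant = np.abs(coeffs) >= threshold            # coefficients coded in this pass
print(n, threshold, significant)
```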

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.20 no.1 / pp.1-10 / 2019
  • Since big-data text mining extracts many features from large amounts of data, clustering and classification can suffer from high computational complexity and low reliability of the analysis results. In particular, a term-document matrix obtained through text mining represents term-document features but produces a sparse matrix. We designed an advanced genetic algorithm (GA) to extract features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect document-term relationships in feature extraction, and a predetermined number of features is selected through an iterative process. We also use a sparsity score to improve the performance of the detection model: if a spam mail data set has high sparsity, the detection model performs poorly and it is difficult to find an optimal detection model. In addition, we find a low-sparsity model that also has a high TF-IDF score by using s(F) in the numerator of the fitness function. We verified the performance of the proposed algorithm by applying it to text classification and found that it shows higher performance (speed and accuracy) in attack mail classification.
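
A hedged sketch of a GA fitness function in the spirit described above, rewarding feature subsets with high total TF-IDF and low sparsity; the exact form of the paper's fitness and its s(F) term is not reproduced, and the toy matrix is an assumption.

```python
import numpy as np

def fitness(mask, tfidf):
    """mask: boolean vector over candidate terms; tfidf: documents x terms matrix."""
    sub = tfidf[:, mask]
    if sub.size == 0:
        return 0.0
    sparsity = float(np.mean(sub == 0))          # fraction of zero entries, a simple s(F)
    return float(sub.sum()) * (1.0 - sparsity)   # reward high TF-IDF mass and low sparsity

tfidf = np.array([[0.0, 0.4, 0.1],
                  [0.3, 0.0, 0.2],
                  [0.0, 0.5, 0.0]])
mask = np.array([False, True, True])   # a candidate chromosome: keep terms 1 and 2
print(fitness(mask, tfidf))
```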

Training Sample and Feature Selection Methods for Pseudo Sample Neural Networks (의사 샘플 신경망에서 학습 샘플 및 특징 선택 기법)

  • Heo, Gyeongyong;Park, Choong-Shik;Lee, Chang-Woo
    • Journal of the Korea Society of Computer and Information / v.18 no.4 / pp.19-26 / 2013
  • The pseudo sample neural network (PSNN) is a variant of the traditional neural network that uses pseudo samples to mitigate the local-optima convergence problem when the number of training samples is small. PSNN can take advantage of a smoothed solution space through the use of pseudo samples. PSNN focuses on the quantity problem in training, whereas this paper presents methods that stress the quality of training samples to further improve the performance of PSNN. It is evident that typical samples and highly correlated features help in training. In this paper, therefore, kernel density estimation is used to select typical samples, and a correlation factor is introduced to select features, which can improve the performance of PSNN. A debris flow data set is used to demonstrate the usefulness of the proposed methods.
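
An illustrative sketch (with synthetic data, not the debris-flow set) of the two selection ideas: kernel density estimation to pick typical training samples and a correlation factor to pick target-related features, assuming NumPy and SciPy.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                     # 30 samples, 3 features
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=30)

# Typical samples: highest estimated density under a Gaussian KDE.
density = gaussian_kde(X.T)(X.T)                 # scipy expects shape (features, samples)
typical_idx = density.argsort()[::-1][:10]       # the 10 most typical samples

# Feature selection: keep the features most correlated with the target.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
selected = corr.argsort()[::-1][:2]              # the 2 most correlated features

X_reduced = X[np.ix_(typical_idx, selected)]     # reduced training set for PSNN
print(typical_idx, selected, X_reduced.shape)
```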