• Title/Summary/Keyword: Data Preprocessing

An Efficient Method for Detecting Denial of Service Attacks Using Kernel Based Data (커널 기반 데이터를 이용한 효율적인 서비스 거부 공격 탐지 방법에 관한 연구)

  • Chung, Man-Hyun;Cho, Jae-Ik;Chae, Soo-Young;Moon, Jong-Sub
    • Journal of the Korea Institute of Information Security & Cryptology / v.19 no.1 / pp.71-79 / 2009
  • Much current research addresses host-based intrusion detection using system calls, which are a portion of kernel-based data. This research mostly relies on sequence-based and frequency-based preprocessing methods. Because of the large volume of data and the number of system call types, preprocessing takes a significant amount of time, which makes real-time intrusion detection systems difficult to implement. Given this constraint, the frequency-based method, which requires relatively little preprocessing time, is usually used. This paper proposes an effective method for detecting denial of service attacks using the frequency-based method. Principal Component Analysis (PCA) is used to select the principal system calls, a Bayesian network is constructed, and a Bayesian classifier performs the classification.
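The frequency-based preprocessing mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the traces, the system call names, and the `frequency_vector` helper are all hypothetical, and the PCA and Bayesian classification stages are not shown.

```python
from collections import Counter


def frequency_vector(trace, vocabulary):
    """Map a system-call trace to relative frequencies over a fixed vocabulary.

    Calls that are not in the vocabulary still count toward the trace length,
    so the vector sums to 1 only when the vocabulary covers every call.
    """
    total = len(trace)
    if total == 0:
        return [0.0] * len(vocabulary)
    counts = Counter(trace)
    return [counts[name] / total for name in vocabulary]


# Hypothetical traces, for illustration only.
normal = ["open", "read", "read", "close"]
flood = ["socket", "accept", "accept", "accept", "accept", "close"]

# A fixed, ordered vocabulary makes the vectors comparable across traces.
vocabulary = sorted(set(normal) | set(flood))
normal_vec = frequency_vector(normal, vocabulary)
flood_vec = frequency_vector(flood, vocabulary)
```

Unlike a sequence-based representation, each trace is reduced to one fixed-length vector in a single pass, which is why this form of preprocessing is comparatively cheap.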

Designing Summary Tables for Mining Web Log Data

  • Ahn, Jeong-Yong
    • Journal of the Korean Data and Information Science Society / v.16 no.1 / pp.157-163 / 2005
  • On the Web, data is generally gathered automatically by Web servers and collected in server or access logs. However, as users access larger and larger amounts of data, query response times for extracting information inevitably grow. One way to resolve this issue is to use summary tables. In this short note, we design a prototype of summary tables that can efficiently extract information from Web log data. We also compare the performance of the summary tables with that of a sampling technique and of a method that uses the raw data.
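The idea of a summary table can be sketched with Python's built-in `sqlite3` module. The schema, the table names `access_log` and `daily_url_summary`, and the sample rows are assumptions made for illustration; the note's actual table design is not given in the abstract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw access log: one row per request.
conn.execute("CREATE TABLE access_log (day TEXT, url TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?, ?)",
    [
        ("2005-01-01", "/index.html", 1200),
        ("2005-01-01", "/index.html", 1200),
        ("2005-01-01", "/about.html", 800),
        ("2005-01-02", "/index.html", 1200),
    ],
)

# Aggregate once into a summary table: one row per (day, url) pair.
conn.execute(
    """
    CREATE TABLE daily_url_summary AS
    SELECT day, url, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM access_log
    GROUP BY day, url
    """
)


def hits_for(url):
    """Answer a hit-count query from the summary table instead of the raw log."""
    row = conn.execute(
        "SELECT SUM(hits) FROM daily_url_summary WHERE url = ?", (url,)
    ).fetchone()
    return row[0] or 0
```

Queries then scan one row per day and URL rather than one row per request, at the cost of keeping the summary in step with the raw log.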

Hybrid Learning Architectures for Advanced Data Mining: An Application to Binary Classification for Fraud Management (개선된 데이터마이닝을 위한 혼합 학습구조의 제시)

  • Kim, Steven H.;Shin, Sung-Woo
    • Journal of Information Technology Application / v.1 / pp.173-211 / 1999
  • The task of classification permeates all walks of life, from business and economics to science and public policy. In this context, nonlinear techniques from artificial intelligence have often proven to be more effective than the methods of classical statistics. The objective of knowledge discovery and data mining is to support decision making through the effective use of information. The automated approach to knowledge discovery is especially useful when dealing with large data sets or complex relationships. For many applications, automated software may find subtle patterns which escape the notice of manual analysis, or whose complexity exceeds the cognitive capabilities of humans. This paper explores the utility of a collaborative learning approach involving integrated models in the preprocessing and postprocessing stages. For instance, a genetic algorithm effects feature-weight optimization in a preprocessing module. Moreover, an inductive tree, artificial neural network (ANN), and k-nearest neighbor (kNN) techniques serve as postprocessing modules. More specifically, the postprocessors act as second-order classifiers which determine the best first-order classifier on a case-by-case basis. In addition to the second-order models, a voting scheme is investigated as a simple, but efficient, postprocessing model. The first-order models consist of statistical and machine learning models such as logistic regression (logit), multivariate discriminant analysis (MDA), ANN, and kNN. The genetic algorithm, inductive decision tree, and voting scheme act as kernel modules for collaborative learning. These ideas are explored against the background of a practical application relating to financial fraud management which exemplifies a binary classification problem.
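The voting scheme named in this abstract can be illustrated with a plain majority vote over the first-order classifiers' labels. This is only a sketch: the abstract does not specify the paper's voting rule, so the `majority_vote` helper, its tie-breaking `tie_label` parameter, and the sample predictions are assumptions.

```python
from collections import Counter


def majority_vote(predictions, tie_label=0):
    """Combine per-classifier labels into one label by simple majority.

    `predictions` maps a classifier name to its predicted label.
    When the top two labels receive the same number of votes, the
    hypothetical `tie_label` is returned.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical outputs of four first-order models for one case (1 = fraud).
case = {"logit": 1, "mda": 1, "ann": 0, "knn": 1}
decision = majority_vote(case)
```

With an even number of voters, as here, ties are possible in binary classification, which is why the sketch makes the tie rule explicit.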

A Comparative Analysis of the Pre-Processing in the Kaggle Titanic Competition

  • Tai-Sung, Hur;Suyoung, Bang
    • Journal of the Korea Society of Computer and Information / v.28 no.3 / pp.17-24 / 2023
  • Using 'Titanic - Machine Learning from Disaster', a representative competition on Kaggle, a platform that poses data science challenges for participants to solve, we examine how data preprocessing and model construction affect prediction accuracy and score. We select seven top-ranked, high-scoring solutions, excluding those that use redundant models or ensemble techniques, and compare and analyze their features. We confirmed that most of the preprocessing has unique and differentiated characteristics, and that even where the preprocessing was almost the same, scores differed depending on the type of model. The comparative analysis in this paper is expected to help Kaggle participants and data science beginners understand the characteristics and analysis flow of the preprocessing methods used by top-scoring participants.

Special Quantum Steganalysis Algorithm for Quantum Secure Communications Based on Quantum Discriminator

  • Xinzhu Liu;Zhiguo Qu;Xiubo Chen;Xiaojun Wang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.6 / pp.1674-1688 / 2023
  • The remarkable advancement of quantum steganography offers enhanced security for quantum communications. However, there is a significant concern regarding the potential misuse of this technology. Moreover, the current research on identifying malicious quantum steganography is insufficient. To address this gap in steganalysis research, this paper proposes a specialized quantum steganalysis algorithm. This algorithm utilizes quantum machine learning techniques to detect steganography in general quantum secure communication schemes that are based on pure states. The algorithm presented in this paper consists of two main steps: data preprocessing and automatic discrimination. The data preprocessing step involves extracting and amplifying abnormal signals, followed by the automatic detection of suspicious quantum carriers through training on steganographic and non-steganographic data. The numerical results demonstrate that a larger disparity between the probability distributions of steganographic and non-steganographic data leads to a higher steganographic detection indicator, making the presence of steganography easier to detect. By selecting an appropriate threshold value, the steganography detection rate can exceed 90%.

Predicting idiopathic pulmonary fibrosis (IPF) disease in patients using machine approaches

  • Ali, Sikandar;Hussain, Ali;Kim, Hee-Cheol
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.144-146 / 2021
  • Idiopathic pulmonary fibrosis (IPF) is one of the most dreadful lung diseases, and it affects lung performance unpredictably. No authoritative natural history of the disease has been established, and it has been very difficult for physicians to diagnose. With the advent of artificial intelligence and its related technologies, this task has become somewhat easier. The aim of this paper is to develop and explore machine learning models for the prediction and diagnosis of this mysterious disease. For our study, we obtained an IPF dataset from Haeundae Paik Hospital covering 2425 patients and 502 features. We applied different data preprocessing techniques to clean the data and make it fit for machine learning. After preprocessing, 18 features were selected for the experiment. We used different machine learning classifiers, i.e., multilayer perceptron (MLP), support vector machine (SVM), and random forest (RF), and compared the performance of each. The experimental results showed that MLP outperformed the other models, with 91.24% accuracy.

Alzheimer progression classification using fMRI data (fMRI 데이터를 이용한 알츠하이머 진행상태 분류)

  • Ju Hyeon-Noh;Hee-Deok Yang
    • Smart Media Journal / v.13 no.4 / pp.86-93 / 2024
  • The development of functional magnetic resonance imaging (fMRI) has significantly contributed to mapping brain functions and understanding brain networks during rest. This paper proposes a CNN-LSTM-based classification model to classify the progression stages of Alzheimer's disease. Firstly, four preprocessing steps are performed to remove noise from the fMRI data before feature extraction. Secondly, the U-Net architecture is utilized to extract spatial features once preprocessing is completed. Thirdly, the extracted spatial features undergo LSTM processing to extract temporal features, ultimately leading to classification. Experiments were conducted by adjusting the temporal dimension of the data. Using 5-fold cross-validation, an average accuracy of 96.4% was achieved, indicating that the proposed method has high potential for identifying the progression of Alzheimer's disease by analyzing fMRI data.

Development of an intelligent IIoT platform for stable data collection (안정적 데이터 수집을 위한 지능형 IIoT 플랫폼 개발)

  • Woojin Cho;Hyungah Lee;Dongju Kim;Jae-hoi Gu
    • The Journal of the Convergence on Culture Technology / v.10 no.4 / pp.687-692 / 2024
  • The energy crisis is emerging as a serious problem around the world. In Korea, there is great interest in energy efficiency research on industrial complexes, which use more than 53% of the country's total energy and account for more than 45% of its greenhouse gas emissions. One line of this research concerns the virtual energy network plant, in which factories in an industrial complex that use the same utility save energy by sharing facilities and by trading between energy-producing and energy-consuming factories. In such energy-saving research, data collection is very important because the data has various uses, such as analysis and prediction. However, existing systems had several shortcomings in reliably collecting time series data. In this study, we propose an intelligent IIoT platform to address these shortcomings. The platform includes a preprocessing system that identifies abnormal data and processes it in a timely manner, classifies abnormal and missing data, and applies interpolation techniques to maintain stable time series data. Additionally, time series data collection is streamlined through database optimization. This paper contributes to increasing data usability in industrial environments through stable data collection and rapid problem response, and to reducing the burden of data collection and optimizing the monitoring load by introducing a variety of chatbot notification systems.
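Interpolation over missing readings, as this abstract describes, can be sketched with linear interpolation. The abstract does not say which interpolation techniques the platform uses, so the choice of linear interpolation, the `interpolate_missing` helper, and the sample readings are assumptions.

```python
def interpolate_missing(values):
    """Fill None gaps in an evenly sampled series by linear interpolation.

    Interior gaps are interpolated between the nearest known neighbours;
    leading and trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no known values to interpolate from")

    filled = list(values)
    first, last = known[0], known[-1]
    for i in range(first):
        filled[i] = values[first]
    for i in range(last + 1, len(values)):
        filled[i] = values[last]

    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            # Weighted average of the two neighbours by distance.
            filled[i] = (
                values[left] * (right - i) + values[right] * (i - left)
            ) / span
    return filled


# Hypothetical sensor readings with missing samples marked as None.
readings = [10.0, None, None, 16.0, None]
stable = interpolate_missing(readings)
```

Holding the edge values constant avoids extrapolating a trend from a single neighbour, which matters when a sensor drops out at the start or end of a window.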

From The Discovery Challenge on Thrombosis Data

  • Takabayashi, Katsuhiko;Tsumoto, Shusaku
    • Proceedings of the Korea Intelligent Information System Society Conference / 2001.01a / pp.361-363 / 2001
  • Although data mining promises a new paradigm for discovering medical knowledge from a database, many problems must be solved before real application is feasible. We had the chance to provide a data set to be analyzed as a discovery challenge, using various data mining techniques, at the PKDD conference. As data providers, we evaluated and discussed the results and clarified the problems.

An Algorithm for Baseline Correction of SELDI/MALDI Mass Spectrometry Data

  • Lee, Kyeong-Eun
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1289-1297 / 2006
  • Before any other statistical analysis, preprocessing steps must be performed adequately to obtain meaningful results. These steps include baseline correction, normalization, denoising, and multiple alignment. This paper proposes a baseline correction algorithm for SELDI or MALDI mass spectrometry data that, after denoising, applies piecewise cubic Hermite interpolation to block-selected points and local minima.
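Baseline correction by piecewise cubic Hermite interpolation can be sketched as below. This is an illustration under stated assumptions, not the paper's algorithm: the anchors are simply the minimum of each fixed-size block, the denoising step is omitted, the end slopes use a simplified one-sided estimate, and all function names and the sample spectrum are hypothetical.

```python
from bisect import bisect_right


def block_minima(signal, block_size):
    """Pick the (index, value) of the minimum in each consecutive block."""
    anchors = []
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        offset = min(range(len(block)), key=block.__getitem__)
        anchors.append((start + offset, block[offset]))
    return anchors


def pchip_slopes(xs, ys):
    """Shape-preserving slopes: weighted harmonic mean of neighbouring
    secants, and zero wherever the secants change sign."""
    secants = [(ys[k + 1] - ys[k]) / (xs[k + 1] - xs[k]) for k in range(len(xs) - 1)]
    slopes = [secants[0]]  # simplified one-sided end slope
    for k in range(1, len(xs) - 1):
        left, right = secants[k - 1], secants[k]
        if left * right <= 0:
            slopes.append(0.0)
        else:
            h_left, h_right = xs[k] - xs[k - 1], xs[k + 1] - xs[k]
            w1, w2 = 2 * h_right + h_left, h_right + 2 * h_left
            slopes.append((w1 + w2) / (w1 / left + w2 / right))
    slopes.append(secants[-1])
    return slopes


def hermite_eval(xs, ys, slopes, x):
    """Evaluate the cubic Hermite interpolant, clamped to the anchor range."""
    x = min(max(x, xs[0]), xs[-1])
    k = min(bisect_right(xs, x) - 1, len(xs) - 2)
    h = xs[k + 1] - xs[k]
    t = (x - xs[k]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * ys[k] + h10 * h * slopes[k] + h01 * ys[k + 1] + h11 * h * slopes[k + 1]


def correct_baseline(signal, block_size):
    """Subtract the interpolated baseline from the signal."""
    anchors = block_minima(signal, block_size)
    if len(anchors) < 2:
        raise ValueError("need at least two blocks to interpolate a baseline")
    xs = [i for i, _ in anchors]
    ys = [v for _, v in anchors]
    slopes = pchip_slopes(xs, ys)
    return [s - hermite_eval(xs, ys, slopes, i) for i, s in enumerate(signal)]


# Hypothetical spectrum: flat baseline of 5.0 with two peaks.
spectrum = [5.0, 5.0, 9.0, 5.0, 5.0, 5.0, 5.0, 12.0, 5.0, 5.0]
corrected = correct_baseline(spectrum, block_size=5)
```

Setting the slope to zero where neighbouring secants change sign keeps the interpolated baseline from overshooting between anchors, which is the usual reason for preferring this interpolant over an ordinary cubic spline for baselines.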
