• Title/Summary/Keyword: Data preprocessing technique

Search Result 170, Processing Time 0.028 seconds

Dimensionality reduction for pattern recognition based on difference of distribution among classes

  • Nishimura, Masaomi;Hiraoka, Kazuyuki;Mishima, Taketoshi
    • Proceedings of the IEEK Conference
    • /
    • 2002.07c
    • /
    • pp.1670-1673
    • /
    • 2002
  • For pattern recognition on high-dimensional data, such as images, the dimensionality reduction as a preprocessing is effective. By dimensionality reduction, we can (1) reduce storage capacity or amount of calculation, and (2) avoid "the curse of dimensionality" and improve classification performance. Popular tools for dimensionality reduction are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Independent Component Analysis (ICA) recently. Among them, only LDA takes the class labels into consideration. Nevertheless, it, has been reported that, the classification performance with ICA is better than that with LDA because LDA has restriction on the number of dimensions after reduction. To overcome this dilemma, we propose a new dimensionality reduction technique based on an information theoretic measure for difference of distribution. It takes the class labels into consideration and still it does not, have restriction on number of dimensions after reduction. Improvement of classification performance has been confirmed experimentally.

  • PDF

Feature Extraction of Non-proliferative Diabetic Retinopathy Using Faster R-CNN and Automatic Severity Classification System Using Random Forest Method

  • Jung, Younghoon;Kim, Daewon
    • Journal of Information Processing Systems
    • /
    • v.18 no.5
    • /
    • pp.599-613
    • /
    • 2022
  • Non-proliferative diabetic retinopathy is a representative complication of diabetic patients and is known to be a major cause of impaired vision and blindness. There has been ongoing research on automatic detection of diabetic retinopathy, however, there is also a growing need for research on an automatic severity classification system. This study proposes an automatic detection system for pathological symptoms of diabetic retinopathy such as microaneurysms, retinal hemorrhage, and hard exudate by applying the Faster R-CNN technique. An automatic severity classification system was devised by training and testing a Random Forest classifier based on the data obtained through preprocessing of detected features. An experiment of classifying 228 test fundus images with the proposed classification system showed 97.8% accuracy.

Brain Extraction of MR Images

  • Du, Ruoyu;Lee, Hyo Jong
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2010.04a
    • /
    • pp.455-458
    • /
    • 2010
  • Extracting the brain from magnetic resonance imaging head scans is an essential preprocessing step of which the accuracy greatly affects subsequent image analysis. The currently popular Brain Extraction Tool produces a brain mask which may be too smooth for practical use to reduce the accuracy. This paper presents a novel and indirect brain extraction method based on non-brain tissue segmentation. Based on ITK, the proposed method allows a non-brain contour by using region growing to match with the original image naturally and extract the brain tissue. Experiments on two set of MRI data and 2D brain image in horizontal plane and 3D brain model indicate successful extraction of brain tissue from a head.

Deep Learning Method for Identification and Selection of Relevant Features

  • Vejendla Lakshman
    • International Journal of Computer Science & Network Security
    • /
    • v.24 no.5
    • /
    • pp.212-216
    • /
    • 2024
  • Feature Selection have turned into the main point of investigations particularly in bioinformatics where there are numerous applications. Deep learning technique is a useful asset to choose features, anyway not all calculations are on an equivalent balance with regards to selection of relevant features. To be sure, numerous techniques have been proposed to select multiple features using deep learning techniques. Because of the deep learning, neural systems have profited a gigantic top recovery in the previous couple of years. Anyway neural systems are blackbox models and not many endeavors have been made so as to examine the fundamental procedure. In this proposed work a new calculations so as to do feature selection with deep learning systems is introduced. To evaluate our outcomes, we create relapse and grouping issues which enable us to think about every calculation on various fronts: exhibitions, calculation time and limitations. The outcomes acquired are truly encouraging since we figure out how to accomplish our objective by outperforming irregular backwoods exhibitions for each situation. The results prove that the proposed method exhibits better performance than the traditional methods.

Feature Selection Algorithm for Intrusions Detection System using Sequential Forward Search and Random Forest Classifier

  • Lee, Jinlee;Park, Dooho;Lee, Changhoon
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.10
    • /
    • pp.5132-5148
    • /
    • 2017
  • Cyber attacks are evolving commensurate with recent developments in information security technology. Intrusion detection systems collect various types of data from computers and networks to detect security threats and analyze the attack information. The large amount of data examined make the large number of computations and low detection rates problematic. Feature selection is expected to improve the classification performance and provide faster and more cost-effective results. Despite the various feature selection studies conducted for intrusion detection systems, it is difficult to automate feature selection because it is based on the knowledge of security experts. This paper proposes a feature selection technique to overcome the performance problems of intrusion detection systems. Focusing on feature selection, the first phase of the proposed system aims at constructing a feature subset using a sequential forward floating search (SFFS) to downsize the dimension of the variables. The second phase constructs a classification model with the selected feature subset using a random forest classifier (RFC) and evaluates the classification accuracy. Experiments were conducted with the NSL-KDD dataset using SFFS-RF, and the results indicated that feature selection techniques are a necessary preprocessing step to improve the overall system performance in systems that handle large datasets. They also verified that SFFS-RF could be used for data classification. In conclusion, SFFS-RF could be the key to improving the classification model performance in machine learning.

Forecasting Day-ahead Electricity Price Using a Hybrid Improved Approach

  • Hu, Jian-Ming;Wang, Jian-Zhou
    • Journal of Electrical Engineering and Technology
    • /
    • v.12 no.6
    • /
    • pp.2166-2176
    • /
    • 2017
  • Electricity price prediction plays a crucial part in making the schedule and managing the risk to the competitive electricity market participants. However, it is a difficult and challenging task owing to the characteristics of the nonlinearity, non-stationarity and uncertainty of the price series. This study proposes a hybrid improved strategy which incorporates data preprocessor components and a forecasting engine component to enhance the forecasting accuracy of the electricity price. In the developed forecasting procedure, the Seasonal Adjustment (SA) method and the Ensemble Empirical Mode Decomposition (EEMD) technique are synthesized as the data preprocessing component; the Coupled Simulated Annealing (CSA) optimization method and the Least Square Support Vector Regression (LSSVR) algorithm construct the prediction engine. The proposed hybrid approach is verified with electricity price data sampled from the power market of New South Wales in Australia. The simulation outcome manifests that the proposed hybrid approach obtains the observable improvement in the forecasting accuracy compared with other approaches, which suggests that the proposed combinational approach occupies preferable predication ability and enough precision.

An Outlier Detection Algorithm and Data Integration Technique for Prediction of Hypertension (고혈압 예측을 위한 이상치 탐지 알고리즘 및 데이터 통합 기법)

  • Khongorzul Dashdondov;Mi-Hye Kim;Mi-Hwa Song
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.05a
    • /
    • pp.417-419
    • /
    • 2023
  • Hypertension is one of the leading causes of mortality worldwide. In recent years, the incidence of hypertension has increased dramatically, not only among the elderly but also among young people. In this regard, the use of machine-learning methods to diagnose the causes of hypertension has increased in recent years. In this study, we improved the prediction of hypertension detection using Mahalanobis distance-based multivariate outlier removal using the KNHANES database from the Korean national health data and the COVID-19 dataset from Kaggle. This study was divided into two modules. Initially, the data preprocessing step used merged datasets and decision-tree classifier-based feature selection. The next module applies a predictive analysis step to remove multivariate outliers using the Mahalanobis distance from the experimental dataset and makes a prediction of hypertension. In this study, we compared the accuracy of each classification model. The best results showed that the proposed MAH_RF algorithm had an accuracy of 82.66%. The proposed method can be used not only for hypertension but also for the detection of various diseases such as stroke and cardiovascular disease.

Machine learning application to seismic site classification prediction model using Horizontal-to-Vertical Spectral Ratio (HVSR) of strong-ground motions

  • Francis G. Phi;Bumsu Cho;Jungeun Kim;Hyungik Cho;Yun Wook Choo;Dookie Kim;Inhi Kim
    • Geomechanics and Engineering
    • /
    • v.37 no.6
    • /
    • pp.539-554
    • /
    • 2024
  • This study explores development of prediction model for seismic site classification through the integration of machine learning techniques with horizontal-to-vertical spectral ratio (HVSR) methodologies. To improve model accuracy, the research employs outlier detection methods and, synthetic minority over-sampling technique (SMOTE) for data balance, and evaluates using seven machine learning models using seismic data from KiK-net. Notably, light gradient boosting method (LGBM), gradient boosting, and decision tree models exhibit improved performance when coupled with SMOTE, while Multiple linear regression (MLR) and Support vector machine (SVM) models show reduced efficacy. Outlier detection techniques significantly enhance accuracy, particularly for LGBM, gradient boosting, and voting boosting. The ensemble of LGBM with the isolation forest and SMOTE achieves the highest accuracy of 0.91, with LGBM and local outlier factor yielding the highest F1-score of 0.79. Consistently outperforming other models, LGBM proves most efficient for seismic site classification when supported by appropriate preprocessing procedures. These findings show the significance of outlier detection and data balancing for precise seismic soil classification prediction, offering insights and highlighting the potential of machine learning in optimizing site classification accuracy.

A Query Pruning Technique for Optimizing Regular Path Expressions in Semistructured Databases (준구조적 데이타베이스에서의 정규경로표현 최적화를 위한 질의전지 기법)

  • Park, Chang-Won;Jeong, Jin-Wan
    • Journal of KIISE:Databases
    • /
    • v.29 no.3
    • /
    • pp.217-229
    • /
    • 2002
  • Regular path expressions are primary elements for formulating queries over the semistructured data that does not assume the conventional schemas. In addition, the query pruning is an important optimization technique to avoid useless traversals in evaluating regular path expressions. However, the existing query pruning often fails to fully optimize multiple regular path expressions, and the previous methods that post-process the result of the existing query pruning must check exponential combinations of sub-results. In this paper, we present a new query pruning technique that consists of the preprocessing phase and the pruning phase. Our two-phase query pruning is affective in optimizing multiple regular path expressions, and is more scalable than the previous methods in that it never check the exponential combinations of sub-results.

Net Analyte Signal-based Quantitative Determination of Fusel Oil in Korean Alcoholic Beverage Using FT-NIR Spectroscopy

  • Lohumi, Santosh;Kandpal, Lalit Mohan;Seo, Young Wook;Cho, Byoung Kwan
    • Journal of Biosystems Engineering
    • /
    • v.41 no.3
    • /
    • pp.208-220
    • /
    • 2016
  • Purpose: Fusel oil is a potent volatile aroma compound found in many alcoholic beverages. At low concentrations, it makes an essential contribution to the flavor and aroma of fermented alcoholic beverages, while at high concentrations, it induced an off-flavor and is thought to cause undesirable side effects. In this work, we introduce Fourier transform near-infrared (FT-NIR) spectroscopy as a rapid and nondestructive technique for the quantitative determination of fusel oil in the Korean alcoholic beverage "soju". Methods: FT-NIR transmittance spectra in the 1000-2500 nm region were collected for 120 soju samples with fusel oil concentrations ranging from 0 to 1400 ppm. The calibration and validation data sets were designed using data from 75 and 45 samples, respectively. The net analyte signal (NAS) was used as a preprocessing method before the application of the partial least-square regression (PLSR) and principal component regression (PCR) methods for predicting fusel oil concentration. A novel variable selection method was adopted to determine the most informative spectral variables to minimize the effect of nonmodeled interferences. Finally, the efficiency of the developed technique was evaluated with two different validation sets. Results: The results revealed that the NAS-PLSR model with selected variables ($R^2_{\upsilon}=0.95$, RMSEV = 100ppm) did not outperform the NAS-PCR model (($R^2_{\upsilon}=0.97$, RMSEV = 7 8.9ppm). In addition, the NAS-PCR shows a better recovery for validation set 2 and a lower relative error for validation set 3 than the NAS-PLSR model. Conclusion: The experimental results indicate that the proposed technique could be an alternative to conventional methods for the quantitative determination of fusel oil in alcoholic beverages and has the potential for use in in-line process control.