• Title/Summary/Keyword: Data Preprocessing

Preprocessing Methods and Analysis of Grid Size for Watershed Extraction (유역경계 추출을 위한 DEM별 전처리 방법과 격자크기 분석)

  • Kim, Dong-Moon
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.26 no.1 / pp.41-50 / 2008
  • Recent progress in geospatial information technologies such as digital mapping, LiDAR (Light Detection And Ranging), and high-resolution satellite imagery provides various data sources for Digital Elevation Models (DEMs). DEMs are a major source for extracting the hydrological terrain properties needed for efficient watershed management. In particular, watersheds extracted from DEMs form an important geospatial database for identifying the physical boundaries used in water resource management planning, including water environment surveys, pollutant investigations, waste load and pollution load allocation estimation, and water quality modeling. Most previous studies on watershed extraction from DEMs have focused on hydrological element analysis and preprocessing without considering the grid size of the DEMs. This study analyzes the accuracy of watersheds extracted from DEMs of various grid sizes generated from LiDAR data and digital maps, and identifies appropriate preprocessing methods.

AutoFe-Sel: A Meta-learning based methodology for Recommending Feature Subset Selection Algorithms

  • Irfan Khan;Xianchao Zhang;Ramesh Kumar Ayyasam;Rahman Ali
    • KSII Transactions on Internet and Information Systems (TIIS) / v.17 no.7 / pp.1773-1793 / 2023
  • Automated machine learning, often referred to as "AutoML," is the process of automating the time-consuming and iterative procedures involved in building machine learning models. There have been significant contributions in this area across several stages of a data-mining task, including model selection, hyper-parameter optimization, and preprocessing method selection. Among them, preprocessing method selection is a relatively new and fast-growing research area. The current work focuses on recommending one class of preprocessing methods, namely feature subset selection (FSS) algorithms. One limitation of existing studies on FSS algorithm recommendation is the use of a single learner for meta-modeling, which restricts the capability of the meta-model. Moreover, meta-modeling in existing studies is typically based on a single group of data characterization measures (DCMs), even though several complementary DCM groups exist whose combination can exploit their diversity and improve meta-modeling. This study addresses these limitations by proposing AutoFE-Sel, an architecture for preprocessing method selection that uses ensemble learning for meta-modeling. To evaluate the proposed method, we performed an extensive experimental evaluation involving 8 FSS algorithms, 3 groups of DCMs, and 125 datasets. Results show that the proposed method achieves better performance than three baseline methods. The proposed architecture can also be easily extended to other preprocessing method selection tasks, e.g., noise-filter selection and imbalance handling method selection.

Preprocessing and Calibration of Optical Diffuse Reflectance Signal for Estimation of Soil Physical and Chemical Properties in the Central USA (미국 중부 토양의 이화학적 특성 추정을 위한 광 확산 반사 신호 전처리 및 캘리브레이션)

  • La, Woo-Jung;Sudduth, Kenneth A.;Chung, Sun-Ok;Kim, Hak-Jin
    • Journal of Biosystems Engineering / v.33 no.6 / pp.430-437 / 2008
  • Optical diffuse reflectance sensing in visible and near-infrared wavelength ranges is one approach to rapidly quantify soil properties for site-specific management. The objectives of this study were to investigate effects of preprocessing of reflectance data and determine the accuracy of the reflectance approach for estimating physical and chemical properties of selected Missouri and Illinois, USA surface soils encompassing a wide range of soil types and textures. Diffuse reflectance spectra of air-dried, sieved samples were obtained in the laboratory. Calibrations relating spectra to soil properties determined by standard methods were developed using partial least squares (PLS) regression. The best data preprocessing, consisting of absorbance transformation and mean centering, reduced estimation errors by up to 20% compared to raw reflectance data. Good estimates ($R^2=0.83$ to 0.92) were obtained using spectral data for soil texture fractions, organic matter, and CEC. Estimates of pH, P, and K were not good ($R^2$ < 0.7), and other approaches to estimating these soil chemical properties should be investigated. Overall, the ability of diffuse reflectance spectroscopy to accurately estimate multiple soil properties across a wide range of soils makes it a good candidate technology for providing at least a portion of the data needed in site-specific management of agriculture.
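The two preprocessing steps named in this abstract, absorbance transformation and mean centering, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common log10(1/R) definition of apparent absorbance, the reflectance values are made up, and the function names `to_absorbance` and `mean_center` are hypothetical. The PLS regression step is not shown.

```python
import math

def to_absorbance(reflectance):
    """Convert reflectance values in (0, 1] to apparent absorbance, log10(1/R)."""
    return [[math.log10(1.0 / r) for r in spectrum] for spectrum in reflectance]

def mean_center(spectra):
    """Subtract the per-wavelength (column) mean from every spectrum."""
    n = len(spectra)
    means = [sum(col) / n for col in zip(*spectra)]
    centered = [[v - m for v, m in zip(row, means)] for row in spectra]
    return centered, means

# Made-up reflectance spectra: 3 soil samples x 4 wavelength bands.
reflectance = [
    [0.10, 0.20, 0.40, 0.50],
    [0.20, 0.25, 0.50, 0.80],
    [0.40, 0.10, 0.25, 1.00],
]

absorbance = to_absorbance(reflectance)
centered, band_means = mean_center(absorbance)
```

After centering, each wavelength band has zero mean across samples, which is the form a regression such as PLS would typically receive.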

Comparison of Anomaly Detection Performance Based on GRU Model Applying Various Data Preprocessing Techniques and Data Oversampling (다양한 데이터 전처리 기법과 데이터 오버샘플링을 적용한 GRU 모델 기반 이상 탐지 성능 비교)

  • Yoo, Seung-Tae;Kim, Kangseok
    • Journal of the Korea Institute of Information Security & Cryptology / v.32 no.2 / pp.201-211 / 2022
  • Following recent changes in the cybersecurity paradigm, research on anomaly detection using machine learning and deep learning techniques is increasing. This study compares data preprocessing techniques that can improve the anomaly detection performance of a GRU (Gated Recurrent Unit) neural network-based intrusion detection model, using the open dataset NGIDS-DS (Next Generation IDS Dataset). In addition, to address the class imbalance between normal and attack data, detection performance was compared and analyzed across oversampling ratios, with oversampling performed using a DCGAN (Deep Convolutional Generative Adversarial Network). In the experiments, preprocessing the system call and process execution path features with the Doc2Vec algorithm showed good performance, and oversampling with DCGAN improved detection performance.

Effect of a Preprocessing Method on Inverting Chemiluminescence Images of Flames Burning Substitute Natural Gas (대체천연가스 화염 이미지 역변환에서 전처리 효과)

  • Ahn, Kwangho;Song, Wonjoon;Cha, Dongjin
    • Korean Journal of Air-Conditioning and Refrigeration Engineering / v.27 no.12 / pp.609-619 / 2015
  • A preprocessing scheme utilizing multi-division of the ROI (region of interest) in a chemiluminescence image during inversion is proposed. The resulting inverted image shows the flame's structure, which can be useful for studying combustion instability. The flame structure is often quantitatively visualized with PLIF (planar laser-induced fluorescence) images as well. The chemiluminescence image, which is a line-integral of the flame, needs to be preprocessed before inversion, mainly due to the inherent noise and the assumption of axisymmetry during the inversion. The feasibility of the multi-division preprocessing technique has been tested with experimentally obtained OH PLIF and $OH^*$ chemiluminescence images of jet and swirl-stabilized flames burning substitute natural gas (SNG). The technique outperforms two conventional methods, namely inversion without preprocessing and inversion with uni-division, reconstructing the SNG flame structures much better than either when compared against the corresponding OH PLIF images. The characteristics of the optimum polynomial degree for curve-fitting the flame region data in the multi-division method, for the two flames, have also been investigated.

Framework for Efficient Web Page Prediction using Deep Learning

  • Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.25 no.12 / pp.165-172 / 2020
  • With the exponential growth of access information on the web, predicting a user's next web page has become increasingly important. Deep learning is one method that can be used for this prediction: web logs are first analyzed through data preprocessing, and a deep learning algorithm then predicts the user's next web page from the preprocessed logs. In this paper, we propose a framework for web page prediction that combines methods for web log preprocessing with deep learning techniques for prediction. To speed up the preprocessing of large web logs, a Hadoop-based MapReduce programming model is used. In addition, we present a web prediction system that applies an efficient deep learning technique to the output of web log preprocessing for training and prediction. Through experiments, we show the performance improvement of the proposed method over traditional methods, as well as the accuracy of its predictions.
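The MapReduce-style log preprocessing mentioned in this abstract can be illustrated in plain Python. This is only a sketch of the map and reduce idea, not the paper's Hadoop job: it assumes Common Log Format input, and the filtering rules, sample log lines, and the names `map_phase` and `reduce_phase` are all hypothetical.

```python
from collections import defaultdict

# File extensions treated as non-page resources (an assumed filter rule).
STATIC_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")

def map_phase(line):
    """Emit (client_ip, url) for successful page requests; drop everything else."""
    parts = line.split()
    if len(parts) < 9:
        return []  # malformed line
    ip, url, status = parts[0], parts[6], parts[8]
    if status != "200" or url.endswith(STATIC_SUFFIXES):
        return []
    return [(ip, url)]

def reduce_phase(pairs):
    """Group URLs by client IP, preserving the order in which they were seen."""
    groups = defaultdict(list)
    for ip, url in pairs:
        groups[ip].append(url)
    return dict(groups)

# Made-up log lines in Common Log Format.
log_lines = [
    '10.0.0.1 - - [01/Jan/2020:10:00:00 +0900] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2020:10:00:01 +0900] "GET /logo.png HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2020:10:00:05 +0900] "GET /news.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2020:10:01:00 +0900] "GET /missing.html HTTP/1.1" 404 0',
    '10.0.0.1 - - [01/Jan/2020:10:02:00 +0900] "GET /products.html HTTP/1.1" 200 777',
]

pairs = [pair for line in log_lines for pair in map_phase(line)]
page_sequences = reduce_phase(pairs)
```

In a real MapReduce deployment the map calls would run in parallel over log splits and the framework would perform the grouping; the per-client page sequences are the kind of output a sequence model could then be trained on.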

Design of PCA-based pRBFNNs Pattern Classifier for Digit Recognition (숫자 인식을 위한 PCA 기반 pRBFNNs 패턴 분류기 설계)

  • Lee, Seung-Cheol;Oh, Sung-Kwun;Kim, Hyun-Ki
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.4 / pp.355-360 / 2015
  • In this paper, we propose the design of a PCA-based Radial Basis Function Neural Network for recognizing handwritten digits. The proposed pattern classifier consists of a PCA preprocessing step and a pRBFNNs pattern classification step. In the preprocessing step, PCA produces feature data that minimizes the information loss of the given data, and this feature data is then used as input to the pRBFNNs. The hidden layer of the proposed classifier is built with the Fuzzy C-Means (FCM) clustering algorithm, and the connection weights are defined as linear polynomial functions. In the output layer, the polynomial parameters are obtained using Least Square Estimation (LSE). The MNIST database, a well-known benchmark handwritten digit dataset, is used to evaluate the performance of the proposed classifier. The experimental results of the proposed system are compared with those of other existing classifiers.
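The PCA preprocessing step described here can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it extracts only the first principal component, uses power iteration on the sample covariance matrix (a library eigen-decomposition or SVD routine would be the usual choice), and the toy data and the function name `first_principal_component` are hypothetical.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (unit direction, projected scores) for the leading principal component."""
    n, d = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    centered = [[v - m for v, m in zip(row, means)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; assumes the start vector is not orthogonal
    # to the leading eigenvector.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered sample onto the component (the reduced feature).
    scores = [sum(c * v for c, v in zip(row, vec)) for row in centered]
    return vec, scores

# Toy 2-D data lying on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(data)
```

For data such as MNIST images, the same projection would be applied with several components, and the resulting scores would serve as the lower-dimensional classifier input.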

User Identification and Session completion in Input Data Preprocessing for Web Mining (웹 마이닝을 위한 입력 데이타의 전처리과정에서 사용자구분과 세션보정)

  • 최영환;이상용
    • Journal of KIISE:Software and Applications / v.30 no.9 / pp.843-849 / 2003
  • Web usage mining is a data mining technique that analyzes web users' usage patterns from large web logs. To apply it, users and user sessions must be identified correctly during preprocessing, but this cannot be done completely from log files in the standard web log format alone. Identifying users and user sessions is complicated by local caches, firewalls, ISPs, user privacy, cookies, and similar factors, and there is currently no definitive method that solves these problems. The local cache problem in particular is the most difficult obstacle to identifying user sessions, which serve as the input to web mining systems. In this paper, we propose a heuristic method that solves the local cache problem using only server-side click stream data, such as the referrer log, agent log, and access log, and that identifies user sessions and completes them.
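For orientation, a common baseline for user and session identification can be sketched as below. This is explicitly not the heuristic proposed in the paper and does not address the local cache problem: it assumes users are distinguished by the (IP address, user agent) pair and that a gap of more than 30 minutes starts a new session. The record layout, the threshold, and the function name `sessionize` are assumptions.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed inactivity threshold

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Split hits into per-user sessions.

    records: iterable of (ip, agent, timestamp_seconds, url) tuples.
    Returns {(ip, agent): [[url, ...], ...]} with one inner list per session.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in records:
        by_user[(ip, agent)].append((ts, url))

    sessions = {}
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        user_sessions = [[hits[0][1]]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                user_sessions.append([])  # inactivity gap: start a new session
            user_sessions[-1].append(url)
        sessions[user] = user_sessions
    return sessions

# Made-up hits: one IP shared by two browsers.
records = [
    ("1.1.1.1", "Firefox", 0, "/a"),
    ("1.1.1.1", "Firefox", 60, "/b"),
    ("1.1.1.1", "Chrome", 30, "/a"),
    ("1.1.1.1", "Firefox", 4000, "/c"),
]
sessions = sessionize(records)
```

Pages served from a local cache never reach the server log, so sessions built this way can have missing steps; filling those in is what session completion refers to.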

An Artificial Intelligent based Learning Model for BIM Elements Usage (건축 부재 사용량 예측을 위한 인공지능 학습 모델)

  • Beom-Su Kim;Jong-Hyeok Park;Soo-Hee Han;Kyung-Jun Kim
    • The Journal of the Korea institute of electronic communication sciences / v.18 no.1 / pp.107-114 / 2023
  • This study describes the design and implementation of an artificial intelligence-based learning model for predicting the usage of building members. Artificial intelligence (AI) is widely used in various fields thanks to technological advances, but in the field of building information management (BIM), AI technology is rarely used because of the specificity of the data in the field and the difficulty of collecting big data. Therefore, AI problems for BIM were identified, and a new preprocessing technique was devised to handle the specificity of the field's data. An AI model was implemented based on the designed preprocessing technique, and its accuracy in predicting construction component usage was confirmed to be at a level usable in actual industry.

Implementation of Recipe Recommendation System Using Ingredients Combination Analysis based on Recipe Data (레시피 데이터 기반의 식재료 궁합 분석을 이용한 레시피 추천 시스템 구현)

  • Min, Seonghee;Oh, Yoosoo
    • Journal of Korea Multimedia Society / v.24 no.8 / pp.1114-1121 / 2021
  • In this paper, we implement a recipe recommendation system using ingredient combination analysis based on recipe data. The proposed system receives an image of a food ingredient purchase receipt and recommends ingredients and recipes to the user. It preprocesses the receipt image and extracts text using an OCR algorithm. To compute how well ingredients combine, the system collects recipe data and extracts the ingredients of each collected recipe as training data. It then obtains vector representations by training a natural language processing algorithm, and recommends recipes based on ingredients with high similarity. The system can also recommend recipes that use replaceable ingredients, and it applies preprocessing and postprocessing to improve the accuracy of the result. For evaluation, we created a random input dataset and calculated the accuracy of each algorithm. In the performance evaluation, the Word2Vec algorithm achieved the highest accuracy.
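The idea of recommending by ingredient similarity can be illustrated without a trained embedding model. The sketch below swaps in a simpler technique than the Word2Vec model named in the abstract: each ingredient is represented by its co-occurrence counts with other ingredients, and similarity is the cosine between those count vectors. The recipes and the function names `cooccurrence_vectors` and `cosine` are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence_vectors(recipes):
    """Map each ingredient to a Counter of ingredients it appears with."""
    vectors = defaultdict(Counter)
    for ingredients in recipes:
        for a, b in permutations(set(ingredients), 2):
            vectors[a][b] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up recipe data: each recipe is a list of ingredients.
recipes = [
    ["tomato", "basil", "mozzarella"],
    ["tomato", "basil", "garlic"],
    ["pork", "kimchi", "garlic"],
]

vectors = cooccurrence_vectors(recipes)
sim_tomato_basil = cosine(vectors["tomato"], vectors["basil"])
sim_tomato_kimchi = cosine(vectors["tomato"], vectors["kimchi"])
```

Ingredients that appear in similar contexts score higher, which is the same intuition a learned embedding exploits, and is one way candidate replacement ingredients could be ranked.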