• Title/Summary/Keyword: Random Forest Classification

Evaluation of Applicability of Sea Ice Monitoring Using Random Forest Model Based on GOCI-II Images: A Study of Liaodong Bay 2021-2022 (GOCI-II 영상 기반 Random Forest 모델을 이용한 해빙 모니터링 적용 가능성 평가: 2021-2022년 랴오둥만을 대상으로)

  • Jinyeong Kim;Soyeong Jang;Jaeyeop Kwon;Tae-Ho Kim
    • Korean Journal of Remote Sensing / v.39 no.6_2 / pp.1651-1669 / 2023
  • Sea ice currently covers approximately 7% of the world's ocean area, is concentrated primarily in polar and high-latitude regions, and is subject to seasonal and annual variation. Because sea ice forms in various types over large spatial scales, and because oil and gas exploration and other marine activities are rapidly increasing, it is important to analyze sea ice area and type through time-series monitoring. Research on the type and area of sea ice currently relies on high-resolution satellite images and field measurements, but monitoring based on field measurements is limited. High-resolution optical satellite images allow sea ice types to be visually detected and identified over a wide area, and gaps in sea ice monitoring can be compensated for using the Geostationary Ocean Color Imager-II (GOCI-II), an ocean satellite sensor with high temporal resolution. This study examined the feasibility of sea ice monitoring by training a rule-based machine learning model on training data produced from high-resolution optical satellite images and applying it to GOCI-II images. Training data were extracted for Liaodong Bay in the Bohai Sea from 2021 to 2022, and a Random Forest (RF) model using GOCI-II was constructed and compared, qualitatively and quantitatively, with sea ice areas obtained from the existing normalized difference snow index (NDSI) based method and from high-resolution satellite images. Unlike the NDSI-based results, which underestimated the sea ice area, this study detected sea ice areas in relatively fine detail and confirmed that sea ice can be classified by type, making sea ice monitoring feasible. If the accuracy of the detection model is improved in the future by continuously building training data and incorporating factors that influence sea ice formation, the model is expected to be usable for sea ice monitoring in high-latitude ocean areas.

Comparison of resampling methods for dealing with imbalanced data in binary classification problem (이분형 자료의 분류문제에서 불균형을 다루기 위한 표본재추출 방법 비교)

  • Park, Geun U;Jung, Inkyung
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.349-374 / 2019
  • A class imbalance problem arises when one class outnumbers the other by a large proportion in binary data. Approaches such as transforming the training data have been studied to solve this problem. Among the methods for dealing with imbalance in classification, this study compared resampling methods, with the aim of detecting the minority class more effectively. Through simulation, a total of 20 over-sampling, under-sampling, and combined over- and under-sampling methods were compared. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. The simulation results showed that the random under-sampling (RUS) method had the highest sensitivity with an accuracy over 0.5. The next most sensitive method was the adaptive synthetic sampling approach, an over-sampling method. This indicates that the RUS method is suitable for finding minority class values. The results of applying the methods to several real data sets were similar to those of the simulation.
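As an illustration of the random under-sampling (RUS) idea described in this abstract, the following is a minimal sketch using only the Python standard library. The function names `random_under_sample` and `sensitivity` and the toy data are assumptions made for this example; they are not taken from the paper, which does not publish its implementation here.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        kept = rng.sample(rows, n_min)  # sampling without replacement
        X_out.extend(kept)
        y_out.extend([label] * n_min)
    return X_out, y_out


def sensitivity(y_true, y_pred, positive=1):
    """Share of true positive-class rows that were predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy imbalanced data (hypothetical): 10 minority rows, 90 majority rows.
X = [[i] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]
X_bal, y_bal = random_under_sample(X, y)
print(Counter(y_bal))
```

After resampling, both classes contain 10 rows, so a classifier trained on `X_bal` is no longer dominated by the majority class; sensitivity on the minority class is the metric the abstract uses to rank methods.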

A Study of Big Data Domain Automatic Classification Using Machine Learning (머신러닝을 이용한 빅데이터 도메인 자동 판별에 관한 연구)

  • Kong, Seongwon;Hwang, Deokyoul
    • The Journal of Bigdata / v.3 no.2 / pp.11-18 / 2018
  • This study addresses automatic domain classification for domain-based quality diagnosis, a key element of big data quality diagnosis. With the growing value and use of big data and the rise of the Fourth Industrial Revolution, efforts are being made worldwide to create new value by using big data in fields converging with IT, such as law, medicine, and finance. However, analysis based on low-reliability data causes critical problems in both the process and the result, and judgments based on such analysis are difficult to trust. Although the need for highly reliable data has increased, research on data quality has been insufficient. The purpose of this study is to shorten working time by using machine learning to automate the domain classification work, previously performed manually, in domain-based quality diagnosis, which is a key element of diagnostic evaluation for improving data quality. The approach extracts information about the characteristics of the data stored in the database, identifies the domain, converts this information into features, and automates domain classification using machine learning. The results will be used for big data quality diagnosis and are expected to contribute to quality improvement.

Ensemble Based Optimal Feature Selection Algorithm for Efficient Intrusion Detection in Wireless Sensor Network

  • Shyam Sundar S;R.S. Bhuvaneswaran;SaiRamesh L
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.8 / pp.2214-2229 / 2024
  • A wireless sensor network (WSN) consists of a large number of sensor nodes deployed across geographical locations to collect sensed information, process data, and communicate it to a control station for further processing. Because of the hostile environments in which the sensors are deployed, there are many opportunities for malicious nodes to perform malicious activities in the network. These security threats affect the performance and lifetime of sensor networks, and several security mechanisms address them in WSNs, namely cryptography, trust management, intrusion detection systems (IDS), and intrusion prevention systems (IPS). An IDS detects malicious activities and raises an alarm. These malicious activities exploit vulnerabilities in the network layer and affect all layers of the network. Among existing feature selection methods, filter-based methods do not consider redundancy among the selected features, and wrapper methods carry a high risk of overfitting the intrusion classifier. Because of overfitting, the classification algorithm fails to detect intrusions reliably. The main objective of this paper is to provide an efficient feature selection algorithm that is suitable for any type of classification algorithm and detects intrusions effectively. In this paper, network security is addressed by proposing a Feature Selection Algorithm using Chi Squared with Ensemble Method (FSChE). The proposed scheme combines a decision tree with the random forest classification algorithm to form an ensemble classifier. Experimental results on the NSL-KDD Cup data set justify the feasibility of the proposed scheme in terms of attack detection, packet delivery ratio, and time analysis. The results show that the proposed ensemble method increases overall performance by 10% to 25% with respect to these parameters.
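To make the chi-squared filtering step mentioned in this abstract concrete, here is a minimal sketch that scores categorical features with the Pearson chi-squared statistic and keeps the top k. It is not the FSChE algorithm itself, whose details are not given in the abstract; the feature names (`flag_a`, `flag_b`, `flag_c`) and the toy labels are hypothetical.

```python
from collections import Counter


def chi_squared(feature, labels):
    """Pearson chi-squared statistic of one categorical feature against class labels."""
    n = len(labels)
    f_counts = Counter(feature)
    l_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f, nf in f_counts.items():
        for l, nl in l_counts.items():
            expected = nf * nl / n  # count expected if feature and label were independent
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def select_top_k(columns, labels, k):
    """Rank features by chi-squared score and return the k highest-scoring names."""
    scores = {name: chi_squared(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: 1 = attack, 0 = normal traffic.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "flag_a": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the label
    "flag_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "flag_c": [1, 1, 1, 0, 0, 0, 0, 1],  # partly aligned with the label
}
selected, scores = select_top_k(columns, labels, k=2)
print(selected, scores)
```

The selected features would then be passed to the classifiers; in the abstract's scheme those are a decision tree and a random forest combined into an ensemble.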

A Study on Chaff Echo Detection using AdaBoost Algorithm and Radar Data (AdaBoost 알고리즘과 레이더 데이터를 이용한 채프에코 식별에 관한 연구)

  • Lee, Hansoo;Kim, Jonggeun;Yu, Jungwon;Jeong, Yeongsang;Kim, Sungshin
    • Journal of the Korean Institute of Intelligent Systems / v.23 no.6 / pp.545-550 / 2013
  • In the field of pattern recognition, data classification is an essential process for extracting meaningful information from data. The adaptive boosting algorithm, known as AdaBoost, is an improved boosting algorithm suited to real data analysis. It consists of weak classifiers, such as random guessing or random forest, whose accuracy is only slightly better than 50%, together with weights for combining them. A strong classifier is then created from the weak classifiers and the weights. In this paper, the AdaBoost algorithm is used to detect chaff echo, which has characteristics similar to precipitation echo and interferes with weather forecasting. The process of building the chaff echo classifier starts with spatial and temporal clustering of weather radar data based on similarity. From the clusters, a training data set separated into chaff echo and non-chaff echo is prepared, and the AdaBoost classifier is trained on it. To verify the classifier, it was applied to an actual case of chaff echo, and it was confirmed that the classifier can distinguish chaff echo efficiently.
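The way AdaBoost combines weighted weak classifiers into a strong one can be sketched as follows. This is a generic textbook-style AdaBoost with one-dimensional threshold stumps, written with the standard library only; it is not the authors' radar pipeline, and the toy data and function names are assumptions for illustration.

```python
import math


def stump_predict(x, thr, sign):
    """Weak classifier: predict `sign` when x >= thr, otherwise `-sign`."""
    return sign if x >= thr else -sign


def train_stump(X, y, w):
    """Return (weighted error, threshold, sign) of the best stump under weights w."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            err = sum(wi for wi, x, t in zip(w, X, y) if stump_predict(x, thr, sign) != t)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(X, y, rounds):
    """Train `rounds` stumps; each round up-weights the samples the last stump got wrong."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        model.append((alpha, thr, sign))
        w = [wi * math.exp(-alpha * t * stump_predict(x, thr, sign))
             for wi, x, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in model)
    return 1 if score >= 0 else -1


def accuracy(model, X, y):
    return sum(1 for x, t in zip(X, y) if predict(model, x) == t) / len(X)


# Toy 1-D data (hypothetical): no single threshold separates the two classes.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
print(accuracy(adaboost(X, y, rounds=1), X, y))
print(accuracy(adaboost(X, y, rounds=3), X, y))
```

On this toy set a single stump misclassifies three of the ten points, while three boosted stumps classify all ten correctly, which is the sense in which weak classifiers plus weights yield a strong classifier.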

Study of Genetic Diversity and Taxonomy of Genus Sorbus in Korea Using Random Amplified Polymorphic DNA (RAPD에 의한 한국내 마가목속 식물의 유전적 다양성과 분류학적 연구)

  • Park, So-Hye;Kim, Sea-Hyun;Seo, Hee-Won;Huh, Man-Kyu
    • Journal of Life Science / v.17 no.4 s.84 / pp.470-475 / 2007
  • Genus Sorbus comprises long-lived woody species. Plants of this genus are patchily distributed, primarily throughout Asia and Europe. Sorbus commixta is primarily distributed throughout Europe. Eastern Asian Sorbus species are regarded as very important herbal medicines in Korea and China. Random amplified polymorphic DNA (RAPD) was used to investigate genetic variation and phylogenetic relationships among four species of this genus in Korea. Although some Korean populations of these species were isolated and patchily distributed, they exhibited a high level of genetic diversity. Twenty-six primers revealed 205 loci, of which 128 were polymorphic (62.4%). S. commixta showed the highest diversity (0.165), whereas S. aucuparia showed the lowest (0.109). The estimated gene flow (Nm) was low within species (mean Nm = 0.755). A similarity matrix based on the proportion of shared fragments (GS) was used to evaluate relatedness among species. The estimates of GS ranged from 0.786 to 0.963. The molecular data allowed us to resolve well-supported clades of Korean taxa and European species. In addition, species-specific RAPD markers for genus Sorbus may be useful in germplasm classification and in agricultural processing of several taxa of this genus.

Object Classification Using Point Cloud and True Ortho-image by Applying Random Forest and Support Vector Machine Techniques (랜덤포레스트와 서포트벡터머신 기법을 적용한 포인트 클라우드와 실감정사영상을 이용한 객체분류)

  • Seo, Hong Deok;Kim, Eui Myoung
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.37 no.6 / pp.405-416 / 2019
  • With the development of information and communication technology, data are being produced and processed ever faster. For object classification using machine learning, a field of artificial intelligence, the data required for training can be collected easily thanks to the development of the internet and geospatial information technology. In the field of geospatial information, machine learning is also being applied to classify or recognize objects using images and point clouds. In this study, the problem of manually constructing training data was mitigated using the existing digital map version 1.0, and a technique for classifying roads, buildings, and vegetation using images and point clouds was proposed. Experiments showed that, using a true ortho-image with only RGB (Red, Green, Blue) bands, roads, buildings, and vegetation could be classified when their colors were clearly distinguishable. However, when the colors of the objects to be classified were similar, classification was poor. To overcome this limitation, random forest and support vector machine techniques were applied after band fusion of the true ortho-image and a normalized digital surface model, and roads, buildings, and vegetation were classified with more than 85% accuracy.

A Comparative Study of Reservoir Surface Area Detection Algorithm Using SAR Image (SAR 영상을 활용한 저수지 수표면적 탐지 알고리즘 비교 연구)

  • Jeong, Hagyu;Park, Jongsoo;Lee, Dalgeun;Lee, Junwoo
    • Korean Journal of Remote Sensing / v.38 no.6_3 / pp.1777-1788 / 2022
  • Reservoirs are a major source of water supply in the domestic agricultural environment, and monitoring reservoir water storage is important for the use and management of agricultural water resources. Remote sensing via satellite imagery can be an effective method for regularly monitoring widely distributed objects such as reservoirs. In this study, image classification and image segmentation algorithms are applied to Sentinel-1 Synthetic Aperture Radar (SAR) imagery to detect water bodies in 53 reservoirs in South Korea. Six algorithms are used: Neural Network (NN), Support Vector Machine (SVM), Random Forest (RF), Otsu, Watershed (WS), and Chan-Vese (CV), and the detection results are evaluated against in-situ images taken by drones. The correlations between the in-situ water surface area and the water surface area detected by each algorithm are NN 0.9941, SVM 0.9942, RF 0.9940, Otsu 0.9922, WS 0.9709, and CV 0.9736, and the larger the reservoir, the higher the linear correlation. WS showed low recall because of undetected water bodies, and NN, SVM, and RF showed low precision because of over-detection. For water body detection from SAR imagery, we found that aquatic plants and artificial structures can be sources of error that cause water bodies to be missed.
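Of the six algorithms listed in this abstract, Otsu thresholding is the simplest to sketch, and the same example can show how precision and recall respond to over-detection. The code below is a minimal standard-library sketch, assuming 8-bit pixel values in which water appears dark; the pixel values and the ground-truth mask are made up for illustration and are not from the study.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t that maximizes between-class variance (class 0: v <= t)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def precision_recall(pred, truth):
    """Precision and recall of a binary mask against a reference mask."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical pixels: dark values are water, but the fifth pixel is dark non-water.
pixels = [10, 12, 11, 13, 14, 200, 210, 205, 198]
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]
t = otsu_threshold(pixels)
water_mask = [1 if v <= t else 0 for v in pixels]
print(t, precision_recall(water_mask, truth))
```

Here the dark non-water pixel is labeled as water, so recall stays at 1.0 while precision drops to 0.8, the over-detection pattern the abstract reports for NN, SVM, and RF.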

Personalized insurance product based on similarity (유사도를 활용한 맞춤형 보험 추천 시스템)

  • Kim, Joon-Sung;Cho, A-Ra;Oh, Hayong
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.11 / pp.1599-1607 / 2022
  • The data mainly used for the models are personal information, insurance product information, and so on. With these data, we propose three types of models: a content-based filtering model, a collaborative filtering model, and a classification-based model. The content-based filtering model computes the cosine of the angle between user and item vectors and recommends items based on this cosine similarity; before computing the similarity, however, the data are divided into several groups by their features. Segmentation is performed with the K-means clustering algorithm and a manually defined algorithm. The collaborative filtering model uses the interactions that users have with items. The classification-based model uses decision tree and random forest classifiers to recommend items. According to the results, the content-based filtering model performs best. Because this model recommends items based on demographic and user features, the result indicates that these features are key to offering more appropriate items.
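The cosine-similarity step of the content-based filtering model can be sketched as below. The three-element feature vectors and the product names (`plan_a`, `plan_b`, `plan_c`) are hypothetical, since the abstract does not list the actual features, and the preceding segmentation step is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_vec, items, top_n=2):
    """Return the names of the top_n items most similar to the user vector."""
    ranked = sorted(items, key=lambda name: cosine_similarity(user_vec, items[name]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical user and product feature vectors sharing the same feature order.
user = [1, 0, 1]
products = {
    "plan_a": [1, 0, 1],
    "plan_b": [0, 1, 0],
    "plan_c": [1, 1, 1],
}
print(recommend(user, products))
```

Because cosine similarity depends only on direction, a product whose feature profile points the same way as the user's profile ranks first regardless of vector length.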

Corporate Bankruptcy Prediction Model using Explainable AI-based Feature Selection (설명가능 AI 기반의 변수선정을 이용한 기업부실예측모형)

  • Gundoo Moon;Kyoung-jae Kim
    • Journal of Intelligence and Information Systems / v.29 no.2 / pp.241-265 / 2023
  • A corporate insolvency prediction model serves as a vital tool for objectively monitoring the financial condition of companies. It enables timely warnings, facilitates responsive actions, and supports the formulation of effective management strategies to mitigate bankruptcy risks and enhance performance. Investors and financial institutions utilize default prediction models to minimize financial losses. As the interest in utilizing artificial intelligence (AI) technology for corporate insolvency prediction grows, extensive research has been conducted in this domain. However, there is an increasing demand for explainable AI models in corporate insolvency prediction, emphasizing interpretability and reliability. The SHAP (SHapley Additive exPlanations) technique has gained significant popularity and has demonstrated strong performance in various applications. Nonetheless, it has limitations such as computational cost, processing time, and scalability concerns based on the number of variables. This study introduces a novel approach to variable selection that reduces the number of variables by averaging SHAP values from bootstrapped data subsets instead of using the entire dataset. This technique aims to improve computational efficiency while maintaining excellent predictive performance. To obtain classification results, we aim to train random forest, XGBoost, and C5.0 models using carefully selected variables with high interpretability. The classification accuracy of the ensemble model, generated through soft voting as the goal of high-performance model design, is compared with the individual models. The study leverages data from 1,698 Korean light industrial companies and employs bootstrapping to create distinct data groups. Logistic Regression is employed to calculate SHAP values for each data group, and their averages are computed to derive the final SHAP values. The proposed model enhances interpretability and aims to achieve superior predictive performance.
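Two steps in this abstract lend themselves to a short sketch: averaging per-group importance values over bootstrapped data groups to select variables, and soft voting over model probabilities. The code below assumes the per-group importance values (for example, mean absolute SHAP values) have already been computed elsewhere; all variable names and numbers are hypothetical and are not results from the paper.

```python
def average_importance(group_scores):
    """Average each variable's importance across bootstrapped data groups."""
    names = group_scores[0].keys()
    n_groups = len(group_scores)
    return {name: sum(g[name] for g in group_scores) / n_groups for name in names}


def top_k_variables(group_scores, k):
    """Select the k variables with the highest averaged importance."""
    avg = average_importance(group_scores)
    return sorted(avg, key=avg.get, reverse=True)[:k]


def soft_vote(probabilities):
    """Average class probabilities over models; return (predicted class index, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg


# Hypothetical importance values for three bootstrapped groups.
groups = [
    {"debt_ratio": 0.30, "roa": 0.20, "firm_age": 0.05},
    {"debt_ratio": 0.25, "roa": 0.28, "firm_age": 0.04},
    {"debt_ratio": 0.35, "roa": 0.18, "firm_age": 0.06},
]
selected = top_k_variables(groups, k=2)

# Hypothetical [not bankrupt, bankrupt] probabilities from three models for one firm.
model_probs = [[0.1, 0.9], [0.6, 0.4], [0.6, 0.4]]
label, avg_probs = soft_vote(model_probs)
print(selected, label, avg_probs)
```

In the toy probabilities, two of three models lean toward "not bankrupt", yet soft voting predicts "bankrupt" because the first model is far more confident; this is how soft voting differs from a simple majority vote.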