• Title/Summary/Keyword: 앙상블 학습 기법

Search Result 95, Processing Time 0.023 seconds

A Non-annotated Recurrent Neural Network Ensemble-based Model for Near-real Time Detection of Erroneous Sea Level Anomaly in Coastal Tide Gauge Observation (비주석 재귀신경망 앙상블 모델을 기반으로 한 조위관측소 해수위의 준실시간 이상값 탐지)

  • LEE, EUN-JOO;KIM, YOUNG-TAEG;KIM, SONG-HAK;JU, HO-JEONG;PARK, JAE-HUN
    • The Sea:JOURNAL OF THE KOREAN SOCIETY OF OCEANOGRAPHY
    • /
    • v.26 no.4
    • /
    • pp.307-326
    • /
    • 2021
  • Real-time sea level observations from tide gauges include missing and erroneous values. Classification as abnormal values can be done for the latter by the quality control procedure. Although the 3𝜎 (three standard deviations) rule has been applied in general to eliminate them, it is difficult to apply it to the sea-level data where extreme values can exist due to weather events, etc., or where erroneous values can exist even within the 3𝜎 range. An artificial intelligence model set designed in this study consists of non-annotated recurrent neural networks and ensemble techniques that do not require pre-labeling of the abnormal values. The developed model can identify an erroneous value less than 20 minutes of tide gauge recording an abnormal sea level. The validated model well separates normal and abnormal values during normal times and weather events. It was also confirmed that abnormal values can be detected even in the period of years when the sea level data have not been used for training. The artificial neural network algorithm utilized in this study is not limited to the coastal sea level, and hence it can be extended to the detection model of erroneous values in various oceanic and atmospheric data.

A Study on the Prediction of Rock Classification Using Shield TBM Data and Machine Learning Classification Algorithms (쉴드 TBM 데이터와 머신러닝 분류 알고리즘을 이용한 암반 분류 예측에 관한 연구)

  • Kang, Tae-Ho;Choi, Soon-Wook;Lee, Chulho;Chang, Soo-Ho
    • Tunnel and Underground Space
    • /
    • v.31 no.6
    • /
    • pp.494-507
    • /
    • 2021
  • With the increasing use of TBM, research has recently been conducted in Korea to analyze TBM data with machine learning techniques to predict the ground in front of TBM, predict the exchange cycle of disk cutters, and predict the advance rate of TBM. In this study, classification prediction of rock characteristics of slurry shield TBM sites was made by combining traditional rock classification techniques and machine learning techniques widely used in various fields with machine data during TBM excavation. The items of rock characteristic classification criteria were set as RQD, uniaxial compression strength, and elastic wave speed, and the rock conditions for each item were classified into three classes: class 0 (good), 1 (normal), and 2 (poor), and machine learning was performed on six class algorithms. As a result, the ensemble model showed good performance, and the LigthtGBM model, which showed excellent results in learning speed as well as learning performance, was found to be optimal in the target site ground. Using the classification model for the three rock characteristics set in this study, it is believed that it will be possible to provide rock conditions for sections where ground information is not provided, which will help during excavation work.

A Study on the Prediction of Disc Cutter Wear Using TBM Data and Machine Learning Algorithm (TBM 데이터와 머신러닝 기법을 이용한 디스크 커터마모 예측에 관한 연구)

  • Tae-Ho, Kang;Soon-Wook, Choi;Chulho, Lee;Soo-Ho, Chang
    • Tunnel and Underground Space
    • /
    • v.32 no.6
    • /
    • pp.502-517
    • /
    • 2022
  • As the use of TBM increases, research has recently increased to to analyze TBM data with machine learning techniques to predict the exchange cycle of disc cutters, and predict the advance rate of TBM. In this study, a regression prediction of disc cutte wear of slurry shield TBM site was made by combining machine learning based on the machine data and the geotechnical data obtained during the excavation. The data were divided into 7:3 for training and testing the prediction of disc cutter wear, and the hyper-parameters are optimized by cross-validated grid-search over a parameter grid. As a result, gradient boosting based on the ensemble model showed good performance with a determination coefficient of 0.852 and a root-mean-square-error of 3.111 and especially excellent results in fit times along with learning performance. Based on the results, it is judged that the suitability of the prediction model using data including mechanical data and geotechnical information is high. In addition, research is needed to increase the diversity of ground conditions and the amount of disc cutter data.

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.21-44
    • /
    • 2018
  • In recent years, the rapid development of internet technology and the popularization of smart devices have resulted in massive amounts of text data. Those text data were produced and distributed through various media platforms such as World Wide Web, Internet news feeds, microblog, and social media. However, this enormous amount of easily obtained information is lack of organization. Therefore, this problem has raised the interest of many researchers in order to manage this huge amount of information. Further, this problem also required professionals that are capable of classifying relevant information and hence text classification is introduced. Text classification is a challenging task in modern data analysis, which it needs to assign a text document into one or more predefined categories or classes. In text classification field, there are different kinds of techniques available such as K-Nearest Neighbor, Naïve Bayes Algorithm, Support Vector Machine, Decision Tree, and Artificial Neural Network. However, while dealing with huge amount of text data, model performance and accuracy becomes a challenge. According to the type of words used in the corpus and type of features created for classification, the performance of a text classification model can be varied. Most of the attempts are been made based on proposing a new algorithm or modifying an existing algorithm. This kind of research can be said already reached their certain limitations for further improvements. In this study, aside from proposing a new algorithm or modifying the algorithm, we focus on searching a way to modify the use of data. It is widely known that classifier performance is influenced by the quality of training data upon which this classifier is built. The real world datasets in most of the time contain noise, or in other words noisy data, these can actually affect the decision made by the classifiers built from these data. In this study, we consider that the data from different domains, which is heterogeneous data might have the characteristics of noise which can be utilized in the classification process. In order to build the classifier, machine learning algorithm is performed based on the assumption that the characteristics of training data and target data are the same or very similar to each other. However, in the case of unstructured data such as text, the features are determined according to the vocabularies included in the document. If the viewpoints of the learning data and target data are different, the features may be appearing different between these two data. In this study, we attempt to improve the classification accuracy by strengthening the robustness of the document classifier through artificially injecting the noise into the process of constructing the document classifier. With data coming from various kind of sources, these data are likely formatted differently. These cause difficulties for traditional machine learning algorithms because they are not developed to recognize different type of data representation at one time and to put them together in same generalization. Therefore, in order to utilize heterogeneous data in the learning process of document classifier, we apply semi-supervised learning in our study. However, unlabeled data might have the possibility to degrade the performance of the document classifier. Therefore, we further proposed a method called Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA) to select only the documents that contributing to the accuracy improvement of the classifier. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data. The most confident classification rules will be selected and applied for the final decision making. In this paper, three different types of real-world data sources were used, which are news, twitter and blogs.

Monitoring Ground-level SO2 Concentrations Based on a Stacking Ensemble Approach Using Satellite Data and Numerical Models (위성 자료와 수치모델 자료를 활용한 스태킹 앙상블 기반 SO2 지상농도 추정)

  • Choi, Hyunyoung;Kang, Yoojin;Im, Jungho;Shin, Minso;Park, Seohui;Kim, Sang-Min
    • Korean Journal of Remote Sensing
    • /
    • v.36 no.5_3
    • /
    • pp.1053-1066
    • /
    • 2020
  • Sulfur dioxide (SO2) is primarily released through industrial, residential, and transportation activities, and creates secondary air pollutants through chemical reactions in the atmosphere. Long-term exposure to SO2 can result in a negative effect on the human body causing respiratory or cardiovascular disease, which makes the effective and continuous monitoring of SO2 crucial. In South Korea, SO2 monitoring at ground stations has been performed, but this does not provide spatially continuous information of SO2 concentrations. Thus, this research estimated spatially continuous ground-level SO2 concentrations at 1 km resolution over South Korea through the synergistic use of satellite data and numerical models. A stacking ensemble approach, fusing multiple machine learning algorithms at two levels (i.e., base and meta), was adopted for ground-level SO2 estimation using data from January 2015 to April 2019. Random forest and extreme gradient boosting were used as based models and multiple linear regression was adopted for the meta-model. The cross-validation results showed that the meta-model produced the improved performance by 25% compared to the base models, resulting in the correlation coefficient of 0.48 and root-mean-square-error of 0.0032 ppm. In addition, the temporal transferability of the approach was evaluated for one-year data which were not used in the model development. The spatial distribution of ground-level SO2 concentrations based on the proposed model agreed with the general seasonality of SO2 and the temporal patterns of emission sources.

An Analytical Study on Automatic Classification of Domestic Journal articles Using Random Forest (랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.36 no.2
    • /
    • pp.57-77
    • /
    • 2019
  • Random Forest (RF), a representative ensemble technique, was applied to automatic classification of journal articles in the field of library and information science. Especially, I performed various experiments on the main factors such as tree number, feature selection, and learning set size in terms of classification performance that automatically assigns class labels to domestic journals. Through this, I explored ways to optimize the performance of random forests (RF) for imbalanced datasets in real environments. Consequently, for the automatic classification of domestic journal articles, Random Forest (RF) can be expected to have the best classification performance when using tree number interval 100~1000(C), small feature set (10%) based on chi-square statistic (CHI), and most learning sets (9-10 years).

A Study On The Classification Of Driver's Sleep State While Driving Through BCG Signal Optimization (BCG 신호 최적화를 통한 주행중 운전자 수면 상태 분류에 관한 연구)

  • Park, Jin Su;Jeong, Ji Seong;Yang, Chul Seung;Lee, Jeong Gi
    • The Journal of the Convergence on Culture Technology
    • /
    • v.8 no.6
    • /
    • pp.905-910
    • /
    • 2022
  • Drowsy driving requires a lot of social attention because it increases the incidence of traffic accidents and leads to fatal accidents. The number of accidents caused by drowsy driving is increasing every year. Therefore, in order to solve this problem all over the world, research for measuring various biosignals is being conducted. Among them, this paper focuses on non-contact biosignal analysis. Various noises such as engine, tire, and body vibrations are generated in a running vehicle. To measure the driver's heart rate and respiration rate in a driving vehicle with a piezoelectric sensor, a sensor plate that can cushion vehicle vibrations was designed and noise generated from the vehicle was reduced. In addition, we developed a system for classifying whether the driver is sleeping or not by extracting the model using the CNN-LSTM ensemble learning technique based on the signal of the piezoelectric sensor. In order to learn the sleep state, the subject's biosignals were acquired every 30 seconds, and 797 pieces of data were comparatively analyzed.

A Study on the Timing of Starting Pitcher Replacement Using Machine Learning (머신러닝을 활용한 선발 투수 교체시기에 관한 연구)

  • Noh, Seongjin;Noh, Mijin;Han, Mumoungcho;Um, Sunhyun;Kim, Yangsok
    • Smart Media Journal
    • /
    • v.11 no.2
    • /
    • pp.9-17
    • /
    • 2022
  • The purpose of this study is to implement a predictive model to support decision-making to replace a starting pitcher before a crisis situation in a baseball game. To this end, using the Major League Statcast data provided by Baseball Savant, we implement a predictive model that preemptively replaces starting pitchers before a crisis situation. To this end, first, the crisis situation that the starting pitcher faces in the game was derived through data exploration. Second, if the starting pitcher was replaced before the end of the inning, learning was carried out by composing a label with a replacement in the previous inning. As a result of comparing the trained models, the model based on the ensemble method showed the highest predictive performance with an F1-Score of 65%. The practical significance of this study is that the proposed model can contribute to increasing the team's winning probability by replacing the starting pitcher before a crisis situation, and the coach will be able to receive data-based strategic decision-making support during the game.

Sensor Fault Detection Scheme based on Deep Learning and Support Vector Machine (딥 러닝 및 서포트 벡터 머신기반 센서 고장 검출 기법)

  • Yang, Jae-Wan;Lee, Young-Doo;Koo, In-Soo
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.18 no.2
    • /
    • pp.185-195
    • /
    • 2018
  • As machines have been automated in the field of industries in recent years, it is a paramount importance to manage and maintain the automation machines. When a fault occurs in sensors attached to the machine, the machine may malfunction and further, a huge damage will be caused in the process line. To prevent the situation, the fault of sensors should be monitored, diagnosed and classified in a proper way. In the paper, we propose a sensor fault detection scheme based on SVM and CNN to detect and classify typical sensor errors such as erratic, drift, hard-over, spike, and stuck faults. Time-domain statistical features are utilized for the learning and testing in the proposed scheme, and the genetic algorithm is utilized to select the subset of optimal features. To classify multiple sensor faults, a multi-layer SVM is utilized, and ensemble technique is used for CNN. As a result, the SVM that utilizes a subset of features selected by the genetic algorithm provides better performance than the SVM that utilizes all the features. However, the performance of CNN is superior to that of the SVM.

A Machine Learning-Based Encryption Behavior Cognitive Technique for Ransomware Detection (랜섬웨어 탐지를 위한 머신러닝 기반 암호화 행위 감지 기법)

  • Yoon-Cheol Hwang
    • Journal of Industrial Convergence
    • /
    • v.21 no.12
    • /
    • pp.55-62
    • /
    • 2023
  • Recent ransomware attacks employ various techniques and pathways, posing significant challenges in early detection and defense. Consequently, the scale of damage is continually growing. This paper introduces a machine learning-based approach for effective ransomware detection by focusing on file encryption and encryption patterns, which are pivotal functionalities utilized by ransomware. Ransomware is identified by analyzing password behavior and encryption patterns, making it possible to detect specific ransomware variants and new types of ransomware, thereby mitigating ransomware attacks effectively. The proposed machine learning-based encryption behavior detection technique extracts encryption and encryption pattern characteristics and trains them using a machine learning classifier. The final outcome is an ensemble of results from two classifiers. The classifier plays a key role in determining the presence or absence of ransomware, leading to enhanced accuracy. The proposed technique is implemented using the numpy, pandas, and Python's Scikit-Learn library. Evaluation indicators reveal an average accuracy of 94%, precision of 95%, recall rate of 93%, and an F1 score of 95%. These performance results validate the feasibility of ransomware detection through encryption behavior analysis, and further research is encouraged to enhance the technique for proactive ransomware detection.