• Title/Summary/Keyword: Weighted Average Ensemble (가중평균 앙상블)


Development of ensemble weighting technique for sequential forecasted rainfall to extend forecast precedence time (예측 선행시간 확장을 위한 순차적 예측강우 가중평균 앙상블 생성기법 개발)

  • Na, Wooyoung;Kang, Minseok;Kim, Gildo;Lee, Hyunwook;Yoo, Chulsang
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2019.05a
    • /
    • pp.59-59
    • /
    • 2019
  • Convective localized heavy rainfall has recently become frequent due to climate change, and such rainfall causes considerable damage to small stream basins in mountainous areas. Because convective storms are small in scale and move quickly, they show partially different rainfall characteristics over basins of mesoscale or larger. Moreover, since this change in storm patterns is becoming established as a meteorological characteristic rather than a temporary phenomenon, countermeasures are all the more necessary. The forecasted rainfall data used in flash flood warning systems are limited in forecast lead time: because of the bias and uncertainty inherent in the forecast data themselves, reliability drops sharply once the lead time exceeds three hours. To address this, efforts have continued in Korea to correct the bias against ground observations or to improve the quality of the forecast data themselves. In this study, to extend the forecast lead time, we developed a technique that simulates ensemble forecasts by taking a weighted average of sequentially produced rainfall forecasts. The forecasts for each lead time were treated as ensemble members, their covariance structure was identified, and the weights were determined from the variance and covariance values. We examined the feasibility of extension for lead times of one, two, and three hours, and determined, applied, and evaluated the optimal number of ensemble members. Major storm events from 2016 and 2017 were selected, and the ensemble generation methodology was applied across the whole of Korea. As a result, the weighted-average ensemble forecasts showed better predictive quality than both a single forecast rainfall field and the simple-average ensemble forecasts, and the variance of the forecasts also decreased, confirming reduced forecast uncertainty.
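
The study above derives member weights from the variance and covariance of the ensemble members, but the abstract does not spell out the formula. A minimal sketch of one standard choice, minimum-variance (inverse-covariance) weighting under a sum-to-one constraint, follows; the function name and the synthetic error data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def min_variance_weights(member_errors):
    """Minimum-variance weights for ensemble members.

    member_errors: (n_members, n_samples) error series (forecast minus
    observation) of each member. The weights w = C^{-1}1 / (1'C^{-1}1)
    minimize the variance of the weighted average subject to sum(w) = 1.
    """
    C = np.cov(member_errors)          # member error covariance matrix
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)       # proportional to C^{-1} 1
    return w / w.sum()

# Hypothetical example: three lead-time forecasts with growing error spread
rng = np.random.default_rng(0)
errors = rng.normal(0.0, [[1.0], [1.5], [2.0]], size=(3, 100))
print(min_variance_weights(errors))    # smaller-variance members weigh more
```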

Attention-Based Ensemble for Mitigating Side Effects of Data Imbalance Method (데이터 불균형 기법의 부작용 완화를 위한 어텐션 기반 앙상블)

  • Yo-Han Park;Yong-Seok Choi;Wencke Liermann;Kong Joo Lee
    • Annual Conference on Human and Language Technology
    • /
    • 2023.10a
    • /
    • pp.546-551
    • /
    • 2023
  • In general, deep learning models perform best when the amount of data is balanced across all labels. In practice, however, data for particular labels are often scarce, giving rise to the data imbalance problem. Data imbalance techniques such as oversampling and weighted loss have been studied as remedies, but they carry the side effect of degrading performance on data-rich labels while improving performance on data-poor labels. In this paper, we propose an attention-based ensemble technique to mitigate this problem. The attention-based ensemble makes its final prediction by taking a weighted average of the outputs of a model trained with a data imbalance technique and a model trained without one, where the weights are adjusted dynamically through an attention mechanism. The ensemble model can therefore adapt its weights to the characteristics of the input data. Experiments were conducted on automated essay scoring data, and the results show that the proposed model mitigates the side effects of the data imbalance technique and improves performance.
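
The exact attention architecture is not given in the abstract above; the sketch below shows one plausible form, an input-dependent softmax weighting over the two models' outputs. All names, shapes, and parameters are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_ensemble(h, probs_imb, probs_plain, W):
    """Blend two models' class probabilities with input-dependent weights.

    h:           (d,) representation of the input (e.g. a [CLS] embedding)
    probs_imb:   (n_classes,) output of the model trained with an imbalance method
    probs_plain: (n_classes,) output of the model trained without one
    W:           (2, d) attention parameters (hypothetical; learned jointly)
    """
    alpha = softmax(W @ h)             # one weight per model, summing to 1
    return alpha[0] * probs_imb + alpha[1] * probs_plain

rng = np.random.default_rng(1)
h, W = rng.normal(size=8), rng.normal(size=(2, 8))
p = attention_ensemble(h, softmax(rng.normal(size=4)), softmax(rng.normal(size=4)), W)
print(p, p.sum())                      # a valid distribution over 4 classes
```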

Incremental Ensemble Learning for The Combination of Multiple Models of Locally Weighted Regression Using Genetic Algorithm (유전 알고리즘을 이용한 국소가중회귀의 다중모델 결합을 위한 점진적 앙상블 학습)

  • Kim, Sang Hun;Chung, Byung Hee;Lee, Gun Ho
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.9
    • /
    • pp.351-360
    • /
    • 2018
  • The LWR (Locally Weighted Regression) model, traditionally a lazy learning model, obtains a prediction for a given input variable, the query point; it is a regression equation over a short interval, learned by giving higher weights to samples closer to the query point. We study an incremental ensemble learning approach for LWR, a form of lazy, memory-based learning. The proposed incremental ensemble learning method sequentially generates and integrates LWR models over time, using a genetic algorithm to obtain a solution at a specific query point. A weakness of existing LWR is that multiple LWR models can be generated depending on the indicator function and the selection of data samples, and the quality of the predictions varies with the model chosen; nevertheless, the problem of selecting or combining multiple LWR models has not been studied. In this study, after generating the initial LWR model from the indicator function and the sample data set, we iterate an evolutionary learning process to obtain a proper indicator function, and we assess the LWR models on other sample data sets to overcome data set bias. We adopt an eager learning method to gradually generate and store LWR models as data are generated for all sections. To obtain a prediction at a specific point in time, an LWR model is generated from newly generated data within a predetermined interval and then combined with the existing LWR models of that section using a genetic algorithm. The proposed method shows better results than combining multiple LWR models by simple averaging. The results are compared with predictions from multiple regression analysis on real data such as hourly traffic volume in a specific area and hourly sales at a highway rest area.
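
For readers unfamiliar with the base learner, the sketch below shows a plain locally weighted regression prediction at a single query point using a Gaussian kernel; the bandwidth, data, and names are illustrative assumptions, and the paper's genetic-algorithm model combination is not reproduced here.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Locally weighted regression prediction at a single query point.

    Fits a weighted linear model around x_query, where each training
    sample is weighted by a Gaussian kernel of its distance to the query.
    """
    Xb = np.column_stack([np.ones(len(X)), X])        # add intercept column
    xq = np.concatenate([[1.0], np.atleast_1d(x_query)])
    d2 = ((X - x_query) ** 2).sum(axis=-1) if X.ndim > 1 else (X - x_query) ** 2
    w = np.exp(-d2 / (2 * tau ** 2))                  # closer points weigh more
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ beta

# Hypothetical 1-D example
rng = np.random.default_rng(2)
X = np.linspace(0, 10, 50)
y = np.sin(X) + rng.normal(0, 0.1, 50)
print(lwr_predict(X, y, 3.0, tau=0.5))  # close to sin(3.0)
```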

A Multimodal Profile Ensemble Approach to Development of Recommender Systems Using Big Data (빅데이터 기반 추천시스템 구현을 위한 다중 프로파일 앙상블 기법)

  • Kim, Minjeong;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.93-110
    • /
    • 2015
  • The recommender system is a system which recommends products to customers who are likely to be interested in them. Based on automated information filtering technology, various recommender systems have been developed. Collaborative filtering (CF), one of the most successful recommendation algorithms, has been applied in a number of different domains such as recommending Web pages, books, movies, music, and products. However, CF has a known critical shortcoming. CF finds neighbors whose preferences are like those of the target customer and recommends the products those neighbors have liked most. Thus, CF works properly only when there is a sufficient number of ratings on common products. When customer ratings are scarce, neighborhood formation becomes inaccurate, resulting in poor recommendations. To improve the performance of CF-based recommender systems, most related studies have focused on developing novel algorithms under the assumption of a single profile, created from users' item ratings, purchase transactions, or Web access logs. With the advent of big data, companies have come to collect more data and use a larger variety of information, and many now regard the use of big data as essential to improving their competitiveness and creating new value. In particular, the use of personal big data in recommender systems is on the rise, because personal big data facilitate more accurate identification of users' preferences and behaviors. The proposed recommendation methodology is as follows. First, multimodal user profiles are created from personal big data to capture users' preferences and behavior from various viewpoints; we derive five user profiles based on personal information such as ratings, site preferences, demographics, Internet usage, and text topics. Next, the similarity between users is calculated from the profiles, and neighbors are found from the results. One of three ensemble approaches is applied to calculate the similarity: the similarity of the combined profile, the average of the per-profile similarities, or the weighted average of the per-profile similarities. Finally, the products that the neighbors prefer most are recommended to the target users. For the experiments, we used demographic data and a very large volume of Web log transactions for 5,000 panel users of a company specializing in analyzing the ranks of Web sites. R and SAS E-Miner were used to implement the proposed recommender system and to conduct the topic analysis via keyword search, respectively. To evaluate recommendation performance, we used 60% of the data for training and 40% for testing, and 5-fold cross validation was conducted to enhance the reliability of the experiments. The widely used F1 metric, which gives equal weight to recall and precision, was employed for evaluation. The proposed methodology achieved a significant improvement over the single-profile CF algorithm; in particular, the ensemble approach using the weighted average similarity showed the highest performance. That is, the rate of improvement in F1 is 16.9 percent for the ensemble approach using the weighted average similarity and 8.1 percent for the ensemble approach using the average similarity of each profile. From these results, we conclude that the multimodal profile ensemble approach is a viable solution to the problems encountered when customer ratings are scarce. This study is significant in suggesting what kinds of information can be used to create profiles in a big data environment and how they can be combined and utilized effectively. However, the methodology should be studied further for real-world application: the proposed method should be applied to different recommendation algorithms to compare recommendation accuracy and to identify which combination performs best.
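
Of the three ensemble approaches above, the weighted average of per-profile similarities performed best. A minimal sketch of that combination step follows; the profile weights and similarity matrices are illustrative assumptions, not values from the study.

```python
import numpy as np

def weighted_profile_similarity(sims, weights):
    """Weighted average of per-profile user similarity matrices.

    sims:    (n_profiles, n_users, n_users), one similarity matrix per
             profile (e.g. rating, site preference, demographic, usage, topic)
    weights: (n_profiles,) profile weights, normalized here to sum to 1
    """
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), sims, axes=1)

# Hypothetical example with 5 profiles and 4 users
rng = np.random.default_rng(3)
sims = rng.uniform(0.0, 1.0, size=(5, 4, 4))
combined = weighted_profile_similarity(sims, [0.4, 0.2, 0.1, 0.1, 0.2])
np.fill_diagonal(combined, 1.0)            # self-similarity ranked first
neighbors = np.argsort(-combined[0])[1:3]  # two nearest neighbors of user 0
print(neighbors)
```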

Water Balance Projection Using Climate Change Scenarios in the Korean Peninsula (기후변화 시나리오를 활용한 미래 한반도 물수급 전망)

  • Kim, Cho-Rong;Kim, Young-Oh;Seo, Seung Beom;Choi, Su-Woong
    • Journal of Korea Water Resources Association
    • /
    • v.46 no.8
    • /
    • pp.807-819
    • /
    • 2013
  • This study proposes a new methodology for future water balance projection under climate change: instead of inputting GCM-based future streamflows into a water balance model directly, a weight is assigned to each scenario. The k-nearest neighbor algorithm was employed to assign the weights, and streamflow in the non-flood period (October to the following June) was selected as the criterion for weighting. GCM-driven precipitation was input to the TANK model to simulate future streamflow scenarios, and quantile mapping was applied to correct the bias between the GCM hindcasts and historical data. Based on these bias-corrected streamflows, different weights were assigned to each streamflow scenario to calculate water shortage for the projection periods: the 2020s (2010~2039), 2050s (2040~2069), and 2080s (2070~2099). Applying the proposed methodology to project water shortage over the Korean Peninsula, the average water shortage for the 2020s is projected to increase by 10~32% compared to the baseline period (1967~2003). In addition, as non-flood-period streamflow gradually decreases toward the 2080s, the average water shortage for the 2080s is projected to increase by up to 97% (516.5 million $m^3/yr$) compared to the baseline. While existing climate change research projects a radical increase in future water shortage, the weighting method projects a more conservative change. This study is significant for the applicability of water balance projection under climate change within the existing framework of national water resources planning, which lessens confusion for decision-makers in the water sector.
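
The study corrects the bias between GCM hindcasts and historical data with quantile mapping. The sketch below shows a basic empirical quantile mapping on synthetic data standing in for the GCM and observed series; it is an illustrative assumption, not the paper's exact procedure.

```python
import numpy as np

def quantile_mapping(model_series, model_hindcast, observed):
    """Empirical quantile mapping bias correction.

    Maps each value of model_series to the observed value at the same
    empirical quantile that the value occupies in the model hindcast.
    """
    # Quantile of each value within the hindcast distribution
    q = np.searchsorted(np.sort(model_hindcast), model_series) / len(model_hindcast)
    q = np.clip(q, 0.0, 1.0)
    # Corresponding value in the observed distribution
    return np.quantile(observed, q)

# Hypothetical example: hindcast biased low relative to observations
rng = np.random.default_rng(4)
obs = rng.gamma(2.0, 50.0, 1000)           # historical streamflow
hindcast = obs * 0.8                       # model systematically 20% low
future = rng.gamma(2.0, 45.0, 12) * 0.8    # raw future simulation
print(quantile_mapping(future, hindcast, obs))
```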

A Study on Bagging Neural Network for Predicting Defect Size of Steam Generator Tube in Nuclear Power Plant (원전 증기발생기 세관 결함 크기 예측을 위한 Bagging 신경회로망에 관한 연구)

  • Kim, Kyung-Jin;Jo, Nam-Hoon
    • Journal of the Korean Society for Nondestructive Testing
    • /
    • v.30 no.4
    • /
    • pp.302-310
    • /
    • 2010
  • In this paper, we studied a Bagging neural network for predicting the defect size of steam generator (SG) tubes in nuclear power plants. Bagging is a method for creating an ensemble of estimators based on bootstrap sampling. To predict the defect size of SG tubes, we first generated eddy current testing signals for four SG tube defect patterns with various widths and depths. Then, we constructed a single neural network (SNN) and a Bagging neural network (BNN) to estimate the width and depth of each defect. The estimation performance of the SNN and BNN was measured by means of peak error. According to our experimental results, the average peak errors of the SNN and BNN for estimating defect depth were 0.117 and 0.089 mm, respectively. In the case of estimating defect width, the average peak errors of the SNN and BNN were 0.494 and 0.306 mm, respectively. This shows that the estimation performance of the BNN is superior to that of the SNN.
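
Bagging, as described above, trains each estimator on a bootstrap resample and averages the predictions, which typically reduces estimator variance. A minimal sketch using scikit-learn's MLPRegressor as the base network follows; the architecture, data, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def bagging_nn_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    """Bagging ensemble of neural networks via bootstrap sampling.

    Trains each network on a bootstrap resample of the training set and
    averages the predictions over the ensemble.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for i in range(n_estimators):
        idx = rng.integers(0, n, size=n)             # bootstrap sample
        net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=i)
        net.fit(X_train[idx], y_train[idx])
        preds.append(net.predict(X_test))
    return np.mean(preds, axis=0)                    # ensemble average

# Hypothetical example: recover a 1-D nonlinear signal
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 200)
print(bagging_nn_predict(X, y, np.array([[0.5]])))
```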

Ordinary Kriging of Daily Mean SST (Sea Surface Temperature) around South Korea and the Analysis of Interpolation Accuracy (정규크리깅을 이용한 우리나라 주변해역 일평균 해수면온도 격자지도화 및 내삽정확도 분석)

  • Ahn, Jihye;Lee, Yangwon
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.40 no.1
    • /
    • pp.51-66
    • /
    • 2022
  • SST (Sea Surface Temperature) is governed by atmosphere-ocean interaction, one of the most important mechanisms of the Earth system. Because it is a crucial oceanic and meteorological factor for understanding climate change, gap-free grid data at a given spatial and temporal resolution are valuable for SST studies. This paper examined the production of daily SST grid maps from 137 stations in 2020 through ordinary kriging with variogram optimization, together with an assessment of their accuracy. The variogram optimization was achieved by the WLS (Weighted Least Squares) method, and blind tests for the interpolation accuracy assessment were conducted with an objective and spatially unbiased sampling scheme. The four rounds of blind tests showed quite high accuracy: a root mean square error between 0.995 and 1.035℃ and a correlation coefficient between 0.981 and 0.982. By season, accuracy in summer was somewhat lower, presumably because of abrupt SST changes caused by typhoons. Accuracy was better in the far seas than in the near seas, and the West Sea showed better accuracy than the East or South Sea, because semi-enclosed near seas can have different physical characteristics. Seasonal and regional factors should be considered to improve accuracy in future work, and the improved SST can serve as a member of the SST ensemble around South Korea.
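
The sketch below shows the ordinary kriging system in its variogram form, with an exponential variogram standing in for the paper's WLS-optimized model; the variogram parameters and station data are illustrative assumptions.

```python
import numpy as np

def ordinary_kriging(coords, values, query, sill=1.0, length=2.0, nugget=0.0):
    """Ordinary kriging with an exponential variogram (minimal sketch).

    Solves the ordinary kriging system with a Lagrange multiplier so the
    weights sum to 1, giving the unbiased estimate at the query location.
    """
    def gamma(h):  # exponential variogram model
        return nugget + sill * (1.0 - np.exp(-h / length))

    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = gamma(np.linalg.norm(coords - query, axis=-1))
    w = np.linalg.solve(A, b)          # n weights plus Lagrange multiplier
    return w[:n] @ values

# Hypothetical example: interpolate SST at (0.5, 0.5) from 4 stations
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sst = np.array([14.2, 14.8, 13.9, 14.5])
print(ordinary_kriging(coords, sst, np.array([0.5, 0.5])))
```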

Research on ITB Contract Terms Classification Model for Risk Management in EPC Projects: Deep Learning-Based PLM Ensemble Techniques (EPC 프로젝트의 위험 관리를 위한 ITB 문서 조항 분류 모델 연구: 딥러닝 기반 PLM 앙상블 기법 활용)

  • Hyunsang Lee;Wonseok Lee;Bogeun Jo;Heejun Lee;Sangjin Oh;Sangwoo You;Maru Nam;Hyunsik Lee
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.11
    • /
    • pp.471-480
    • /
    • 2023
  • Construction order volume in South Korea grew significantly, from 91.3 trillion won in public orders in 2013 to a total of 212 trillion won in 2021, particularly in the private sector. As the domestic and overseas markets grew, the scale and complexity of EPC (Engineering, Procurement, Construction) projects increased, and risk management of project management and ITB (Invitation to Bid) documents became a critical issue. The time granted to construction companies in the bidding process for an EPC project is limited, and it is extremely challenging to review all the risk terms in the ITB documents due to manpower and cost constraints. Previous research attempted to categorize the risk terms in EPC contract documents and detect them with AI, but practical use was limited by data problems such as limited availability of labeled data and class imbalance. Therefore, this study aims to develop an AI model that categorizes contract terms in detail based on the FIDIC Yellow 2017 (Federation Internationale des Ingenieurs-Conseils contract terms) standard, rather than defining and classifying risk terms as in previous research. A multi-text classification function is necessary because the contract terms that need detailed review may vary with the scale and type of the project. To enhance the performance of the multi-text classification model, we developed an ELECTRA PLM (Pre-trained Language Model) capable of efficiently learning the context of text data from the pre-training stage, and conducted a four-step experiment to validate the model's performance. As a result, the ensemble of the self-developed ITB-ELECTRA model and Legal-BERT achieved the best performance, with a weighted average F1-score of 76% in the classification of 57 contract terms.
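
The abstract reports an ensemble of ITB-ELECTRA and Legal-BERT evaluated with a weighted average F1-score but does not state how the two models are combined. The sketch below assumes probability-level soft voting, one common choice; the blend weight and data are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def soft_vote(probs_a, probs_b, alpha=0.5):
    """Soft-voting ensemble of two text classifiers.

    probs_a, probs_b: (n_docs, n_classes) class probabilities from the two
    models (e.g. an ELECTRA-style PLM and Legal-BERT); alpha blends them.
    """
    return alpha * probs_a + (1 - alpha) * probs_b

# Hypothetical example with 2 documents and 3 contract-term classes
pa = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
pb = np.array([[0.5, 0.3, 0.2], [0.1, 0.7, 0.2]])
pred = soft_vote(pa, pb).argmax(axis=1)       # predicted class per document
y_true = np.array([0, 1])
print(f1_score(y_true, pred, average="weighted"))  # metric cited in the paper
```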

A Recidivism Prediction Model Based on XGBoost Considering Asymmetric Error Costs (비대칭 오류 비용을 고려한 XGBoost 기반 재범 예측 모델)

  • Won, Ha-Ram;Shim, Jae-Seung;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.1
    • /
    • pp.127-137
    • /
    • 2019
  • Recidivism prediction has been a subject of constant research by experts since the early 1970s, but it has become more important as crimes committed by recidivists steadily increase. In particular, after the US and Canada adopted the 'Recidivism Risk Assessment Report' as a decisive criterion in trials and parole screening in the 1990s, research on recidivism prediction became more active, and in the same period empirical studies on recidivism factors began in Korea as well. Although most recidivism prediction studies have so far focused on the factors of recidivism or the accuracy of recidivism prediction, it is important to minimize the misclassification cost, because recidivism prediction has an asymmetric error cost structure. In general, the cost of misclassifying people who will not reoffend as likely to reoffend is lower than the cost of misclassifying people who will reoffend as unlikely to do so, because the former adds only monitoring costs, while the latter incurs social and economic costs. Therefore, in this paper, we propose an XGBoost (eXtreme Gradient Boosting; XGB)-based recidivism prediction model that considers asymmetric error costs. In the first step of the model, XGB, recognized as a high-performance ensemble method in the field of data mining, was applied, and its results were compared with various prediction models such as LOGIT (logistic regression), DT (decision trees), ANN (artificial neural networks), and SVM (support vector machines). In the next step, the classification threshold is optimized to minimize the total misclassification cost, which is the weighted average of the FNE (False Negative Error) and FPE (False Positive Error). To verify the usefulness of the model, it was applied to a real recidivism prediction dataset. As a result, it was confirmed that the XGB model not only showed better prediction accuracy than the other models but also reduced the misclassification cost most effectively.
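
The threshold-optimization step minimizes the total misclassification cost, a weighted combination of false negatives and false positives. A minimal sketch follows; the cost ratio and classifier scores are illustrative assumptions, since the abstract does not give the actual costs used.

```python
import numpy as np

def optimal_threshold(y_true, y_prob, cost_fn=5.0, cost_fp=1.0):
    """Choose the classification threshold minimizing asymmetric error cost.

    cost_fn: cost of a false negative (a reoffender predicted as safe)
    cost_fp: cost of a false positive (a non-reoffender predicted as risky)
    The cost values are hypothetical, not the paper's actual ratio.
    """
    best_t, best_cost = 0.5, np.inf
    for t in np.linspace(0.01, 0.99, 99):
        pred = (y_prob >= t).astype(int)
        fn = np.sum((pred == 0) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Hypothetical scores from any probabilistic classifier (e.g. XGB)
rng = np.random.default_rng(6)
y = rng.integers(0, 2, 500)
prob = np.clip(y * 0.6 + rng.normal(0.2, 0.25, 500), 0, 1)
print(optimal_threshold(y, prob))  # a lower threshold when FN is costlier
```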