• Title/Summary/Keyword: 층화샘플링

Search Result 7, Processing Time 0.026 seconds

Development of fecal coliform prediction model using random forest method (랜덤포레스트기법을 이용한 분변성대장균 예측모델 개발)

  • Seo, Il Won;Choi, Soo Yeon
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2016.05a
    • /
    • pp.124-124
    • /
    • 2016
  • 하천에서의 분변성대장균은 분변성 오염 정도를 나타내는 지표로서, 이 농도가 높을수록 오염된 하천수와의 접촉을 통한 호흡기, 소화기 및 피부 관련 질병의 발발 확률이 높다고 알려져 있다. 따라서 하천에서의 수영, 수상스키 등과 같은 입수형 친수활동을 할 때, 분변성대장균 농도가 농도 기준 이하인지를 확인하고 이러한 정보를 친수활동에 이용할 필요가 있다. 그러나 분변성대장균의 경우, 현재 자동수질측정망에서 측정되고 있는 다른 수질인자들과는 달리 실시간 측정이 불가능하다고 알려져 있다. 분변성대장균을 측정하는데 있어 최소 18시간 이상이 필요하며, 이러한 분변성대장균 측정 방식은 하천 이용자들이 안전한 친수활동을 영위하는데 있어 적절한 수질 정보를 제공하지 못한다. 그러므로 분변성대장균을 예측하는 모델을 개발하고, 이를 이용하여 실시간 분변성대장균 정보를 생성하여 하천 이용자들에게 제공할 필요가 있다. 본 연구에서는 친수활동이 활발하게 이루어지는 곳 중 하나인 북한강의 대성리 지점에 대해 데이터 기반 모델을 이용하여 분변성대장균을 예측하였다. 데이터 기반 모델은 물리 기반 모델에서 필요한 지형데이터나 비점오염원 등의 초기 오염물의 양에 대한 데이터를 필요로 하지 않고, 대신 독립변수로 사용되는 기상 및 수질데이터를 필요로 한다. 이러한 기상 및 수질데이터는 기존 기상관측소, 수질관측소에서 매일 자동으로 측정되기 때문에 데이터 기반 모델은 물리 기반 모델에 비해 입력데이터를 구성하기가 쉽다는 장점을 지닌다. 이러한 데이터 기반 모델 중 분류 모델은 회귀 모델과 달리 분변성대장균 농도가 일정 수질기준 이상을 넘는지를 바로 예측할 수 있다. 본 연구에서는 분류 모델 중 높은 예측력을 가진다고 알려진 랜덤포레스트(random forest) 기법을 이용하여 분변성대장균 예측 모델을 개발하였다. 분변성대장균 예측 모델은 주어진 기상 및 수질 조건에 대해 분변성대장균이 200 CFU/100ml가 넘는지를 예측하였다. 예측된 분변성대장균이 기준을 넘는 경우를 2등급, 넘지 않는 경우를 1등급으로 명명하였다. 모델을 개발하기 위하여 북한강 대성리 인근 측정소에서 2010년부터 2015년까지 측정된 기상 및 수질데이터를 수집하였다. 수집한 데이터를 훈련 및 검증데이터로 샘플링하였으며, 이 때 샘플링한 데이터가 기존 데이터가 가지고 있던 등급별 비율을 유지하기 위하여 층화샘플링을 하였다. 본 연구에서는 샘플링에 의한 불확실성을 줄이기 위하여 랜덤하게 50번 샘플링된 각각의 훈련데이터에 대해 모델을 개발하였다. 50개의 모델의 검증 결과를 종합한 결과, 전체 예측률은 0.139로 나타났다.

  • PDF

Comparison of Sampling Techniques for Passive Internet Measurement: An Inspection using An Empirical Study (수동적 인터넷 측정을 위한 샘플링 기법 비교: 사례 연구를 통한 검증)

  • Kim, Jung-Hyun;Won, You-Jip;Ahn, Soo-Han
    • Journal of the Institute of Electronics Engineers of Korea TC
    • /
    • v.45 no.6
    • /
    • pp.34-51
    • /
    • 2008
  • Today, the Internet is a part of our life. For that reason, we regard revealing characteristics of Internet traffic as an important research theme. However, Internet traffic cannot be easily manipulated because it usually occupy huge capacity. This problem is a serious obstacle to analyze Internet traffic. Many researchers use various sampling techniques to reduce capacity of Internet traffic. In this paper, we compare several famous sampling techniques, and propose efficient sampling scheme. We chose some sampling techniques such as Systematic Sampling, Simple Random Sampling and Stratified Sampling with some sampling intensities such as 1/10, 1/100 and 1/1000. Our observation focused on Traffic Volume, Entropy Analysis and Packet Size Analysis. Both the simple random sampling and the count-based systematic sampling is proper to general case. On the other hand, time-based systematic sampling exhibits relatively bad results. The stratified sampling on Transport Layer Protocols, e.g.. TCP, UDP and so on, shows superior results. Our analysis results suggest that efficient sampling techniques satisfactorily maintain variation of traffic stream according to time change. The entropy analysis endures various sampling techniques well and fits detecting anomalous traffic. We found that a traffic volume diminishment caused by bottleneck could induce wrong results on the entropy analysis. We discovered that Packet Size Distribution perfectly tolerate any packet sampling techniques and intensities.

A New Statistical Sampling Method for Reducing Computing time of Machine Learning Algorithms (기계학습 알고리즘의 컴퓨팅시간 단축을 위한 새로운 통계적 샘플링 기법)

  • Jun, Sung-Hae
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.21 no.2
    • /
    • pp.171-177
    • /
    • 2011
  • Accuracy and computing time are considerable issues in machine learning. In general, the computing time for data analysis is increased in proportion to the size of given data. So, we need a sampling approach to reduce the size of training data. But, the accuracy of constructed model is decreased by going down the data size simultaneously. To solve this problem, we propose a new statistical sampling method having similar performance to the total data. We suggest a rule to select optimal sampling techniques according to given data structure. This paper shows a sampling method for reducing computing time with keeping the most of accuracy using cluster sampling, stratified sampling, and systematic sampling. We verify improved performance of proposed method by accuracy and computing time between sample data and total data using objective machine learning data sets.

A stratified random sampling design for paddy fields: Optimized stratification and sample allocation for effective spatial modeling and mapping of the impact of climate changes on agricultural system in Korea (농지 공간격자 자료의 층화랜덤샘플링: 농업시스템 기후변화 영향 공간모델링을 위한 국내 농지 최적 층화 및 샘플 수 최적화 연구)

  • Minyoung Lee;Yongeun Kim;Jinsol Hong;Kijong Cho
    • Korean Journal of Environmental Biology
    • /
    • v.39 no.4
    • /
    • pp.526-535
    • /
    • 2021
  • Spatial sampling design plays an important role in GIS-based modeling studies because it increases modeling efficiency while reducing the cost of sampling. In the field of agricultural systems, research demand for high-resolution spatial databased modeling to predict and evaluate climate change impacts is growing rapidly. Accordingly, the need and importance of spatial sampling design are increasing. The purpose of this study was to design spatial sampling of paddy fields (11,386 grids with 1 km spatial resolution) in Korea for use in agricultural spatial modeling. A stratified random sampling design was developed and applied in 2030s, 2050s, and 2080s under two RCP scenarios of 4.5 and 8.5. Twenty-five weather and four soil characteristics were used as stratification variables. Stratification and sample allocation were optimized to ensure minimum sample size under given precision constraints for 16 target variables such as crop yield, greenhouse gas emission, and pest distribution. Precision and accuracy of the sampling were evaluated through sampling simulations based on coefficient of variation (CV) and relative bias, respectively. As a result, the paddy field could be optimized in the range of 5 to 21 strata and 46 to 69 samples. Evaluation results showed that target variables were within precision constraints (CV<0.05 except for crop yield) with low bias values (below 3%). These results can contribute to reducing sampling cost and computation time while having high predictive power. It is expected to be widely used as a representative sample grid in various agriculture spatial modeling studies.

Design-based Variance Estimation under stratified Multi-stage Sampling (층화 다단계 샘플링에서 설계 기반 분산추정)

  • 김규성
    • Proceedings of the Korean Association for Survey Research Conference
    • /
    • 2001.04a
    • /
    • pp.59-71
    • /
    • 2001
  • We investigate design-based variance estimation methods of homogeneous linear estimator for population total under stratified multi-stage sampling. One method is unbiasedly estimating the first stage variance and the second stage variance separately in each stratum. And another is sub-sampling method that estimating the first stage variance only by using sub-sample selected from the second stage sample so that resulting estimator is unbiased for the total variance. The first is useful when the second stage unbiased estimator is available and the second is when the second stage variance is not estimable. For each case, we proposed a form of non-negative unbiased variance estimator. We expect the proposed variance estimation methods can be effectively used for many practical surveys.

Design-based Variance Estimation under Stratified Multi-stage Sampling (층화 다단계 샘플링에서 설계 기반 분산추정)

  • 김규성
    • Survey Research
    • /
    • v.2 no.1
    • /
    • pp.59-71
    • /
    • 2001
  • We investigate design-based variance estimation methods of homogeneous linear estimator for population total under stratified multi-stage sampling. One method is unbiasedly estimating the first stage variance and the second stage variance separately in each stratum. And another is sub-sampling method that estimating the first stage variance only by using sub-sample selected from the second stage sample so that resulting estimator is unbiased for the total variance. The first is useful when the second stage unbiased estimator is available and the second is when the second stage variance is not estimable. For each case, we proposed a form of non-negative unbiased variance estimator. We expect the proposed variance estimation methods can be effectively used for many practical surveys.

  • PDF

Identification of User Preference Factor Using Review Information (리뷰 정보를 활용한 이용자의 선호요인 식별에 관한 연구)

  • Song, Sungjeon;Shim, Jiyoung
    • Journal of the Korean Society for information Management
    • /
    • v.39 no.3
    • /
    • pp.311-336
    • /
    • 2022
  • This study analyzed the contents of Goodreads review data, which is a social cataloging service with the participation of book users around the world, to identify the preference factors that affect book users' book recommendations in the library information service environment. To understand user preferences from a more detailed point of view, sub-datasets for each rating group, each book, and each user were constructed in the sample selection process. Stratified sampling was also performed based on the result of topic modeling of review text data to include various topics. As a result, a total of 90 preference factors belonging to 7 categories('Content', 'Character', 'Writing', 'Reading', 'Author', 'Story', 'Form') were identified. Also, the general preference factors revealed according to the ratings, as well as the patterns of preference factors revealed in books and users with clear likes and dislikes were identified. The results of this study are expected to contribute to more sophisticated recommendations in future recommendation systems by identifying specific aspects of user preference factors.