Search | Korea Science

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

Kim, HanYong;Lee, Woojoo
- The Korean Journal of Applied Statistics, v.30 no.5, pp.681-690, 2017
Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and prediction of defective products. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group is dominant. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods on imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
https://doi.org/10.5351/KJAS.2017.30.5.681
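As a rough illustration of two of the sampling methods named in this abstract, the sketch below implements plain random over-sampling and random under-sampling in standard-library Python. The function names, the `seed` parameter, and the toy data are assumptions made for illustration; this is not the authors' code, and SMOTE (which synthesizes new minority points rather than duplicating existing ones) is not shown.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class is as large as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

One commonly cited precaution, which may or may not be among those the paper has in mind, is to resample only the training split so that evaluation data keep the original class proportions.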

Selecting the optimal threshold based on impurity index in imbalanced classification (불균형 자료에서 불순도 지수를 활용한 분류 임계값 선택)

Jang, Shuin;Yeo, In-Kwon
- The Korean Journal of Applied Statistics, v.34 no.5, pp.711-721, 2021
In this paper, we propose a method for adjusting classification thresholds using impurity indices when the data are imbalanced. Suppose that, for imbalanced binary data, the minority category is Positive and the majority category is Negative. When categories are assigned using the commonly used threshold of 0.5, specificity tends to be high on imbalanced data while sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important. We explore how to increase sensitivity by adjusting the threshold. Existing studies have adjusted thresholds based on measures such as G-Mean and F1-score; in this paper, we propose selecting the optimal threshold using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also describe how to obtain a unique value when multiple optimal thresholds are obtained. Empirical analysis shows, through classification performance metrics, the improvement over results based on the 0.5 threshold.
https://doi.org/10.5351/KJAS.2021.34.5.711
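The idea of choosing a threshold by an impurity index can be sketched as follows, assuming a CART-style reading: each candidate threshold splits the cases into "predicted negative" and "predicted positive", and the threshold with the lowest size-weighted Gini impurity of the true labels is kept. The function names, the toy scores, and the tie-break (smallest tied threshold) are assumptions for illustration; the paper's own procedure for resolving ties is not reproduced here.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def weighted_gini(scores, labels, t):
    """Size-weighted Gini impurity of the two groups induced by threshold t."""
    below = [y for s, y in zip(scores, labels) if s < t]
    above = [y for s, y in zip(scores, labels) if s >= t]
    n = len(labels)
    return len(below) / n * gini(below) + len(above) / n * gini(above)


def best_threshold(scores, labels):
    """Candidate thresholds are the observed scores; ties on impurity are
    broken by taking the smallest threshold (an arbitrary assumption)."""
    candidates = sorted(set(scores))
    return min(candidates, key=lambda t: (weighted_gini(scores, labels, t), t))


# Toy predicted probabilities for an imbalanced sample (1 = minority, Positive).
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
t_star = best_threshold(scores, labels)
```

On this toy sample every positive case scores below 0.5, so the default threshold would give zero sensitivity, whereas the impurity-based choice separates the two classes.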

Binomial Sampling Plans for the Citrus Red Mite, Panonychus citri(Acari: Tetranychidae) on Satsuma Mandarin Groves in Jeju (온주밀감에서 귤응애의 이항표본조사법 개발)

송정흡;이창훈;강상훈;김동환;강시용;류기중
- Korean journal of applied entomology, v.40 no.3, pp.197-202, 2001
The density of citrus red mite (CRM), Panonychus citri (McGregor), in commercial satsuma mandarin (Citrus unshiu L.) groves in Jeju was determined over 2 years by counting the number of CRM per leaf in leaf samples. Binomial sampling plans were developed based on the relationship between the mean density per leaf ($m$) and the proportion of leaves infested with less than T mites per leaf ($P_{T}$), according to the empirical model $\ln(m)=\alpha+\beta\ln(-\ln(1-P_{T}))$. T was defined as the tally threshold and was set to 1, 3, 5, and 7 mites per leaf in this study. Increasing the sample size had little effect on the precision of the estimated mean, regardless of tally threshold. T=1 was chosen as the best tally threshold for estimating densities of CRM based on the precision of the model. The binomial model with T=1 provided reliable predictions of the mean densities of CRM observed in the commercial satsuma mandarin groves. A binomial sequential sampling procedure was developed for classifying the density of CRM. A binomial sampling program for decision-making on the CRM population level, based on an action threshold of 2 mites per leaf, was obtained.
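The empirical model in this abstract can be inverted to estimate mean density from an observed proportion, as in the minimal sketch below. The function name and the coefficient values are hypothetical; the abstract does not report the fitted values of α and β, and `p_t` is simply the proportion $P_{T}$ as the abstract defines it.

```python
import math


def mean_density(p_t, alpha, beta):
    """Mean mites per leaf from ln(m) = alpha + beta * ln(-ln(1 - P_T)).

    alpha and beta are regression coefficients that would come from a
    fitted model; the values used below are placeholders, not the paper's.
    """
    if not 0.0 < p_t < 1.0:
        raise ValueError("P_T must lie strictly between 0 and 1")
    return math.exp(alpha + beta * math.log(-math.log(1.0 - p_t)))


# Placeholder coefficients, for illustration only.
ALPHA, BETA = 0.0, 1.0
m_half = mean_density(0.5, ALPHA, BETA)
```

With a positive β the estimate increases with $P_{T}$, which is what makes a count of leaves, rather than a count of mites, usable as a density estimate.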

Special Feature - Strategies for Strengthening the Innovation Capacity of the Tire Industry in an Age of Uncertainty

Lee, Hang-Gu
- The tire, s.246, pp.10-19, 2011
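Amid rising uncertainty in the global economy following the great earthquake in Japan, automotive parts exports have surged recently, and the automotive parts industry is emerging as a new growth-engine industry. Korea's tire exports are increasing despite tire makers' expansion of overseas production, but they attract little attention because tires are not classified as automotive parts in the domestic industrial classification. Backed by sustained R&D investment, the domestic tire industry has strengthened its innovation capacity, grown into a global industry, and is expanding its share of the world market. This article reviews the export status of the domestic automotive parts industry, changes in its environment, and the innovation strategies of domestic and foreign automotive parts makers, and then proposes strategies for strengthening the innovation capacity of the domestic tire industry in an age of uncertainty.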

Value of Travel-Time Savings in Metropolitan Road Freight Transportation with Freight Classification Code (화물품목 분류에 따른 대도시권 공로화물운송의 시간가치 산정)

최창호
- Journal of Korean Society of Transportation, v.20 no.7, pp.167-175, 2002
The objective of this study is to reveal shippers' preferences in road freight transport according to commodity classification code. Shippers' preferences in freight transport can be captured by the value of travel-time savings. Freight characteristics vary so much that shippers' preferences also differ widely. There have been a few recent attempts to estimate the value of freight travel-time savings in Korea, but most of them covered only rail or marine freight transport, so they could not yield a value of travel-time savings specific to road freight transport. In this study, the value of travel-time savings of road freight transport was estimated according to commodity classification code. The revealed preference method and associated binomial logit models were applied to estimate the value of travel-time savings in transit, using data from a 1998 Seoul metropolitan commodity flow survey. The data were segmented by commodity classification code, and nineteen binomial logit models were estimated for the segmented groups. The results showed that the value of freight travel-time savings varied widely with commodity classification code, from 16,441 won to 66,769 won per hour per vehicle.
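A value of travel-time savings is conventionally read off a binomial logit model as the ratio of the time coefficient to the cost coefficient. The sketch below assumes a linear utility difference in time and cost only; the abstract does not give the paper's utility specification, and the coefficient values here are invented placeholders, not estimates from the study.

```python
import math


def choice_probability(d_time, d_cost, b_time, b_cost):
    """Binomial logit probability of choosing alternative A over B, given
    the time difference (hours) and cost difference (won), A minus B."""
    v = b_time * d_time + b_cost * d_cost
    return 1.0 / (1.0 + math.exp(-v))


def value_of_time(b_time, b_cost):
    """Marginal rate of substitution between time and cost (won per hour)."""
    return b_time / b_cost


# Placeholder coefficients: both negative, since more time or cost is worse.
B_TIME, B_COST = -2.0, -0.0001
vot = value_of_time(B_TIME, B_COST)

# Saving one hour at an extra cost equal to the value of time leaves the
# utility difference at zero, so the two alternatives are equally likely.
p_indifferent = choice_probability(-1.0, vot, B_TIME, B_COST)
```

Estimating one such model per commodity segment, as the abstract describes, would yield one ratio per segment.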

A comparative study of feature screening methods for ultrahigh dimensional multiclass classification (초고차원 다범주분류를 위한 변수선별 방법 비교 연구)

Lee, Kyungeun;Kim, Kyoung Hee;Shin, Seung Jun
- The Korean Journal of Applied Statistics, v.30 no.5, pp.793-808, 2017
We compare various variable screening methods on multiclass classification problems when the data is ultrahigh-dimensional. Two different approaches were considered: (1) pairwise extension from binary classification via one versus one or one versus rest comparisons and (2) direct classification of multiclass responses. We conducted extensive simulation studies under different conditions: heavy tailed explanatory variables, correlated signal and noise variables, correlated joint distributions but uncorrelated marginals, and unbalanced response variables. We then analyzed real data to examine the performance of the methods. The results showed that model-free methods perform better for multiclass classification problems as well as binary ones.
https://doi.org/10.5351/KJAS.2017.30.5.793

Rule-Based Classification Analysis Using Entropy Distribution (엔트로피 분포를 이용한 규칙기반 분류분석 연구)

Lee, Jung-Jin;Park, Hae-Ki
- Communications for Statistical Applications and Methods, v.17 no.4, pp.527-540, 2010
Rule-based classification analysis is widely used for mining massive data because it is easy to understand and its algorithm is uncomplicated. In this type of analysis, rules are frequently combined by majority vote or by a weighted combination using their supports. In this paper, we propose a method to combine rules using the multinomial distribution. An iterative proportional fitting algorithm is used to estimate the multinomial distribution that maximizes entropy subject to constraints on the rules' supports. Simulation experiments show that this method can compete with other well-known classification models in the case of two similar populations.
https://doi.org/10.5351/CKSS.2010.17.4.527

A Study on Degeneration among Partially Balanced Incomplete Block Designs

배종성
- Communications for Statistical Applications and Methods, v.2 no.2, pp.387-394, 1995
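In a binary equireplicate block design, the structure of the concordance matrix is used to classify and analyze incomplete block designs. Using the structure of the concordance matrix, we show the association conditions under which several partially balanced incomplete block designs with three associate classes degenerate into partially balanced incomplete block designs with two associate classes. We also show that the layout of a group divisible design can be constructed using the association relation of the triangular design with six treatments.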

Implementation of a Real-Time Endpoint Detection Algorithm Using TMS320C30 (TMS320C30을 이용한 실시간 음성부 검출 알고리즘 구현)

이항섭
- Proceedings of the Acoustical Society of Korea Conference, 1993.06a, pp.229-232, 1993
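This paper describes the implementation of a recently developed real-time endpoint detection algorithm [1] using a TMS320C30 system board and an IBM PC486. The endpoint detection algorithm classifies each frame as speech or silence using energy and LCR (level crossing rate). With the DSP board, each incoming frame is classified as speech or silence before the next frame arrives, so that as soon as the speech input ends, only the portions judged to be speech are stored in DSP memory. This reduces unnecessary memory use and saves time for the next stage of speech processing. To evaluate its performance, the algorithm was compared, along with the algorithm of Rabiner and Sambur and the algorithm of Han Min-soo (한민수), against endpoints located manually by an expert. The average error of the algorithm was 4.925 ms for male speakers and 5.85 ms for female speakers, which is within one frame.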
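The frame-by-frame speech/silence decision described in this abstract can be sketched as below. This is an illustrative reconstruction, not the paper's algorithm: the abstract names only the two features (energy and level crossing rate), so the exact feature definitions, the OR decision rule, the threshold values, and all function names are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def level_crossings(frame, level):
    """Number of times consecutive samples fall on opposite sides of
    `level` (one possible definition of a level crossing count)."""
    return sum(1 for a, b in zip(frame, frame[1:])
               if (a - level) * (b - level) < 0)


def is_speech(frame, energy_thr, lcr_thr, level):
    """Assumed rule: a frame is speech if either feature exceeds its threshold."""
    return (frame_energy(frame) > energy_thr
            or level_crossings(frame, level) > lcr_thr)


def keep_speech_frames(frames, energy_thr, lcr_thr, level):
    """Classify each frame as it arrives and store only the speech frames,
    mimicking the memory-saving behavior described above."""
    stored = []
    for frame in frames:
        if is_speech(frame, energy_thr, lcr_thr, level):
            stored.append(frame)
    return stored


# Toy frames: flat silence and an alternating high-amplitude "speech" frame.
silence = [0.0] * 8
speech = [1.0, -1.0] * 4
kept = keep_speech_frames([silence, speech, silence],
                          energy_thr=0.1, lcr_thr=3, level=0.5)
```

Because the decision for a frame depends only on that frame, it can be made before the next frame arrives, which is the property the real-time implementation relies on.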

Novel Intent Category Discovery using Contrastive Learning (대조학습을 활용한 새로운 의도 카테고리 발견)

Seungyeon Seo;Gary Geunbae Lee
- Annual Conference on Human and Language Technology, 2023.10a, pp.107-112, 2023
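Because labeled data are difficult to collect, research on semi-supervised and unsupervised learning, which learn from unlabeled data, is being actively pursued. As part of this line of work, this paper proposes the Novel Intent Category Discovery (NICD) problem and introduces a model to serve as a baseline for NICD research. The NICD problem differs from existing semi-supervised learning problems in that the class sets of the labeled data and the unlabeled data do not overlap. The proposed model is built on RoBERTa with two added classifiers, and it predicts labels with a different classifier for the labeled dataset and for the unlabeled dataset. Training has two stages. In the first stage, feature representations are learned on the labeled dataset. In the second stage, cross-entropy, binary cross-entropy, mean squared error, and supervised contrastive loss functions are adapted to the NICD problem and used for training. On the unlabeled dataset, the proposed model achieved an accuracy 24.74 higher than the best-performing model from the image domain.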
