• Title/Summary/Keyword: Oversampling

Search Result 141, Processing Time 0.025 seconds

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon;Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.11
    • /
    • pp.455-464
    • /
    • 2022
  • There are two unique characteristics of the datasets from a manufacturing process. They are the severe class imbalance and lots of Out-of-Distribution samples. Some good strategies such as the oversampling over the minority class, and the down-sampling over the majority class, are well known to handle the class imbalance. In addition, SMOTE has been chosen to address the issue recently. But, Out-of-Distribution samples have been studied just with neural networks. It seems to be hardly shown that Out-of-Distribution detection is applied to the predictive model using conventional machine learning algorithms such as SVM, Random Forest and KNN. It is known that conventional machine learning algorithms are much better than neural networks in prediction performance, because neural networks are vulnerable to over-fitting and requires much bigger dataset than conventional machine learning algorithms does. So, we suggests a new approach to utilize Out-of-Distribution detection based on SVM algorithm. In addition to that, bagging technique will be adopted to improve the precision of the model.

Performance Comparison of Neural Network and Gradient Boosting Machine for Dropout Prediction of University Students

  • Hyeon Gyu Kim
    • Journal of the Korea Society of Computer and Information
    • /
    • v.28 no.8
    • /
    • pp.49-58
    • /
    • 2023
  • Dropouts of students not only cause financial loss to the university, but also have negative impacts on individual students and society together. To resolve this issue, various studies have been conducted to predict student dropout using machine learning. This paper presents a model implemented using DNN (Deep Neural Network) and LGBM (Light Gradient Boosting Machine) to predict dropout of university students and compares their performance. The academic record and grade data collected from 20,050 students at A University, a small and medium-sized 4-year university in Seoul, were used for learning. Among the 140 attributes of the collected data, only the attributes with a correlation coefficient of 0.1 or higher with the attribute indicating dropout were extracted and used for learning. As learning algorithms, DNN (Deep Neural Network) and LightGBM (Light Gradient Boosting Machine) were used. Our experimental results showed that the F1-scores of DNN and LGBM were 0.798 and 0.826, respectively, indicating that LGBM provided 2.5% better prediction performance than DNN.

Sparse Class Processing Strategy in Image-based Livestock Defect Detection (이미지 기반 축산물 불량 탐지에서의 희소 클래스 처리 전략)

  • Lee, Bumho;Cho, Yesung;Yi, Mun Yong
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.11
    • /
    • pp.1720-1728
    • /
    • 2022
  • The industrial 4.0 era has been opened with the development of artificial intelligence technology, and the realization of smart farms incorporating ICT technology is receiving great attention in the livestock industry. Among them, the quality management technology of livestock products and livestock operations incorporating computer vision-based artificial intelligence technology represent key technologies. However, the insufficient number of livestock image data for artificial intelligence model training and the severely unbalanced ratio of labels for recognizing a specific defective state are major obstacles to the related research and technology development. To overcome these problems, in this study, combining oversampling and adversarial case generation techniques is proposed as a method necessary to effectively utilizing small data labels for successful defect detection. In addition, experiments comparing performance and time cost of the applicable techniques were conducted. Through experiments, we confirm the validity of the proposed methods and draw utilization strategies from the study results.

Data Sampling Using Oversampling Technique for Estimating Two-Dimensional Dispersion Coefficients (2차원 분산계수 경험식 산정을 위한 오버샘플링 기법 활용 데이터 샘플링)

  • Lee, Sun Mi;Park, In Hwan
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2021.06a
    • /
    • pp.449-449
    • /
    • 2021
  • 하천 내 오염물질 유입원은 하수처리장과 같이 농도를 예측 가능한 점오염원이 일반적이지만, 수질오염사고와 같이 다량의 유해물질이 일시에 하천에 유입되는 경우도 발생하곤 한다. 특히 오염물질 유입지점과 취수장이 인접한 경우, 오염물질 혼합해석에 대한 이해가 오염사고 대응 및 수질 관리 측면에서 매우 중요하다. 자연하천에서는 사행에 따른 유속 구조의 불균일성 등으로 인하여 오염물질의 이송 및 분산 과정은 매우 복잡하게 나타난다. 이러한 하천의 지형적, 수리학적 특성이 오염물질의 혼합 거동에 미치는 영향을 정확하게 모의하기 위해서는 3차원 수치모형을 적용해야 한다. 그러나 대부분의 하천은 하폭 대 수심비가 매우 크기 때문에 2차원 이송-분산 방정식을 지배방정식으로 채택하는 2차원 수치 모형이 널리 사용되어왔다. 2차원 이송-분산 방정식의 해석결과는 입력된 종, 횡 분산계수의 값에 따라 변화하기 때문에 정확한 혼합해석을 위해 분산계수의 결정이 매우 중요하다. 과거 연구에서는 횡 분산계수의 결정을 위해 기본 수리량을 이용한 경험식을 활용하여 계산한 바 있다. 종 분산계수의 경우에는 경험식의 산정에 필요한 충분한 실험 자료가 축적되어 있지 않아 이상적 흐름 상태를 가정하여 유도된 Elder의 이론식(Elder, 1959)을 사용해왔다. 하지만 많은 연구에서 이러한 Elder의 이론식이 종 분산계수를 과소산정 할 우려가 있다고 제시했다. 따라서 하천의 전단류 분산특성을 나타낼 수 있는 데이터 확보를 통해 종 분산계수의 경험식 산정 및 횡 분산계수의 정확도 향상이 필요한 상황이다. 본 연구에서는 기존 선행 연구에서 수행된 2차원 추적자실험 데이터의 확장을 위해 오버샘플링 기법을 적용하였으며, 이를 통한 머신러닝을 통한 분산계수 산정 가능성을 분석하고자 한다. 부족한 추적자 실험 데이터를 확장하기 위해 오버샘플링 기법 중 SMOTE 기법을 활용했다. 오버샘플링 기법을 이용하여 생산된 데이터의 신뢰성을 검증하였으며, 추후 머신러닝을 이용한 2차원 종, 횡 분산계수 산정에 대한 활용 가능성을 분석했다.

  • PDF

Application and Comparison of Data Mining Technique to Prevent Metal-Bush Omission (메탈부쉬 누락예방을 위한 데이터마이닝 기법의 적용 및 비교)

  • Sang-Hyun Ko;Dongju Lee
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.46 no.3
    • /
    • pp.139-147
    • /
    • 2023
  • The metal bush assembling process is a process of inserting and compressing a metal bush that serves to reduce the occurrence of noise and stable compression in the rotating section. In the metal bush assembly process, the head diameter defect and placement defect of the metal bush occur due to metal bush omission, non-pressing, and poor press-fitting. Among these causes of defects, it is intended to prevent defects due to omission of the metal bush by using signals from sensors attached to the facility. In particular, a metal bush omission is predicted through various data mining techniques using left load cell value, right load cell value, current, and voltage as independent variables. In the case of metal bush omission defect, it is difficult to get defect data, resulting in data imbalance. Data imbalance refers to a case where there is a large difference in the number of data belonging to each class, which can be a problem when performing classification prediction. In order to solve the problem caused by data imbalance, oversampling and composite sampling techniques were applied in this study. In addition, simulated annealing was applied for optimization of parameters related to sampling and hyper-parameters of data mining techniques used for bush omission prediction. In this study, the metal bush omission was predicted using the actual data of M manufacturing company, and the classification performance was examined. All applied techniques showed excellent results, and in particular, the proposed methods, the method of mixing Random Forest and SA, and the method of mixing MLP and SA, showed better results.

Classification Abnormal temperatures based on Meteorological Environment using Random forests (랜덤포레스트를 이용한 기상 환경에 따른 이상기온 분류)

  • Youn Su Kim;Kwang Yoon Song;In Hong Chang
    • Journal of Integrative Natural Science
    • /
    • v.17 no.1
    • /
    • pp.1-12
    • /
    • 2024
  • Many abnormal climate events are occurring around the world. The cause of abnormal climate is related to temperature. Factors that affect temperature include excessive emissions of carbon and greenhouse gases from a global perspective, and air circulation from a local perspective. Due to the air circulation, many abnormal climate phenomena such as abnormally high temperature and abnormally low temperature are occurring in certain areas, which can cause very serious human damage. Therefore, the problem of abnormal temperature should not be approached only as a case of climate change, but should be studied as a new category of climate crisis. In this study, we proposed a model for the classification of abnormal temperature using random forests based on various meteorological data such as longitudinal observations, yellow dust, ultraviolet radiation from 2018 to 2022 for each region in Korea. Here, the meteorological data had an imbalance problem, so the imbalance problem was solved by oversampling. As a result, we found that the variables affecting abnormal temperature are different in different regions. In particular, the central and southern regions are influenced by high pressure (Mainland China, Siberian high pressure, and North Pacific high pressure) due to their regional characteristics, so pressure-related variables had a significant impact on the classification of abnormal temperature. This suggests that a regional approach can be taken to predict abnormal temperatures from the surrounding meteorological environment. In addition, in the event of an abnormal temperature, it seems that it is possible to take preventive measures in advance according to regional characteristics.

Design of a Inverter-Based 3rd Order ΔΣ Modulator Using 1.5bit Comparators (1.5비트 비교기를 이용한 인버터 기반 3차 델타-시그마 변조기)

  • Choi, Jeong Hoon;Seong, Jae Hyeon;Yoon, Kwang Sub
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.53 no.7
    • /
    • pp.39-46
    • /
    • 2016
  • This paper describes the third order feedforward delta-sigma modulator with inverter-based integrators and a 1.5bit comparator for the application of audio signal processing. The proposed 3rd-order delta-sigma modulator is multi-bit structure using 1.5 bit comparator instead of operational amplifier. This delta-sigma modulator has high SNR compared with single-bit 4th-order delta-sigma modulator in a low OSR. And it minimizes power consumes and simplified circuit structure using inverter-based integrator and using inverter-based integrator as analogue adder. The modulator was designed with 0.18um CMOS standard process and total chip area is $0.36mm^2$. The measured power cosumption is 28.8uW in a 0.8V analog supply and 66.6uW in a 1.8V digital supply. The measurement result shows that the peak SNDR of 80.7 dB, the ENOB of 13.1bit and the dynamic range of 86.1 dB with an input signal frequency of 2.5kHz, a sampling frequency of 2.56MHz and an oversampling rate of 64. The FOM (Walden) from the measurement result is 269 fJ/step, FOM (Schreier) was calculated as 169.3 dB.

An I/O Interface Circuit Using CTR Code to Reduce Number of I/O Pins (CTR 코드를 사용한 I/O 핀 수를 감소 시킬 수 있는 인터페이스 회로)

  • Kim, Jun-Bae;Kwon, Oh-Kyong
    • Journal of the Korean Institute of Telematics and Electronics D
    • /
    • v.36D no.1
    • /
    • pp.47-56
    • /
    • 1999
  • As the density of logic gates of VLSI chips has rapidly increased, more number of I/O pins has been required. This results in bigger package size and higher packager cost. The package cost is higher than the cost of bare chips for high I/O count VLSI chips. As the density of logic gates increases, the reduction method of the number of I/O pins for a given complexity of logic gates is required. In this paper, we propose the novel I/O interface circuit using CTR (Constant-Transition-Rate) code to reduce 50% of the number of I/O pins. The rising and falling edges of the symbol pulse of CTR codes contain 2-bit digital data, respectively. Since each symbol of the proposed CTR codes contains 4-bit digital data, the symbol rate can be reduced by the factor of 2 compared with the conventional I/O interface circuit. Also, the simultaneous switching noise(SSN) can be reduced because the transition rate is constant and the transition point of the symbols is widely distributed. The channel encoder is implemented only logic circuits and the circuit of the channel decoder is designed using the over-sampling method. The proper operation of the designed I/O interface circuit was verified using. HSPICE simulation with 0.6 m CMOS SPICE parameters. The simulation results indicate that the data transmission rate of the proposed circuit using 0.6 m CMOS technology is more than 200 Mbps/pin. We implemented the proposed circuit using Altera's FPGA and confimed the operation with the data transfer rate of 22.5 Mbps/pin.

  • PDF

Design of digital decimation filter for sigma-delta A/D converters (시그마-델타 A/D 컨버터용 디지털 데시메이션 필터 설계)

  • Byun, San-Ho;Ryu, Seong-Young;Choi, Young-Kil;Roh, Hyung-Dong;Nam, Hyun-Seok;Roh, Jeong-Jin
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.44 no.2
    • /
    • pp.34-45
    • /
    • 2007
  • Digital decimation filter is inevitable in oversampled sigma-delta A/D converters for the sake of reducing the oversampled rate to Nyquist rate. This paper presented a Verilog-HDL design and implementation of an area-efficient digital decimation filter that provides time-to-market advantage for sigma-delta analog-to-digital converters. The digital decimation filter consists of CIC(cascaded integrator-comb) filter and two cascaded half-band FIR filters. A CSD(canonical signed digit) representation of filter coefficients is used to minimize area and reduce in hardware complexity of multiplication arithmetic. Coefficient multiplications are implemented by using shifters and adders. This three-stage decimation filter is fabricated in $0.25-{\mu}m$ CMOS technology and incorporates $1.36mm^2$ of active area, shows 4.4 mW power consumption at clock rate of 2.8224 MHz. Measured results show that this digital decimation filter is suitable for digital audio decimation filters.

A 3.2Gb/s Clock and Data Recovery Circuit without Reference Clock for Serial Data Communication (시리얼 데이터 통신을 위한 기준 클록이 없는 3.2Gb/s 클록 데이터 복원회로)

  • Kim, Kang-Jik;Jung, Ki-Sang;Cho, Seong-Ik
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.46 no.2
    • /
    • pp.72-77
    • /
    • 2009
  • In this paper, a 3.2Gb/s clock and data recovery (CDR) circuit for a high-speed serial data communication without the reference clock is described This CDR circuit consists of 5 parts as Phase and frequency detector(PD and FD), multi-phase Voltage Controlled-Oscillator(VCO), Charge-pumps (CP) and external Loop-Filter(KF). It is adapted the PD and FD, which incorporates a half-rate bang-bang type oversampling PD and a half-rate FD that can improve pull-in range. The VCO consists of four fully differential delay cells with rail-to-rail current bias scheme that can increase the tuning range and tuning linearity. Each delay cell has output buffers as a full-swing generator and a duty-cycle mismatch compensation. This materialized CDR can achieve wide pull-in range without an extra reference clock and it can be also reduced chip area and power consumption effectively because there is no additional Phase Locked- Loop(PLL) for generating reference clock. The CDR circuit was designed for fabrication using 0.18um 1P6M CMOS process and total chip area excepted LF is $1{\times}1mm^2$. The pk-pk jitter of recovered clock is 26ps at 3.2Gb/s input data rate and total power consumes 63mW from 1.8V supply voltage according to simulation results. According to test result, the pk-pk jitter of recovered clock is 55ps at the same input data-rate and the reliable range of input data-rate is about from 2.4Gb/s to 3.4Gb/s.