Search | Korea Science

Improving minority prediction performance of support vector machine for imbalanced text data via feature selection and SMOTE (단어선택과 SMOTE 알고리즘을 이용한 불균형 텍스트 데이터의 소수 범주 예측성능 향상 기법)

Jongchan Kim;Seong Jun Chang;Won Son
- The Korean Journal of Applied Statistics / v.37 no.4 / pp.395-410 / 2024
Text data usually consists of a wide variety of unique words. Even in standard text data, it is common to find tens of thousands of different words. In text data analysis, each unique word is usually treated as a variable, so text data can be regarded as a dataset with a large number of variables. In text data classification, we also often encounter class label imbalance. In cases of substantial imbalance, the performance of conventional classification models can be severely degraded. To improve the classification performance of support vector machines (SVM) for imbalanced data, algorithms such as the Synthetic Minority Over-sampling Technique (SMOTE) can be used. The SMOTE algorithm synthetically generates new observations for the minority class based on the k-Nearest Neighbors (kNN) algorithm. However, in datasets with a large number of variables, such as text data, errors may accumulate, which can affect the performance of the kNN algorithm. In this study, we propose a method for enhancing prediction performance for the minority class of imbalanced text data. Our approach employs variable selection so that new synthetic observations are generated in a reduced space, thereby improving the overall classification performance of SVM.
https://doi.org/10.5351/KJAS.2024.37.4.395
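
The idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the mean-difference score used to pick variables is an assumption (the abstract does not state the selection criterion), and the names `select_features` and `smote` and the toy term-count matrix are hypothetical. Synthetic minority rows are produced by interpolating between a minority row and one of its k nearest minority neighbors, using only the selected columns.

```python
import math
import random

def select_features(X, y, n_keep):
    """Rank columns by |mean(minority) - mean(majority)| and keep the top n_keep.
    This scoring rule is an illustrative assumption, not the paper's criterion."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    top = sorted(scores, reverse=True)[:n_keep]
    return sorted(j for _, j in top)

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating toward one of the
    k nearest minority neighbors (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [idx for idx in range(len(minority)) if idx != i]
        others.sort(key=lambda idx: math.dist(base, minority[idx]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy term-count matrix: columns 0 and 2 separate the classes, 1 and 3 are noise.
X = [[5, 1, 4, 0], [6, 0, 5, 1], [5, 1, 6, 0],
     [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
y = [1, 1, 1, 0, 0, 0, 0, 0]

kept = select_features(X, y, n_keep=2)
reduced_minority = [[row[j] for j in kept] for row, label in zip(X, y) if label == 1]
new_rows = smote(reduced_minority, n_new=2, k=2)
```

In a full pipeline, the synthetic rows would be appended to the reduced training set before fitting the SVM; that step is omitted here.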

Recognizing Chord Symbols in Printed Korean Musical Images Using Lexicon-Driven Approach

Dinh, Minh;Yang, Hyung-Jeong;Lee, Guee-Sang;Kim, Soo-Hyung;Na, In-Seop
- Proceedings of the Korea Contents Association Conference / 2015.05a / pp.53-54 / 2015
Optical music recognition (OMR) systems have been developed in recent years. However, chord symbols, which play a role in sheet music, have still been disregarded. Therefore, we aimed to develop a suitable approach to recognizing these chord symbols. First, we divide the chord symbol image horizontally into small segments using a method based on vertical projection. Then, the optimal combination of these segments is found using a lexicon-driven word scoring technique and a nearest neighbor classifier. The word that corresponds to the optimal combination is the recognition result. The experiment yields an accuracy of 97.32%.
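
The segmentation step described above can be illustrated with a small sketch. It assumes a binary image stored as a list of rows, with 1 for ink and 0 for background, and splits wherever a column has no ink. The function names are hypothetical, and the lexicon-driven scoring and nearest neighbor classification stages are not covered.

```python
def vertical_projection(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(col) for col in zip(*image)]

def split_columns(image):
    """Return (start, end) column ranges separated by empty columns.
    The end index is exclusive."""
    profile = vertical_projection(image)
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of inked columns begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # the run ends at an empty column
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

# Toy 3x8 image containing three separate glyph fragments.
image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
segments = split_columns(image)
```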

Shape Feature Extraction technique for Content-Based Image Retrieval in Multimedia Databases

Kim, Byung-Gon;Han, Joung-Woon;Lee, Jaeho;Haechull Lim
- Proceedings of the IEEK Conference / 2000.07b / pp.869-872 / 2000
Although many content-based image retrieval systems using shape features have tried to achieve rotation, position, and scale invariance between images, they have had difficulty covering all three kinds of invariance at the same time. In this paper, we introduce a new approach to extracting shape features from an image using the MBR (Minimum Bounding Rectangle). The proposed method scans the image to extract MBR information and, based on that information, computes contour information consisting of 16 points. The extracted information is converted to specific values by normalization and rotation. The proposed method can cover all three kinds of invariance at the same time. We implemented our method and carried out experiments: we constructed an R*-tree indexing structure, performed k-nearest neighbor search from a query image, and demonstrated the capability and usefulness of our method.
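
As a rough illustration of the MBR step, the sketch below computes an axis-aligned bounding rectangle of the foreground pixels and maps a pixel into coordinates relative to that rectangle, which removes the effect of position and scale. It assumes a binary image given as a list of rows; the function names are hypothetical, and the paper's 16-point contour extraction and rotation handling are not reproduced.

```python
def minimum_bounding_rectangle(image):
    """Return (min_row, min_col, max_row, max_col) of foreground pixels,
    or None if the image has no foreground."""
    coords = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)

def normalize_to_mbr(point, mbr):
    """Map a pixel (row, col) to [0, 1] x [0, 1] relative to the MBR."""
    r, c = point
    r0, c0, r1, c1 = mbr
    height = max(r1 - r0, 1)  # guard against a one-pixel-tall shape
    width = max(c1 - c0, 1)   # guard against a one-pixel-wide shape
    return (r - r0) / height, (c - c0) / width

image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
mbr = minimum_bounding_rectangle(image)
corner = normalize_to_mbr((2, 1), mbr)
```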

Privacy Protection Model for Location-Based Services

Ni, Lihao;Liu, Yanshen;Liu, Yi
- Journal of Information Processing Systems / v.16 no.1 / pp.96-112 / 2020
Solving the disclosure problem of sensitive information with the k-nearest neighbor query, location dummy technique, or interfering data in location-based services (LBSs) is a new research topic. Although previous studies reduced security threats, they are ineffective in the case of sparse users or K-successive privacy, and their additional calculations degrade the performance of LBS application systems. Therefore, a model is proposed herein, which is based on geohash-encoding technology instead of latitude and longitude, a memcached server cluster, encryption and decryption, and authentication. Simulation results based on PHP and MySQL show that the model offers approximately 10× speedup over the conventional approach. Two problems are solved using the model: sensitive information in LBS applications is not disclosed, and the relationship between an individual and a track is not leaked.
https://doi.org/10.3745/JIPS.04.0163
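
For readers unfamiliar with geohash encoding, which the model above uses in place of raw latitude and longitude, here is a minimal sketch of the standard encoding. It is illustrative only: the function name is hypothetical, and the paper's memcached, encryption, and authentication components are not shown. Each bit halves the longitude or latitude interval in turn, and every five bits become one base-32 character.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        interval, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (interval[0] + interval[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            interval[0] = mid   # keep the upper half
        else:
            bits = bits << 1
            interval[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:      # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

cell = geohash_encode(57.64911, 10.40744, precision=6)
```

Because a shorter geohash is a prefix of a longer one for the same point, truncating the string coarsens the reported location, which is one reason the encoding is convenient for privacy-oriented designs.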

A Study on the Data Fusion Method using Decision Rule for Data Enrichment (의사결정 규칙을 이용한 데이터 통합에 관한 연구)

Kim S.Y.;Chung S.S.
- The Korean Journal of Applied Statistics / v.19 no.2 / pp.291-303 / 2006
Data mining is the task of extracting information from existing data files. One of the most important factors in the data mining process is therefore the quality of the data used. In this thesis, we propose a data fusion technique using decision rules for data enrichment, which is one phase of improving data quality in the KDD process. Simulations were performed to compare the proposed data fusion technique with existing techniques. The results show that our data fusion technique using decision rules yields a low MSE or misclassification rate for the fusion variables.
https://doi.org/10.5351/KJAS.2006.19.2.291

Network Anomaly Detection using Hybrid Feature Selection

Kim Eun-Hye;Kim Se-Hun
- Proceedings of the Korea Institutes of Information Security and Cryptology Conference / 2006.06a / pp.649-653 / 2006
In this paper, we propose a hybrid feature extraction method in which Principal Components Analysis is combined with an optimized k-Means clustering technique. Our approach hierarchically reduces the redundancy of features with high explanatory power in principal components analysis in order to choose a good subset of features critical to improving the performance of classifiers. Based on this result, we evaluate intrusion detection performance using a Support Vector Machine and a nonparametric approach based on k-Nearest Neighbor over data sets with reduced features. Experimental results on the KDD Cup 1999 dataset show several advantages in terms of computational complexity, and our method achieves a significant detection rate, which indicates that attacks can be detected successfully.

A Classification Method Using Data Reduction

Uhm, Daiho;Jun, Sung-Hae;Lee, Seung-Joo
- International Journal of Fuzzy Logic and Intelligent Systems / v.12 no.1 / pp.1-5 / 2012
Data reduction has been used widely in data mining for convenient analysis. Principal component analysis (PCA) and factor analysis (FA) are popular techniques. PCA and FA reduce the number of variables to avoid the curse of dimensionality, in which computing time increases exponentially with the number of variables. For this reason, many methods have been published for dimension reduction. Data augmentation is another approach to analyzing data efficiently. The support vector machine (SVM) algorithm is a representative technique for dimension augmentation: the SVM maps the original data to a high-dimensional feature space to obtain the optimal decision plane. Both data reduction and augmentation have been used to solve diverse problems in data analysis. In this paper, we compare the strengths and weaknesses of dimension reduction and augmentation for classification and propose a classification method using data reduction. We carry out comparative experiments to verify the performance of this research.
https://doi.org/10.5391/IJFIS.2012.12.1.1

Personal Identification Using Teeth Images

Kim Tae-Woo;Cho Tae-Kyung;Park Byoung-Soo;Lee Myung-Wook
- Proceedings of the IEEK Conference / summer / pp.435-437 / 2004
This paper presents a personal identification method using teeth images. The method uses teeth images captured in the anterior and posterior occlusion states together with an LDA-based technique. Teeth images are advantageous for recognition because teeth, being rigid objects, do not deform at the moment of image acquisition. In the experiments, personal identification for 12 people was successful. The results show that our method can contribute to multi-modal authentication systems.

A Method of Fast Track Merging for Multi-Target Tracking under Heavy Clutter Environment (복잡한 환경에서 다중표적추적을 위한 고속 트랙병합 기법)

Lee, Seung-Youn;Yoon, Joo-Hong;Lee, Seok-Jae;Jung, Young-Hun;Choe, Tok-Son
- Journal of the Korea Institute of Military Science and Technology / v.15 no.4 / pp.513-518 / 2012
In this paper, we propose a method of fast track merging, which is the foundation of the track-to-track association technique. The existing method of track merging is performed by comparing every track with every other track and therefore requires heavy calculation time. In our research, we developed a method for fast clustering using nearest neighbor measurement identification. The simulation results show that the proposed method is about 3.3% faster than the previous method. We expect that this method could be used effectively in multi-target tracking, particularly in heavy clutter environments.
https://doi.org/10.9766/KIMST.2012.15.4.513

The Rotational Motion Stabilization Using Simple Estimation of the Rotation Center and Angle

Seok, Ho-Dong;Kim, Do-Jong;Lyou, Joon
- Proceedings of the Institute of Control, Robotics and Systems Conference (제어로봇시스템학회:학술대회논문집) / 2003.10a / pp.231-236 / 2003
This paper presents a simple approach to rotational motion estimation and correction for the roll stabilization of a sight system. The algorithm first computes the rotation center from selected local velocity vectors of related pixels by the least squares method. The rotation angle is then found from a special subset of the motion vectors. Finally, motion correction is performed by the nearest neighbor interpolation technique. To show the performance of the algorithm, it was evaluated on synthetic and real images. The test results show good performance compared with the previous approach.
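
To make the least squares step concrete, the sketch below fits a pure in-plane rotation to a set of pixel velocity vectors. It is a simplification under stated assumptions rather than the paper's procedure: it uses a small-angle model, vx = -θ(y - cy) and vy = θ(x - cx), estimates the angle and center jointly instead of in two stages, and assumes θ is nonzero. The function name and sample data are hypothetical, and the nearest neighbor interpolation used for correction is not shown.

```python
def estimate_rotation(points, velocities):
    """Least squares fit of vx = -theta*(y - cy), vy = theta*(x - cx).
    Returns (theta, (cx, cy)); assumes theta is nonzero."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    mean_vx = sum(v[0] for v in velocities) / n
    mean_vy = sum(v[1] for v in velocities) / n
    num = 0.0
    den = 0.0
    for (x, y), (vx, vy) in zip(points, velocities):
        dx, dy = x - mean_x, y - mean_y
        # Closed-form solution of the normal equation for theta.
        num += dx * (vy - mean_vy) - dy * (vx - mean_vx)
        den += dx * dx + dy * dy
    theta = num / den
    cx = mean_x - mean_vy / theta
    cy = mean_y + mean_vx / theta
    return theta, (cx, cy)

# Synthetic check: velocities generated from a known center and angle.
true_theta, true_center = 0.1, (1.0, 2.0)
points = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 1)]
velocities = [(-true_theta * (y - true_center[1]), true_theta * (x - true_center[0]))
              for x, y in points]
theta, center = estimate_rotation(points, velocities)
```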
