Search | Korea Science

A Hybrid SVM Classifier for Imbalanced Data Sets (불균형 데이터 집합의 분류를 위한 하이브리드 SVM 모델)

Lee, Jae Sik;Kwon, Jong Gu
- Journal of Intelligence and Information Systems, v.19 no.2, pp.125-140, 2013
A data set in which the records of one class far outnumber the records of the other class is called an 'imbalanced data set'. Most classification techniques perform poorly on imbalanced data sets. When evaluating a classification technique, we need to measure not only 'accuracy' but also 'sensitivity' and 'specificity'. In a customer churn prediction problem, 'retention' records form the majority class and 'churn' records form the minority class. Sensitivity measures the proportion of actual retentions that are correctly identified as such; specificity measures the proportion of actual churns that are correctly identified as such. The poor performance of classification techniques on imbalanced data sets is due to low specificity. Many previous studies on imbalanced data sets employed 'oversampling', in which members of the minority class are sampled more than those of the majority class to produce a relatively balanced data set. When a classification model is constructed from this oversampled balanced data set, specificity can be improved but sensitivity decreases. In this research, we developed a hybrid model of support vector machine (SVM), artificial neural network (ANN), and decision tree that improves specificity while maintaining sensitivity. We named this hybrid model the 'hybrid SVM model'.

The hybrid SVM model is constructed and used for prediction as follows. A balanced data set is prepared by oversampling from the original imbalanced data set. The SVM_I and ANN_I models are constructed using the imbalanced data set, and the SVM_B model is constructed using the balanced data set. The SVM_I model is superior in sensitivity and the SVM_B model is superior in specificity. For a record on which the SVM_I and SVM_B models make the same prediction, that prediction becomes the final solution. If they make different predictions, the final solution is determined by discrimination rules obtained from the ANN and a decision tree: for the records on which the SVM_I and SVM_B models disagree, a decision tree model is constructed using the ANN_I output value as input and actual retention or churn as target. We obtained the following two discrimination rules: 'IF ANN_I output value $< 0.285$, THEN Final Solution = Retention' and 'IF ANN_I output value $\geq 0.285$, THEN Final Solution = Churn.' The threshold 0.285 is the value optimized for the data used in this research. The result we present is the structure or framework of our hybrid SVM model, not a specific threshold value such as 0.285. Therefore, the threshold in the above discrimination rules can be changed to suit the data.

To evaluate the performance of our hybrid SVM model, we used the 'churn data set' in the UCI Machine Learning Repository, which consists of 85% retention customers and 15% churn customers. The accuracy of the hybrid SVM model is 91.08%, which is better than that of the SVM_I model or the SVM_B model. The points worth noticing are its sensitivity, 95.02%, and its specificity, 69.24%. The sensitivity of the SVM_I model is 94.65%, and the specificity of the SVM_B model is 67.00%. Therefore, the hybrid SVM model developed in this research improves on the specificity of the SVM_B model while maintaining the sensitivity of the SVM_I model.
https://doi.org/10.13088/jiis.2013.19.2.125
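The combination rule described in the hybrid SVM abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the string labels are assumptions, the three model outputs are taken as given, and 0.285 is only the threshold the abstract reports for its own data set.

```python
from typing import Literal

Label = Literal["retention", "churn"]

# Threshold reported in the abstract for its data set; the authors note
# that it should be re-tuned for other data.
ANN_THRESHOLD = 0.285


def hybrid_predict(svm_i: Label, svm_b: Label, ann_i_output: float,
                   threshold: float = ANN_THRESHOLD) -> Label:
    """Combine the three model outputs for one record (hypothetical helper)."""
    if svm_i == svm_b:
        # Both SVMs agree, so their shared prediction is final.
        return svm_i
    # Disagreement: apply the decision-tree rule on the ANN_I output.
    return "churn" if ann_i_output >= threshold else "retention"


def sensitivity_specificity(actual: list[Label],
                            predicted: list[Label]) -> tuple[float, float]:
    """Sensitivity = share of retentions found; specificity = share of churns found."""
    for_retention = [p for a, p in zip(actual, predicted) if a == "retention"]
    for_churn = [p for a, p in zip(actual, predicted) if a == "churn"]
    sensitivity = sum(p == "retention" for p in for_retention) / len(for_retention)
    specificity = sum(p == "churn" for p in for_churn) / len(for_churn)
    return sensitivity, specificity
```

Note that, following the abstract's convention, the majority class (retention) is treated as the positive class, so specificity here is the recall on churn.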

Financial Fraud Detection using Text Mining Analysis against Municipal Cybercriminality (지자체 사이버 공간 안전을 위한 금융사기 탐지 텍스트 마이닝 방법)

Choi, Sukjae;Lee, Jungwon;Kwon, Ohbyung
- Journal of Intelligence and Information Systems, v.23 no.3, pp.119-138, 2017
Recently, SNS has become an important channel for marketing as well as personal communication. However, cybercrime has also evolved with the development of information and communication technology, and illegal advertising is distributed on SNS in large quantities. As a result, loss of personal information and even monetary damage occur more frequently. In this study, we propose a method to analyze which sentences and documents posted to SNS are related to financial fraud. First, as a conceptual framework, we developed a matrix of conceptual characteristics of cybercriminality on SNS and emergency management. We also suggested an emergency management process consisting of Pre-Cybercriminality (e.g. risk identification) and Post-Cybercriminality steps. Among these, this paper focuses on risk identification. The main process consists of data collection, preprocessing, and analysis. We first selected two words, 'daechul (loan)' and 'sachae (private loan)', as seed words and collected data containing these words from SNS such as Twitter. The collected data were given to two researchers, who decided whether each item was related to cybercriminality, particularly financial fraud. We then selected keywords from the vocabulary of these items, restricted to nominals and symbols. With the selected keywords, we searched web materials such as Twitter, news, and blogs, and collected more than 820,000 articles.

The collected articles were refined through preprocessing and made into learning data. Preprocessing is divided into three steps: morphological analysis, stop-word removal, and selection of valid parts of speech. In the morphological analysis step, a complex sentence is split into morpheme units to enable mechanical analysis. In the stop-word removal step, non-lexical elements such as numbers, punctuation marks, and double spaces are removed from the text. In the part-of-speech selection step, only two kinds, nouns and symbols, are retained. Since nouns refer to things, they express the intent of a message better than other parts of speech. Moreover, the more illegal the text is, the more frequently symbols are used. To turn the selected data into learning data, each item must be classified as legitimate or not, so each item is labeled 'legal' or 'illegal'. The processed data are then converted into a corpus and a document-term matrix. Finally, the 'legal' and 'illegal' files were mixed and randomly divided into a learning data set and a test data set; in this study, 70% of the data were used for learning and 30% for testing. SVM was used as the discrimination algorithm. Since SVM requires gamma and cost values as its main parameters, we set gamma to 0.5 and cost to 10, based on the optimal value function. This cost is higher than in general cases. To show the feasibility of the proposed idea, we compared the proposed method with MLE (Maximum Likelihood Estimation), Term Frequency, and a Collective Intelligence method. Overall accuracy was used as the metric. The overall accuracy of the proposed method was 92.41% for illegal loan advertisements and 77.75% for illegal visit sales, which is clearly superior to that of Term Frequency, MLE, and the other methods. Hence, the result suggests that the proposed method is valid and practically usable. In this paper, we propose a framework for managing crises caused by abnormalities in unstructured data sources such as SNS. We hope this study will contribute to academia by identifying what to consider when applying SVM-like discrimination algorithms to text analysis. The study should also be useful to practitioners in the fields of brand management and opinion mining.
https://doi.org/10.13088/jiis.2017.23.3.119
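Two of the steps in the financial fraud abstract above, building a document-term matrix and making the 70/30 random split, can be sketched in plain Python. This is an illustrative sketch only: it assumes the documents are already tokenized into nouns and symbols, the function names are hypothetical, and the SVM training itself (gamma 0.5, cost 10 in the abstract) is not shown.

```python
import random
from collections import Counter


def document_term_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Build a document-term matrix from already-tokenized documents."""
    # Columns are the sorted vocabulary; each row holds one document's counts.
    vocab = sorted({token for doc in docs for token in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def train_test_split(items: list, train_ratio: float = 0.7,
                     seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of the items and split it by train_ratio."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Each row of the matrix, paired with its 'legal' or 'illegal' label, would then be one training example for the classifier.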

Evaluating the Land Surface Characterization of High-Resolution Middle-Infrared Data for Day and Night Time (고해상도 중적외선 영상자료의 주야간 지표면 식별 특성 평가)

Baek, Seung-Gyun;Jang, Dong-Ho
- Journal of the Korean Association of Geographic Information Studies, v.15 no.2, pp.113-125, 2012
This research evaluates the land surface characterization of KOMPSAT-3A middle infrared (MIR) data. Airborne Hyperspectral Scanner (AHS) data, which have MIR bands with high spatial resolution, were used to assess the land surface temperature (LST) retrieval and classification accuracy of MIR bands. First, LST values for daytime and nighttime, calculated from the AHS thermal infrared (TIR) bands, were compared with the digital numbers of the AHS MIR bands. The coefficient of determination of AHS band 68 (center wavelength $4.64\,\mu\mathrm{m}$) was over 0.74, higher than that of the other MIR bands. Second, land cover maps were generated by unsupervised classification of the AHS MIR bands. In the daytime land cover maps, classes such as water, trees, green grass, roads, and roofs were distinguished well. In the nighttime maps, however, some classes, such as trees versus green grass and roads versus roofs, were not separated. To enhance the land surface discrimination ability of AHS MIR imagery, image classification was then performed on the difference images between the daytime and nighttime AHS MIR bands. The classification accuracy of the resulting land cover maps for zone 1 and zone 2 was 67.5% and 64.3%, respectively, an improvement of 10% over the land cover maps from the daytime or nighttime AHS MIR bands alone. Consequently, a new algorithm based on land surface characteristics is required for temperature retrieval from high-resolution MIR imagery, and difference images between daytime and nighttime can be considered a way to enhance land surface characterization with high-resolution MIR data.
https://doi.org/10.11108/kagis.2012.15.2.113
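The two quantities the MIR abstract above relies on, a coefficient of determination between LST and MIR digital numbers and a day-night difference image, can be sketched as below. The sketch assumes a simple least-squares line between the two variables (the abstract does not state the regression form), and the function names are hypothetical.

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """Coefficient of determination of the least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # For a single predictor, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def difference_image(day: list[list[float]],
                     night: list[list[float]]) -> list[list[float]]:
    """Pixel-wise day-minus-night difference of two equally sized bands."""
    return [[d - n for d, n in zip(day_row, night_row)]
            for day_row, night_row in zip(day, night)]
```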

Feature Extraction and Classification of Multi-temporal SAR Data Using 3D Wavelet Transform (3차원 웨이블렛 변환을 이용한 다중시기 SAR 영상의 특징 추출 및 분류)

Yoo, Hee Young;Park, No-Wook;Hong, Sukyoung;Lee, Kyungdo;Kim, Yihyun
- Korean Journal of Remote Sensing, v.29 no.5, pp.569-579, 2013
In this study, land-cover classification was performed using features extracted from multi-temporal SAR data through a 3D wavelet transform, and the applicability of the 3D wavelet transform as a feature extraction approach was evaluated. The feature extraction stage based on the 3D wavelet transform was carried out first, and the extracted features were used as input for land-cover classification. For comparison, the original image data without feature extraction and Principal Component Analysis (PCA) based features were also classified. Multi-temporal Radarsat-1 data acquired at Dangjin, Korea, were used for this experiment, and five land-cover classes were considered: paddy fields, dry fields, forest, water, and built-up areas. According to the discrimination capability analysis, the characteristics of dry field and forest were similar, so it was very difficult to distinguish these two classes. With wavelet-based features, classification accuracy generally improved, except for the built-up class. In particular, accuracy improved for the dry field and forest classes. This improvement may be attributed to the wavelet transform decomposing the multi-temporal data not only temporally but also spatially. The experimental results show that the 3D wavelet transform can be an effective tool for feature extraction from multi-temporal data, although the procedure should be tested on other sensors and other areas through extensive experiments.
https://doi.org/10.7780/kjrs.2013.29.5.12
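To make concrete what "decomposing not only temporally but also spatially" means in the SAR abstract above, here is a one-level 3D wavelet decomposition of a small image stack indexed as cube[time][row][col]. The Haar wavelet with simple averaging is an assumption chosen for brevity; the abstract does not name the wavelet used, and all function names are hypothetical.

```python
Cube = list[list[list[float]]]


def haar_axis(cube: Cube, axis: int) -> tuple[Cube, Cube]:
    """One Haar step along one axis: pairwise averages (low) and half-differences (high).

    The size along `axis` must be even.
    """
    shape = [len(cube), len(cube[0]), len(cube[0][0])]
    shape[axis] //= 2
    low = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    high = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for t in range(shape[0]):
        for r in range(shape[1]):
            for c in range(shape[2]):
                first = [t, r, c]
                first[axis] *= 2          # even-indexed sample of the pair
                second = first[:]
                second[axis] += 1         # odd-indexed sample of the pair
                a = cube[first[0]][first[1]][first[2]]
                b = cube[second[0]][second[1]][second[2]]
                low[t][r][c] = (a + b) / 2
                high[t][r][c] = (a - b) / 2
    return low, high


def haar_3d(cube: Cube) -> dict[str, Cube]:
    """Return the 8 sub-bands, named by (time, row, col) filter, e.g. 'HLL'."""
    bands = {"": cube}
    for axis in range(3):
        split = {}
        for name, data in bands.items():
            low, high = haar_axis(data, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands
```

In this naming, 'LLL' is the temporally and spatially smoothed stack, while a band such as 'HLL' captures change over time averaged over a spatial neighborhood; either could serve as a per-pixel feature for a classifier.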

Rapid and Precise Determination of Pb Isotope Ratios Using Multi-Collector ICP/MS (다검출기 유도결합 플라즈마 질량분석기를 이용한 신속하고 정밀한 Pb 동위원소 분석)

최만식;정창식;신형선;임태선
- The Journal of the Petrological Society of Korea, v.10 no.3, pp.157-171, 2001
This study investigated the effects of Pb/Tl ratio, Pb concentration, and concomitant matrix elements on the measurement of Pb isotope ratios using multi-collector ICP/MS (AXIOM MC model). The accuracy and reproducibility of Pb isotope ratios in NBS 981 solution were estimated from 42 measurements made from March to August 2001. Pb isotopes measured in rocks, bronzes, and sediments were compared with data measured by TIMS. For 200 ng of Pb in NBS 981 solution, reproducibilities for the $^{206}\mathrm{Pb}/^{204}\mathrm{Pb}$, $^{207}\mathrm{Pb}/^{204}\mathrm{Pb}$, and $^{208}\mathrm{Pb}/^{204}\mathrm{Pb}$ ratios were about 500 ppm (2sd), and those for $^{207}\mathrm{Pb}/^{206}\mathrm{Pb}$ and $^{208}\mathrm{Pb}/^{206}\mathrm{Pb}$ were 100~200 ppm. The optimum conditions for the analysis of Pb isotope ratios with the AXIOM MC, for best accuracy and reproducibility, were defined as follows: 1) a Pb/Tl ratio of about 10; 2) a Pb concentration of about 100 ng/ml; 3) correction for mass discrimination by the exponential law, using 2.3887 for $^{205}\mathrm{Tl}/^{203}\mathrm{Tl}$ and a Pb mass fractionation factor obtained empirically from the relationship between $\ln(^{208}\mathrm{Pb}/^{206}\mathrm{Pb})$ and $\ln(^{205}\mathrm{Tl}/^{203}\mathrm{Tl})$. The data measured with MC/ICP/MS for acid-digested and chemically separated rock samples, acid-digested bronze samples, and sediment samples agree with those of TIMS within analytical errors. Therefore, MC/ICP/MS is a rapid analytical technique for Pb isotope ratios with precision similar to that of TIMS.
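The exponential-law correction mentioned in condition 3) can be sketched as below. This is a simplified illustration with several assumptions: the atomic masses are rounded literature values, the sign convention (true ratio = measured ratio times the mass ratio raised to the factor) is one common form, and the Tl-derived factor is applied directly to Pb, whereas the abstract describes a Pb factor obtained empirically from the Pb-Tl relationship. Function names are hypothetical.

```python
import math

# Atomic masses in unified atomic mass units (rounded literature values).
MASS = {"203Tl": 202.9723, "205Tl": 204.9744,
        "206Pb": 205.9745, "208Pb": 207.9767}

TL_REFERENCE = 2.3887  # 205Tl/203Tl value quoted in the abstract


def fractionation_factor(tl_measured: float) -> float:
    """Exponent that maps the measured 205Tl/203Tl onto the reference value."""
    return (math.log(TL_REFERENCE / tl_measured)
            / math.log(MASS["205Tl"] / MASS["203Tl"]))


def correct_ratio(measured: float, heavy: str, light: str, factor: float) -> float:
    """Exponential law: corrected = measured * (m_heavy / m_light) ** factor."""
    return measured * (MASS[heavy] / MASS[light]) ** factor
```

For example, `correct_ratio(r, "208Pb", "206Pb", fractionation_factor(tl))` would correct a measured $^{208}\mathrm{Pb}/^{206}\mathrm{Pb}$ value `r` given the $^{205}\mathrm{Tl}/^{203}\mathrm{Tl}$ value `tl` measured in the same run.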

EEG Signal Classification Algorithm based on DWT and SVM for Driving Robot Control (주행로봇제어를 위한 DWT와 SVM기반의 EEG신호 분류 알고리즘)

Lee, Kibae;Lee, Chong Hyun;Bae, Jinho;Lee, Jaeil
- Journal of the Institute of Electronics and Information Engineers, v.52 no.8, pp.117-125, 2015
In this paper, we propose an algorithm that classifies acquired EEG (electroencephalogram) signals to control the 'left' and 'right' turns of a driving system composed of an EEG sensor, LabVIEW, DAQ, MATLAB, and a driving robot. The proposed algorithm uses features extracted from frequency band information obtained by DWT (Discrete Wavelet Transform) and selects highly discriminative features using the Fisher score. We also determine the number of feature vectors that gives the best classification performance with an SVM (Support Vector Machine) classifier, and propose a decision pending algorithm based on MLD (Maximum Likelihood Decision) to prevent malfunction due to misclassification. The four feature vectors selected for the proposed algorithm are the mean absolute voltage and the standard deviation of the d5 (2-4 Hz) and d2 (16-32 Hz) frequency bands of the P8 channel, as defined by the international standard electrode placement method. Using the SVM classifier, we obtained 98.75% accuracy and a 1.25% error rate. When we specified an error probability of 70% for decision pending, the proposed decision pending algorithm yielded 95.63% accuracy and a 0% error rate.
https://doi.org/10.5573/ieie.2015.52.8.117
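The Fisher-score feature selection mentioned in the EEG abstract above can be illustrated with the usual two-class form of the score: the squared gap between class means divided by the sum of the class variances. Whether the authors used exactly this variant is an assumption, and the function and feature names are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(class_a: list[float], class_b: list[float]) -> float:
    """Two-class Fisher score of one feature: (mean gap)^2 / (sum of variances).

    Assumes at least one class has non-zero variance.
    """
    gap = mean(class_a) - mean(class_b)
    return gap * gap / (pvariance(class_a) + pvariance(class_b))


def top_features(features: dict[str, tuple[list[float], list[float]]],
                 k: int) -> list[str]:
    """Names of the k features with the highest Fisher score."""
    ranked = sorted(features,
                    key=lambda name: fisher_score(*features[name]),
                    reverse=True)
    return ranked[:k]
```

Here each dictionary entry would hold one candidate feature (for instance the standard deviation of a wavelet band) measured over 'left' trials and over 'right' trials; the highest-scoring features are the ones passed to the classifier.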

Comparative Analysis of Anomaly Detection Models using AE and Suggestion of Criteria for Determining Outliers

Kang, Gun-Ha;Sohn, Jung-Mo;Sim, Gun-Wu
- Journal of the Korea Society of Computer and Information, v.26 no.8, pp.23-30, 2021
In this study, we present a comparative analysis of major autoencoder (AE)-based anomaly detection methods for quality determination in the manufacturing process, together with a new anomaly discrimination criterion. At manufacturing sites, anomalous instances are few and their types vary greatly. These properties degrade the performance of AI-based anomaly detection models trained on both normal and anomalous data, and make it costly and time-consuming to obtain additional data for performance improvement. To solve this problem, AE-based models such as AE and VAE, which perform anomaly detection using only normal data, are being studied. In this work, based on Convolutional AE, VAE, and Dilated VAE models, statistics on residual images, MSE, and information entropy were selected as outlier discrimination criteria to compare and analyze the performance of each model. In particular, the range value applied to the Convolutional AE model showed the best performance, with an AUC PRC of 0.9570, an F1 score of 0.8812, an AUC ROC of 0.9548, and an accuracy of 87.60%. This is an accuracy improvement of about 20 percentage points over MSE, which has frequently been used as the standard for determining outliers, and confirms that model performance can be improved by the choice of outlier criterion.
https://doi.org/10.9708/jksci.2021.26.08.023
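The difference between an MSE criterion and a range criterion on the residual image, as compared in the anomaly detection abstract above, can be sketched as follows. The sketch treats images as flat lists of pixel values and interprets the "range value" as the maximum minus the minimum of the absolute residual; that interpretation, the function names, and the thresholding step are assumptions, and the autoencoder itself is not shown.

```python
def residual(image: list[float], reconstruction: list[float]) -> list[float]:
    """Absolute per-pixel difference between an input and its reconstruction."""
    return [abs(p - q) for p, q in zip(image, reconstruction)]


def mse_score(image: list[float], reconstruction: list[float]) -> float:
    """Mean squared reconstruction error over all pixels."""
    res = residual(image, reconstruction)
    return sum(r * r for r in res) / len(res)


def range_score(image: list[float], reconstruction: list[float]) -> float:
    """Spread of the residual image: largest minus smallest pixel error."""
    res = residual(image, reconstruction)
    return max(res) - min(res)


def is_anomalous(score: float, threshold: float) -> bool:
    """Flag a sample whose score exceeds a threshold tuned on normal data."""
    return score > threshold
```

A small, localized defect changes only a few pixels, so it is diluted in the mean used by MSE but still produces a large range, which is one plausible reason a range-type statistic could separate such cases better.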

Qualitative and Quantitative Magnetic Resonance Imaging Phenotypes May Predict CDKN2A/B Homozygous Deletion Status in Isocitrate Dehydrogenase-Mutant Astrocytomas: A Multicenter Study

Yae Won Park;Ki Sung Park;Ji Eun Park;Sung Soo Ahn;Inho Park;Ho Sung Kim;Jong Hee Chang;Seung-Koo Lee;Se Hoon Kim
- Korean Journal of Radiology, v.24 no.2, pp.133-144, 2023
Objective: Cyclin-dependent kinase inhibitor (CDKN)2A/B homozygous deletion is a key molecular marker of isocitrate dehydrogenase (IDH)-mutant astrocytomas in the 2021 World Health Organization classification. We aimed to investigate whether qualitative and quantitative MRI parameters can predict CDKN2A/B homozygous deletion status in IDH-mutant astrocytomas. Materials and Methods: Preoperative MRI data of 88 patients (mean age ± standard deviation, 42.0 ± 11.9 years; 40 females and 48 males) with IDH-mutant astrocytomas (76 without and 12 with CDKN2A/B homozygous deletion) from two institutions were included. A qualitative imaging assessment was performed. Mean apparent diffusion coefficient (ADC), 5th percentile of ADC, mean normalized cerebral blood volume (nCBV), and 95th percentile of nCBV were assessed via automatic tumor segmentation. Logistic regression was performed to determine the factors associated with CDKN2A/B homozygous deletion in all 88 patients and in a subgroup of 47 patients with histological grades 3 and 4. The discrimination performance of the logistic regression models was evaluated using the area under the receiver operating characteristic curve (AUC). Results: In multivariable analysis of all patients, infiltrative pattern (odds ratio [OR] = 4.25, p = 0.034), maximal diameter (OR = 1.07, p = 0.013), and 95th percentile of nCBV (OR = 1.34, p = 0.049) were independent predictors of CDKN2A/B homozygous deletion. The AUC, accuracy, sensitivity, and specificity of the corresponding model were 0.83 (95% confidence interval [CI], 0.72-0.91), 90.4%, 83.3%, and 75.0%, respectively. In multivariable analysis of the subgroup with histological grades 3 and 4, infiltrative pattern (OR = 10.39, p = 0.012) and 95th percentile of nCBV (OR = 1.24, p = 0.047) were independent predictors of CDKN2A/B homozygous deletion; the AUC, accuracy, sensitivity, and specificity of the corresponding model were 0.76 (95% CI, 0.60-0.88), 87.8%, 80.0%, and 58.1%, respectively.
Conclusion: The presence of an infiltrative pattern, a larger maximal diameter, and a higher 95th percentile of nCBV may be useful MRI biomarkers for CDKN2A/B homozygous deletion in IDH-mutant astrocytomas.
https://doi.org/10.3348/kjr.2022.0732
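The AUC used in the radiology abstract above to judge discrimination has a simple interpretation that a short sketch can make concrete: it is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The function name and the example scores in the usage note are hypothetical, and this pairwise form is only one of several equivalent ways to compute the quantity.

```python
def auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

With predicted probabilities `[0.9, 0.3]` for two patients with the deletion and `[0.5, 0.1]` for two without, three of the four pairs are ordered correctly, so `auc` returns 0.75.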

Verifying the Classification Accuracy for Korea's Standardized Classification System of Research F&E by using LDA(Linear Discriminant Analysis) (선형판별분석(LDA)기법을 적용한 국가연구시설장비 표준분류체계의 분류 정확도 검증)

Joung, Seokin;Sawng, Yeongwha;Jeong, Euhduck
- Management & Information Systems Review, v.39 no.1, pp.35-57, 2020
Recently, research F&E (facilities and equipment) have become very important as tools and means for leading the development of science and technology. The government has continuously expanded investment budgets for R&D and research F&E, and the need for efficient operation and systematic management of the research F&E built up nationwide has increased. In December 2010, the government developed and completed a standardized classification system for national research F&E. However, the accuracy and reliability of the classification information are in doubt, because the information is collected by having a user (researcher) directly select and register a classification code in NTIS (National Science & Technology Information Service). Therefore, in this study, we used linear discriminant analysis (LDA) and analysis of variance (ANOVA) to measure the classification accuracy of the standardized classification system (8 major classes, 54 sub-classes, 410 small classes) of national research facilities and equipment, established in 2010 and revised in 2015. For the analysis, we used the data (50,271 cases) cumulatively registered in NTIS over the past 10 years. This is the first case of scientifically verifying the standardized classification system of national research facilities and equipment, which is based on information from similar classification systems and a few expert reviews at home and abroad. As a result, the discriminant accuracy of the major classes, which are organized hierarchically into sub-classes and small classes, was 92.2%, which is very high. However, in post hoc verification through analysis of variance, the discrimination power of two of the eight major classes was rather low. It is expected that the standardized classification system of national research facilities and equipment will be improved through this study.
https://doi.org/10.29214/damis.2020.39.1.003

Identification of New, Old and Mixed Brown Rice using Freshness and an Electronic Eye (신선도와 전자눈을 이용한 현미 신곡, 구곡 및 혼합곡의 판별)

Hong, Jee-Hwa;Park, Young-Jun;Kim, Hyun-Tae;Oh, Sang Kyun
- Korean Journal of Crop Science, v.63 no.2, pp.98-105, 2018
The sale of brown rice batches composed of rice produced in different years is prohibited in Korea. Thus, new methods for identifying the year of production are critical for maintaining the distribution of high-quality brown rice. Here, we describe the use of an enzyme that can discriminate between freshly harvested and one-year-old brown rice. The degree of enzyme activity was visualized through a freshness test with guaiacol, oxydol, and p-phenylenediamine reagents. With electronic eye equipment, we selected 29 color codes for distinguishing new brown rice from old brown rice. The discrimination power of the selected color codes ranged from 0.263 to 0.922, with an average of 0.62. New and old brown rice were identified with 100% accuracy in both principal component analysis (PCA) and discriminant function analysis (DFA), and DFA had greater discriminatory power than PCA. A verification test using new brown rice, old brown rice, or a mixture of the two was then performed to validate the method. The identification accuracy for new and old brown rice was 100% in both cases, whereas mixed brown rice samples were correctly classified at a rate of 96.9%. Additionally, to test whether the discriminant constructed in winter can be applied to samples collected in summer, new and old brown rice stored for 8 months were collected and tested. Both the new and the old brown rice collected in summer were classified as old brown rice, giving an identification accuracy of 50%. We attribute these observations to changes in enzyme content over time, and therefore conclude that discriminants specific to distinct storage periods will need to be developed in the near future.
https://doi.org/10.7740/kjcs.2018.63.2.098
