• Title/Summary/Keyword: Unbalanced Classification

Improved prediction of soil liquefaction susceptibility using ensemble learning algorithms

  • Satyam Tiwari;Sarat K. Das;Madhumita Mohanty;Prakhar
    • Geomechanics and Engineering / v.37 no.5 / pp.475-498 / 2024
  • Predicting the susceptibility of soil to liquefaction from a limited set of parameters is a challenging problem, particularly when dealing with highly unbalanced databases. The current study focuses on different ensemble learning classification algorithms applied to highly unbalanced databases of results from in-situ tests: the standard penetration test (SPT), the shear wave velocity (Vs) test, and the cone penetration test (CPT). The input parameters for these datasets consist of earthquake intensity parameters, strong ground motion parameters, and in-situ soil testing parameters, with the liquefaction index serving as the binary output parameter. After a rigorous comparison with existing literature, extreme gradient boosting (XGBoost), bagging, and random forest (RF) emerge as the most efficient models for liquefaction instance classification across the different datasets. Notably, for SPT- and Vs-based models, XGBoost exhibits superior performance, followed by light gradient boosting machine (LightGBM) and bagging, while for CPT-based models, bagging ranks highest, followed by gradient boosting and random forest. CPT-based models demonstrate lower Gmean(error), rendering them preferable for soil liquefaction susceptibility prediction. Key parameters influencing model performance include the internal friction angle of soil (ϕ) and the percentage of fines less than 75 µ (F75) for SPT and Vs data, and the normalized average cone tip resistance (qc) and peak horizontal ground acceleration (amax) for CPT data. It was also observed that adding Vs measurements to SPT data increased the efficiency of the prediction compared with SPT data alone. Furthermore, to enhance usability, a graphical user interface (GUI) for seamless classification operations based on provided input parameters was proposed.
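
The abstract above ranks models by a G-mean-based error but does not define it. As a minimal sketch, assuming the usual definition of G-mean as the geometric mean of sensitivity and specificity (the function name `g_mean` and the toy labels are illustrative, not taken from the paper), the following shows why such a metric is more informative than accuracy on unbalanced binary data:

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical unbalanced sample: 2 positive cases, 8 negative cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, g_mean(y_true, always_negative))
```

A classifier that always predicts the majority class reaches high accuracy on this sample yet scores a G-mean of zero, because it never detects the minority class.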

A comparative study of feature screening methods for ultrahigh dimensional multiclass classification (초고차원 다범주분류를 위한 변수선별 방법 비교 연구)

  • Lee, Kyungeun;Kim, Kyoung Hee;Shin, Seung Jun
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.793-808 / 2017
  • We compare various variable screening methods for multiclass classification problems when the data are ultrahigh-dimensional. Two different approaches were considered: (1) pairwise extension from binary classification via one-versus-one or one-versus-rest comparisons and (2) direct classification of multiclass responses. We conducted extensive simulation studies under different conditions: heavy-tailed explanatory variables, correlated signal and noise variables, correlated joint distributions but uncorrelated marginals, and unbalanced response variables. We then analyzed real data to examine the performance of the methods. The results showed that model-free methods perform better for multiclass classification problems as well as binary ones.

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.3 / pp.591-609 / 2024
  • This study examined all combinations of preprocessing methods in terms of their effects on reducing the number of words, shortening processing time, and classification success, using balanced datasets and datasets imbalanced at different ratios. The reductions in word count and processing time achieved by preprocessing were interrelated. Classification was more successful on the Turkish datasets, and the English datasets were more strongly affected by whether the dataset was balanced. In highly imbalanced datasets, documents from classes with few documents were misclassified into the topically closest class in the Turkish datasets and into the class with many documents in the English datasets. In terms of average scores, the best classification was obtained in the Turkish datasets by not applying lowercasing, applying stemming, and removing stop words, and in the English datasets by applying lowercasing and stemming and removing stop words. Stemming was the preprocessing method that most increased success in the Turkish datasets, whereas stop-word removal did so in the English datasets. The maximum scores revealed that feature selection, feature size, and the classifier affect classification success more than preprocessing does. The study concluded that preprocessing is necessary for text classification because it shortens processing time and can achieve high classification success, that a given preprocessing method does not have the same effect in all languages, and that different preprocessing methods are more successful for different languages.

A Study on the Language and Literature Division of the Dewey Decimal Classification (DDC의 어문학구분에 관한 연구)

  • Nam Tae Woo
    • Journal of the Korean Society for Library and Information Science / v.21 / pp.1-60 / 1991
  • This study discusses two divisions (language and literature) in the schedules of the DDC and suggests how these divisions can be adapted for minor or Oriental countries. Despite continuous study and revision by experts, the frameworks of these schemes remain unchanged; only their subdivisions have been developed and detailed further to reflect developments in the academic world. Among the subdivisions of the DDC, those of language and literature are especially seriously unbalanced. The two divisions give too much attention to Western languages and literatures, including English, German, and French, while the languages and literatures of other nations are treated relatively lightly. This creates more problems for Oriental and minor nations, so the libraries of these nations should modify the schedules and develop subdivisions with local emphasis. With these problems in mind, the study clarifies the historical changes in the language and literature divisions of the DDC and examines the problems arising from the unbalanced allocation of classes.

An Enhanced Feature Selection Method Based on the Impurity of Words Considering Unbalanced Distribution of Documents (문서의 불균등 분포를 고려한 단어 불순도 기반 특징 선택 방법)

  • Kang, Jin-Beom;Yang, Jae-Young;Choi, Joong-Min
    • Journal of KIISE: Software and Applications / v.34 no.9 / pp.804-816 / 2007
  • Sample training data for machine learning often contain irrelevant information or redundant concepts, and the original data may also include noise. If the information collected for constructing a learning model is not reliable, it is difficult to obtain accurate information. The system therefore attempts to find relations or rules between features and categories in the learning phase. Feature selection removes irrelevant or redundant information before the learning model is constructed in order to improve its performance. Existing feature selection methods assume that the distribution of documents is balanced in terms of the number of documents in each class and the length of each document. In practice, however, it is difficult not only to prepare a set of documents of almost equal length, but also to define a number of classes with a fixed number of documents. In this paper, we propose a new feature selection method that considers the impurities among words and the unbalanced distribution of documents across categories. We obtain feature candidates using the word impurity and then select the final features according to the unbalanced distribution of documents. Experiments demonstrate that our method performs better than existing methods.

Application of Multiple Parks Vector Approach for Detection of Multiple Faults in Induction Motors

  • Vilhekar, Tushar G.;Ballal, Makarand S.;Suryawanshi, Hiralal M.
    • Journal of Power Electronics / v.17 no.4 / pp.972-982 / 2017
  • The Park's vector of stator current is a popular technique for the detection of induction motor faults. While detecting a faulty condition with the Park's vector technique is easy, classifying the different types of faults is intricate. This problem is overcome by the Multiple Park's Vector (MPV) approach proposed in this paper. In this technique, the characteristic fault frequency components (CFFCs) of stator winding faults, rotor winding faults, unbalanced voltage, and bearing faults are extracted from the three-phase stator currents. Due to constructional asymmetry, these characteristic fault frequency components are unbalanced even under the healthy condition. To balance them, a correction factor is added to the characteristic fault frequency components of the three-phase stator currents. As a result, the Park's vector pattern under the healthy condition is circular in shape, and this pattern is taken as the reference pattern for the healthy condition. Under a fault condition, the amplitude and phase of the characteristic fault frequency components change, and thus the pattern of the Park's vector changes. By monitoring the variation in the multiple Park's vector patterns, the type of fault and its severity level are identified. In the proposed technique, the diagnosis of faults is immune to the effects of unbalanced voltage and multiple faults. The technique is verified on a 7.5 hp three-phase wound rotor induction motor (WRIM), and the experimental analysis is verified by simulation results.
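
The circular reference pattern mentioned above can be illustrated with the standard Park's vector transformation of three-phase currents. This is only a sketch of the classical transformation, not the paper's MPV method: the CFFC extraction and the correction factor are not reproduced, and the helper names `balanced_currents` and `pattern_radii` are hypothetical.

```python
import math


def parks_vector(i_a, i_b, i_c):
    """Classical Park's vector components (i_d, i_q) of three phase currents."""
    i_d = math.sqrt(2 / 3) * i_a - i_b / math.sqrt(6) - i_c / math.sqrt(6)
    i_q = (i_b - i_c) / math.sqrt(2)
    return i_d, i_q


def balanced_currents(amplitude, theta):
    """Ideal balanced three-phase currents at electrical angle theta (radians)."""
    return (
        amplitude * math.cos(theta),
        amplitude * math.cos(theta - 2 * math.pi / 3),
        amplitude * math.cos(theta + 2 * math.pi / 3),
    )


def pattern_radii(amplitude, samples=12):
    """Distance of the Park's vector from the origin over one electrical cycle."""
    radii = []
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        i_d, i_q = parks_vector(*balanced_currents(amplitude, theta))
        radii.append(math.hypot(i_d, i_q))
    return radii


# For balanced currents every radius equals (sqrt(6) / 2) * amplitude,
# so the (i_d, i_q) locus is a circle centred on the origin.
radii = pattern_radii(amplitude=2.0)
print(min(radii), max(radii))
```

Any imbalance among the three phase currents makes the radius vary over the cycle, which is why a deviation from the circular pattern can serve as a fault indicator.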

An Efficient Wireless Signal Classification Based on Data Augmentation (데이터 증강 기반 효율적인 무선 신호 분류 연구)

  • Sangsoon Lim
    • Journal of Platform Technology / v.10 no.4 / pp.47-55 / 2022
  • The number of diverse devices using different wireless technologies in the IoT environment has recently been increasing. To accurately identify the various radio signal modulation techniques, it is essential to design an efficient feature extraction approach and detect the exact types of radio signals. However, it is difficult to gather labeled wireless signals in a real environment due to the complexity of the process. In addition, various deep learning techniques have been proposed for wireless signal classification. With deep learning, an insufficient training dataset frequently leads to overfitting, which degrades the performance of wireless signal classification techniques that use deep learning models. In this paper, we propose a data augmentation technique based on a generative adversarial network (GAN) to improve classification performance when various wireless signals exist. When there are various types of wireless signals to be classified and the amount of data representing a specific radio signal is small or unbalanced, the proposed solution is used to increase the amount of data for the required wireless signal. To verify the validity of the proposed data augmentation algorithm, we generated additional data for the specific wireless signal and implemented a CNN- and LSTM-based wireless signal classifier on the balanced data. The experimental results show that the classification accuracy of the proposed solution is higher than when the data are unbalanced.

A Study of the 780 Music of DDC (DDC에 있어서의 음악분야 분류상의 제문제)

  • Hahn Kyung-Shin
    • Journal of the Korean Society for Library and Information Science / v.26 / pp.75-112 / 1994
  • The purpose of this study is to investigate the problems concerning the 780 music division of the DDC. Its particular object is the arrangement of 780 music in the 20th edition of the DDC, which is a complete revision. The results are summarized as follows: 1. Although music is an important subject in the humanities, especially in the arts, it is classified as only one division (780), not a class. 2. The arrangement of 780 music is severely oriented toward Western music theory, vocal music, and instrumental music. 3. Classification numbers in 780 music become longer because of the limitations of decimal notation. 4. The 780 music division of the DDC neglects music theory and emphasizes music practice, especially performance. 5. The assignment of classification numbers is unbalanced, especially between theory and practice, between composition and performance, and among the subsections of vocal and instrumental music. 6. Many important subjects are omitted from the DDC music schedule, for example, musicology and its branches, composition, and the traditional instruments of many countries. 7. The use of terminology is often improper and inconsistent.

Nutritional status of children with cerebral palsy according to their body mass index percentile classification

  • Ahmed, Kainat;Kim, Hyo-Jung;Han, Kyungim;Yim, Jung-Eun
    • Journal of Nutrition and Health / v.54 no.5 / pp.474-488 / 2021
  • Purpose: Malnutrition in children with cerebral palsy (CP) is a significant factor affecting their adequate growth and development. This study aimed to survey and evaluate the dietary intake of children with CP according to their BMI classification and thereby highlight the dietary factors affecting the nutritional status of these children. Methods: A total of 16 children between the ages of four and twelve were enrolled. These subjects were classified into three groups, namely underweight, normal, and obese, with 6, 8, and 2 children in each group, respectively. The general characteristics, motor disturbances, body composition, feeding problems, eating habits, nutritional intake, dietary variety, and food frequency of the children with CP were evaluated. Results: Motor disturbances tended to increase in underweight children with CP. A significant decrease (p < 0.05) in disturbances related to oral feeding was observed with an increase in obesity. The pattern of eating habits revealed that subjects in the underweight group consumed unbalanced meals, while those in the obese group tended to consume larger meals at a faster pace. The feeding disturbance data revealed that those in the underweight group could not prepare their meals, while the obese group had the problems of overeating and consuming an unbalanced diet (p < 0.05). Conclusion: It is necessary for both children with CP who have a high degree of disability and their caregivers to take lessons on adequate nutrient intake to prevent malnutrition. Moreover, it is necessary for caregivers and children with CP who have a low degree of disability to take lessons on providing and consuming a balanced diet and to focus on the intake of sufficient calcium in order to prevent obesity.

Prediction of Diabetic Nephropathy from Diabetes Dataset Using Feature Selection Methods and SVM Learning (특징점 선택방법과 SVM 학습법을 이용한 당뇨병 데이터에서의 당뇨병성 신장합병증의 예측)

  • Cho, Baek-Hwan;Lee, Jong-Shill;Chee, Young-Joan;Kim, Kwang-Won;Kim, In-Young;Kim, Sun-I.
    • Journal of Biomedical Engineering Research / v.28 no.3 / pp.355-362 / 2007
  • Diabetes mellitus can cause devastating complications, which often result in disability and death, and diabetic nephropathy is a leading cause of death in people with diabetes. In this study, we tried to predict the onset of diabetic nephropathy from an irregular and unbalanced diabetic dataset. We collected clinical data from 292 patients with type 2 diabetes and performed preprocessing to extract 184 features and resolve the irregularity of the dataset. We compared several feature selection methods, such as ReliefF and sensitivity analysis, to remove redundant features and improve classification performance. We also compared support vector machine learning methods, namely equal-cost learning and cost-sensitive learning, to tackle the class imbalance in the dataset. The best classifier, using 39 selected features, achieved an area under the curve of 0.969 in receiver operating characteristic analysis, which indicates that our method can predict diabetic nephropathy with high generalization performance from an irregular and unbalanced dataset and that physicians can benefit from it when predicting diabetic nephropathy.
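
The abstract above contrasts equal-cost and cost-sensitive learning but does not state the costs used. As a hedged illustration, one common way to set per-class misclassification costs is to weight each class inversely to its frequency. The function name `balanced_class_weights` and the 90/10 label split below are assumptions for illustration only, not the study's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical unbalanced labels: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on the rare class costs nine times as much as an error on the common class, whereas equal-cost learning corresponds to a weight of 1 for every class. The same inverse-frequency heuristic is what scikit-learn applies when an SVM is created with `class_weight="balanced"`.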