• Title/Summary/Keyword: Random Model

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods for handling big data in text mining. For dimensionality reduction, we should consider the density of data, which has a significant influence on the performance of sentence classification. Higher-dimensional data requires more computation, which can lead to high computational cost and overfitting in the model. Thus, a dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed, ranging from merely lessening noise in the data, such as misspellings or informal text, to including semantic and syntactic information. In addition, the representation and selection of text features affect the performance of the classifier in sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data from the observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words and can capture semantic and syntactic information from data, are also utilized. To improve performance, recent studies have suggested methods in which the word dictionary is modified according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm identifies words that are not important, we assume that words similar to them also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification: conducting selective word elimination under specific rules and constructing word embeddings based on Word2Vec.
To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words that have comparatively low information gain values from the raw text and form the word embedding. Second, we additionally select, for removal, words that are similar to the words with low information gain values and form the word embedding. Finally, the filtered text and word embeddings are applied to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets, and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were classified as helpful reviews. Yelp, however, shows only the number of helpful votes. We extracted 100,000 reviews that received more than five helpful votes by random sampling from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words. We showed that one of the proposed methods performs better than the embeddings with all the words. Removing unimportant words improves performance; however, removing too many words lowers it. Future research should consider diverse ways of preprocessing and an in-depth analysis of word co-occurrence to measure similarity values among words. Also, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed methods, making it possible to identify effective combinations of word embedding methods and elimination methods.
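The two elimination steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are treated as sets of tokens, information gain is computed on word presence only, and the function names, thresholds, and toy vectors are hypothetical. In the study the vectors would come from a trained Word2Vec model.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """Drop in label entropy after splitting documents on presence of `word`."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def words_to_remove(docs, labels, vectors, ig_threshold, sim_threshold):
    """Step 1: words with low information gain.
    Step 2: remaining words whose vector is close to any low-gain word."""
    vocab = set().union(*docs)
    low_ig = {w for w in vocab if information_gain(docs, labels, w) < ig_threshold}
    similar = set()
    for w in vocab - low_ig:
        if w not in vectors:
            continue  # no embedding available for this word
        if any(s in vectors and cosine(vectors[w], vectors[s]) >= sim_threshold
               for s in low_ig):
            similar.add(w)
    return low_ig, similar

# Toy data (hypothetical): four labeled reviews and 2-d stand-in embeddings.
docs = [{"great", "the", "book", "novel"}, {"awful", "the", "book"},
        {"great", "a", "read"}, {"awful", "a", "read"}]
labels = [1, 0, 1, 0]
vectors = {"book": [1.0, 0.0], "novel": [0.9, 0.1], "the": [0.5, 0.5],
           "great": [0.0, 1.0], "awful": [0.0, -1.0]}
low_ig, similar = words_to_remove(docs, labels, vectors, 0.2, 0.95)
```

Both thresholds are tuning knobs; the abstract's observation that removing too many words hurts performance suggests they should be chosen on validation data rather than fixed in advance.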

Estimation of Annual Trends and Environmental Effects on the Racing Records of Jeju Horses (제주마 주파기록에 대한 연도별 추세 및 환경효과 분석)

  • Lee, Jongan;Lee, Soo Hyun;Lee, Jae-Gu;Kim, Nam-Young;Choi, Jae-Young;Shin, Sang-Min;Choi, Jung-Woo;Cho, In-Cheol;Yang, Byoung-Chul
    • Journal of Life Science / v.31 no.9 / pp.840-848 / 2021
  • This study was conducted to estimate annual trends and the environmental effects in the racing records of Jeju horses. The Korean Racing Authority (KRA) collected 48,645 observations for 2,167 Jeju horses from 2002 to 2019. Racing records were preprocessed to eliminate errors that occur during the data collection. Racing times were adjusted for comparison between race distances. A stepwise Akaike information criterion (AIC) variable selection method was applied to select appropriate environment variables affecting racing records. The annual improvement of the race time was -0.242 seconds. The model with the lowest AIC value was established when variables were selected in the following order: year, budam classification, jockey ranking, trainer ranking, track condition, weather, age, and gender. The most suitable model was constructed when the jockey ranking and age variables were considered as random effects. Our findings have potential for application as basic data when building models for evaluating genetic abilities of Jeju horses.

Retrieval of Hourly Aerosol Optical Depth Using Top-of-Atmosphere Reflectance from GOCI-II and Machine Learning over South Korea (GOCI-II 대기상한 반사도와 기계학습을 이용한 남한 지역 시간별 에어로졸 광학 두께 산출)

  • Seyoung Yang;Hyunyoung Choi;Jungho Im
    • Korean Journal of Remote Sensing / v.39 no.5_3 / pp.933-948 / 2023
  • Atmospheric aerosols not only have adverse effects on human health but also exert direct and indirect impacts on the climate system. Consequently, it is imperative to comprehend the characteristics and spatiotemporal distribution of aerosols. Numerous research endeavors have been undertaken to monitor aerosols, predominantly through the retrieval of aerosol optical depth (AOD) via satellite-based observations. Nonetheless, this approach primarily relies on a look-up table-based inversion algorithm, characterized by computationally intensive operations and associated uncertainties. In this study, a novel high-resolution AOD direct retrieval algorithm, leveraging machine learning, was developed using top-of-atmosphere reflectance data derived from the Geostationary Ocean Color Imager-II (GOCI-II), in conjunction with their differences from the past 30-day minimum reflectance, and meteorological variables from numerical models. The Light Gradient Boosting Machine (LGBM) technique was used, and the resulting estimates were validated with random, temporal, and spatial N-fold cross-validation (CV) against ground-based AOD observations from the Aerosol Robotic Network (AERONET). The three CV results consistently demonstrated robust performance, yielding R² = 0.70-0.80, RMSE = 0.08-0.09, and 75.2-85.1% of estimates within the expected error (EE). The Shapley Additive exPlanations (SHAP) analysis confirmed the substantial influence of reflectance-related variables on AOD estimation. A comprehensive examination of the spatiotemporal distribution of AOD in Seoul and Ulsan revealed that the developed LGBM model yielded results that are in close concordance with AERONET AOD over time, thereby confirming its suitability for AOD retrieval at high spatiotemporal resolution (i.e., hourly, 250 m).
Furthermore, upon comparing data coverage, it was ascertained that the LGBM model enhanced data retrieval frequency by approximately 8.8% in comparison to the GOCI-II L2 AOD products, ameliorating issues associated with excessive masking over very illuminated surfaces that are often encountered in physics-based AOD retrieval processes.

Effects of vowel types and sentence positions in standard passage on auditory and cepstral and spectral measures in patients with voice disorders (모음 유형과 표준문단의 문장 위치가 음성장애 환자의 청지각적 및 켑스트럼 및 스펙트럼 분석에 미치는 효과)

  • Mi-Hyeon Choi;Seong Hee Choi
    • Phonetics and Speech Sciences / v.15 no.4 / pp.81-90 / 2023
  • Auditory perceptual assessment and acoustic analysis are commonly used in clinical practice for voice evaluation. This study aims to explore the effects of speech task context on auditory perceptual assessment and acoustic measures in patients with voice disorders. Sustained vowel phonations (/a/, /e/, /i/, /o/, /u/, /ɯ/, /ʌ/) and connected speech (a standardized paragraph 'kaeul' and nine sub-sentences) were obtained from a total of 22 patients with voice disorders. GRBAS ('G', 'R', 'B', 'A', 'S') and CAPE-V ('OS', 'R', 'B', 'S', 'P', 'L') auditory-perceptual assessments were performed by two certified speech-language pathologists specializing in voice disorders using blinded, randomized voice samples. Additionally, spectral and cepstral measures were analyzed using the Analysis of Dysphonia in Speech and Voice (ADSV) model. When voice quality was assessed with the GRBAS scale, it was not significantly affected by vowel type except for 'B', while 'OS', 'R' and 'B' in CAPE-V were affected by vowel type (p<.05). In addition, measurements of CPP and L/H ratio were influenced by vowel types and sentence positions. CPP values in the standard paragraph showed significant negative correlations with all vowels, with the highest correlation observed for the /e/ vowel (r=-.739). The CPP of the second sentence had the strongest correlation with all vowels. The speech stimulus may have a greater impact on auditory-perceptual assessment with CAPE-V than with GRBAS; vowel type and sentence position with consonants influenced the 'B' scale, CPP, and L/H ratio. When using vowels in the voice assessment of patients with voice disorders, it would be beneficial to use not only /a/, but also the vowel /i/, which is acoustically highly correlated with 'breathy'. In addition, the /e/ vowel was highly correlated acoustically with the standardized passage and sub-sentences.
Furthermore, given that most dysphonic signals are aperiodic, the second sentence of the 'kaeul' passage, which is the most acoustically correlated with all vowels, can be used with CPP. These results provide clinical evidence of the impact of speech tasks on auditory perceptual and acoustic measures, which may help to provide guidelines for voice evaluation in patients with voice disorders.

Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach (카테고리 중립 단어 활용을 통한 주가 예측 방안: 텍스트 마이닝 활용)

  • Lee, Minsik;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.123-138 / 2017
  • Since the stock market is driven by the expectations of traders, studies have been conducted to predict stock price movements through analysis of various sources of text data. To predict stock price movements, research has been conducted not only on the relationship between text data and fluctuations in stock prices, but also on trading stocks based on news articles and social media responses. Studies that predict stock price movements have also applied classification algorithms by constructing a term-document matrix, in the same way as other text mining approaches. Because a document contains many words, it is better to select the words that contribute more when building a term-document matrix. Based on word frequency, words that show too little frequency or importance are removed. Words are also selected according to their contribution, by measuring the degree to which a word contributes to correctly classifying a document. The basic idea of constructing a term-document matrix has been to collect all the documents to be analyzed and to select and use the words that influence the classification. In this study, we analyze the documents for each individual stock and select the words that are irrelevant to all categories as neutral words. We extract the words around each selected neutral word and use them to generate the term-document matrix. The approach starts from the idea that stock movement is less related to the presence of the neutral words themselves, and that the words surrounding a neutral word are more likely to affect stock price movements. The generated term-document matrix is then applied to an algorithm that classifies stock price fluctuations. In this study, we first removed stop words and selected neutral words for each stock. We then excluded, from the selected words, those that also appear in news articles about other stocks.
Through an online news portal, we collected four months of news articles on the top 10 stocks by market capitalization. We used the first three months of news articles as training data and applied the remaining month of news articles to the model to predict the stock price movements of the next day. We used SVM, Boosting and Random Forest to build models and predict stock price movements. The stock market was open for a total of 80 days during the four months (2016/02/01 ~ 2016/05/31); the initial 60 days were used as the training set and the remaining 20 days as the test set. The word-based algorithm proposed in this study showed better classification performance than the word selection method based on sparsity. This study predicted stock price volatility by collecting and analyzing news articles of the top 10 stocks by market cap. We used the term-document matrix based classification model to estimate stock price fluctuations and compared the performance of the existing sparsity-based word extraction method and the suggested method of removing words from the term-document matrix. The suggested method differs from the word extraction method in that it uses not only the news articles for the corresponding stock but also other news items to determine the words to extract. In other words, it removed not only the words that appeared in both the increase and decrease categories but also the words that appeared commonly in the news for other stocks. When the prediction accuracy was compared, the suggested method showed higher accuracy. The limitations of this study are that stock price prediction was set up only to classify rise and fall, and that the experiment was conducted only for the top ten stocks. The 10 stocks used in the experiment do not represent the entire stock market. In addition, it is difficult to show investment performance because stock price fluctuation and profit rate may differ.
Therefore, further research using more stocks, along with yield prediction through trading simulation, is needed.
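The idea of building the term-document matrix from the words surrounding neutral words can be sketched as below. This is an illustrative reading of the abstract, not the authors' code: the window size, the function names, and the toy headlines are assumptions, and the abstract does not say how wide the "surrounding" context is.

```python
from collections import Counter

def context_terms(tokens, neutral_words, window=2):
    """Return the words within `window` positions of each neutral word.
    Neutral words themselves are excluded; overlapping windows count a word twice."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok in neutral_words:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            kept.extend(t for t in tokens[lo:hi] if t not in neutral_words)
    return kept

def term_document_matrix(documents, neutral_words, window=2):
    """Build a count matrix (one row per document) over the context vocabulary."""
    counts = [Counter(context_terms(doc, neutral_words, window)) for doc in documents]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

# Hypothetical tokenized headlines; "acme" plays the role of a neutral word.
articles = [["shares", "of", "acme", "rose", "sharply", "today"],
            ["acme", "reported", "weak", "earnings"]]
vocab, matrix = term_document_matrix(articles, {"acme"}, window=1)
```

The resulting rows could then be fed to any of the classifiers the abstract names (SVM, Boosting, Random Forest), with the next day's rise or fall as the label.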

The Effect of Non-genetic Factors on Birth Weight and Weaning Weight in Three Sheep Breeds of Zimbabwe

  • Assan, N.;Makuza, S.M.
    • Asian-Australasian Journal of Animal Sciences / v.18 no.2 / pp.151-157 / 2005
  • Sheep production is affected by genetic and non-genetic factors. Knowledge of these factors is essential for efficient management and for the accurate estimation of breeding values. The objective of this study was to establish the non-genetic factors which affect birth weight and weaning weight in Dorper, Mutton Merino and indigenous Sabi sheep breeds. A total of 2,625 birth and weaning weight records from Grasslands Research Station, collected from 1991 through 1993, were used. The records were collected from indigenous Sabi (939), Dorper (807) and Mutton Merino (898) sheep. A mixed classification model containing the fixed effects of year, birth status and sex was used for identification of non-genetic factors. Sire within breed was included as a random effect. Two-factor interactions and three-factor interactions were important in indigenous Sabi, Mutton Merino and Dorper sheep. The mean birth weights were 4.37±0.04 kg, 4.62±0.04 kg and 3.29±0.04 kg for Mutton Merino, Dorper and Sabi sheep, respectively. Sire had significant effects (p<0.05) on birth weight in Mutton Merino and indigenous Sabi sheep. Year of lambing had significant effects (p<0.05) on birth weight in indigenous Sabi, Mutton Merino and Dorper sheep. The effect of birth status was non-significant in Dorper and Mutton Merino sheep, while the effect of birth status on birth weight was significant in indigenous Sabi sheep. In indigenous Sabi sheep, lambs born as singles (3.30±0.05 kg) were 0.23 kg heavier than twins (3.07±0.05 kg); in Mutton Merino, lambs born as singles (3.99±0.08 kg) were 0.07 kg heavier than twins (3.92±0.08 kg); and in Dorper, lambs born as singles (4.41±0.04 kg) were 0.02 kg heavier than twins (4.39±0.04 kg). On average males were heavier than females (p<0.05), weighing 3.32±0.04 kg vs. 3.05±0.07 kg in indigenous Sabi, 4.73±0.03 kg vs. 4.08±0.05 kg in Dorper and 4.26±0.07 kg vs. 3.66±0.09 kg in Mutton Merino sheep. Two-way factor interactions of sire*year, year*sex and sex*birth status had significant effects (p<0.05) on birth weight in indigenous Sabi, Mutton Merino and Dorper sheep, while the effect of year*birth status on birth weight was non-significant in indigenous Sabi sheep. The three-way factor interaction of year*sex*birth status had a significant effect (p<0.01) on birth weight in indigenous Sabi and Mutton Merino. Tupping weight fitted as a covariate had significant effects (p<0.001) on birth weight in indigenous Sabi, Mutton Merino and Dorper sheep. The mean weaning weights were 17.94±0.31 kg, 18.19±0.28 kg and 14.39±0.28 kg for Mutton Merino, Dorper and indigenous Sabi sheep, respectively. Effects of sire and sire*year were non-significant on weaning weight in Dorper and Mutton Merino, while year, sex and sex*year interaction had significant effects (p<0.001) on weaning weight. On average males were heavier than females (p<0.001) at weaning. The respective weaning weights were 18.05±0.46 kg, 18.68±0.19 kg, 14.14±0.15 kg for males and 16.64±0.60 kg, 16.41±0.31 kg, 12.64±0.32 kg for females in Mutton Merino, Dorper and indigenous Sabi sheep. Lambs born as singles were significantly heavier at weaning than twins, by 0.05 kg, 0.06 kg and 0.78 kg for Mutton Merino, Dorper and indigenous Sabi sheep, respectively. The effect of tupping weight on weaning weight was highly significant. The three-way factor interaction year*sex*birth status had a significant effect (p<0.01) on weaning weight. Correction for environmental effects is necessary to increase accuracy of direct selection for birth weight and weaning weight.

Analyses of the Efficiency in Hospital Management (병원 단위비용 결정요인에 관한 연구)

  • Ro, Kong-Kyun;Lee, Seon
    • Korea Journal of Hospital Management / v.9 no.1 / pp.66-94 / 2004
  • The objective of this study is to examine how to maximize the efficiency of hospital management by minimizing the unit cost of hospital operation. For this purpose, this paper proposes a model of profit maximization based on the cost minimization dictum, using the statistical tools of maximum likelihood estimation. The preliminary survey data were collected from the annual statistics and their analyses published by the Korea Health Industry Development Institute and the Korean Hospital Association. The maximum likelihood analyses were conducted on cost (function) information from each of 36 hospitals selected by stratified random sampling according to hospital size and location (urban or rural). We believe that, although the sample size is relatively small, the estimation power of the statistical analyses of the sample hospitals is acceptable because of the sampling method used and the high response rate. The conceptual framework of the analyses is adopted from the various models of the determinants of hospital costs used in previous studies. According to this framework, the study postulates that the unit cost of hospital operation is determined by size, scope of service, technology (production function) as measured by capacity utilization, labor-capital ratio and labor input-mix variables, and by exogenous variables. The variables representing these cost determinants were selected by stepwise regression so that only statistically significant variables are used in analyzing how they affect the hospital unit cost. The results of the analyses show that the adopted models of hospital cost determinants are well chosen. The overall coefficients of determination (R², goodness of fit) of the various models analyzed all turned out to be significant, regardless of the variables used to represent the cost determinants.
Specifically, the size and scope of service, no matter how they are measured (i.e., number of admissions per bed, number of ambulatory visits per bed, adjusted inpatient days and adjusted outpatients), have the overall effect of reducing hospital unit costs as measured by the cost per admission, per inpatient day, or per office visit, implying the existence of economies of scale in hospital operation. Thirdly, the technology used in operating a hospital turned out to affect the hospital unit cost in ways similar to those postulated in the static theory of the firm. For example, capacity utilization as represented by inpatient days per employee turned out to have a statistically significant negative impact on the unit cost of hospital operation, while payroll expenses per inpatient cost have a positive effect. The input mix of hospital operation, as represented by the ratio of the number of doctors, nurses or medical staff per general employee, supports the known thesis that specialized manpower costs more than general employees. The labor/capital ratio as represented by employees per 100 beds is shown to have a positive effect on cost, as expected. As for the impact of the exogenous variable on cost, when this variable is represented by the percent of urban 100 population at the location where the hospital is located, the regression analysis shows that hospitals located in urban areas have a higher cost than those in rural areas. Finally, the case study of the sample hospitals offers hospital administrators specific information about how they fare in terms of the costs they incur in comparison to other hospitals. For example, if his/her hospital is small and located in a city, he/she can compare the various costs of his/her hospital operation with those of other similar hospitals.
Therefore, he/she may be able to identify which cost determinants explain why the cost of his/her hospital operation is higher or lower than that of other similar hospitals.

Vitamin D and Risk of Respiratory Tract Infections in Children: A Systematic Review and Meta-analysis of Randomized Controlled Trials (비타민 D와 소아 호흡기 감염의 위험성: 무작위 대조 연구에 대한 체계적 문헌고찰 및 메타분석)

  • Ahn, Jong Gyun;Lee, Dokyung;Kim, Kyung-Hyo
    • Pediatric Infection and Vaccine / v.23 no.2 / pp.109-116 / 2016
  • Purpose: Recent observational studies have found that vitamin D deficiency is associated with respiratory tract infections. However, randomized controlled trials (RCTs) regarding the efficacy of vitamin D in childhood respiratory tract infection (RTI) have yielded inconsistent results. We performed a systematic review and meta-analysis to evaluate the association between vitamin D supplementation and the risk of RTI. Methods: A comprehensive search was conducted using MEDLINE, EMBASE, and the Cochrane Central Register of Controlled Trials. Randomized controlled trials of vitamin D supplementation for prevention of RTI in children were included in the analysis. The Cochrane Collaboration's tool for assessing the risk of bias was used to assess the quality of the studies. Pooled risk ratios with 95% confidence intervals (CIs) were meta-analyzed using Review Manager 5.3. Results: A total of seven RCTs were included in the meta-analysis. According to a random-effects model, the risk ratio for vitamin D supplementation was 0.82 (95% CI: 0.69-0.98), with I² = 62% for heterogeneity. On subgroup analysis, heterogeneity decreased in the subgroups with follow-up of less than 1 year, participants ≥5 years of age, patients, and daily dosing. The funnel plot showed that there might be publication bias in the field. Conclusions: The present meta-analysis supports a beneficial effect of vitamin D supplementation for the prevention of RTI in children. However, the result should be interpreted with caution due to limitations including a small number of available RCTs, heterogeneity among the studies, and potential publication bias.
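For readers unfamiliar with how a random-effects pooled risk ratio and I² are obtained, the sketch below uses the DerSimonian-Laird estimator with inverse-variance weights. This is one common choice and an assumption here; the abstract does not state which estimator was configured in Review Manager, and the input numbers are made up for illustration, not taken from the seven trials.

```python
import math

def random_effects_pool(log_rr, se):
    """Pool per-study log risk ratios with the DerSimonian-Laird method.
    Returns (pooled risk ratio, 95% CI tuple, I^2 in percent)."""
    w = [1 / s ** 2 for s in se]                       # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rr))  # Cochran's Q
    df = len(log_rr) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                      # between-study variance
    w_re = [1 / (s ** 2 + tau2) for s in se]           # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, log_rr)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    ci = (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled))
    return math.exp(pooled), ci, i2

# Hypothetical inputs: log risk ratios and standard errors from three studies.
rr, ci, i2 = random_effects_pool([math.log(0.7), math.log(0.9), math.log(1.05)],
                                 [0.10, 0.15, 0.12])
```

When the between-study variance estimate is zero, the result coincides with the fixed-effect estimate; larger heterogeneity widens the confidence interval and moves the study weights closer to equal.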

A Study on the Factors Affecting Health Promoting Lifestyles of Some Workers (일부 직업인의 건강증진생활양식에 영향을 미치는 요인 연구)

  • Lee Eun-Kyoung;An Byung-Sang;Yu Taek-Su;Kim Seoung-Cheon;Jeung Jea-Yeal;Park Young-Shin;Jahng Doo-Sub;Song Yung-Sun;Lee Ki-Nam
    • Journal of Society of Preventive Korean Medicine / v.4 no.2 / pp.119-141 / 2000
  • Industrial health services are currently shifting from secondary and tertiary prevention toward health improvement focused on primary prevention, and Oriental medicine can provide such primary prevention-focused service owing to the characteristics of its science. In particular, the advanced concept of health improvement matches the science of health care in Oriental medicine. Notably, what is most important in health improvement is our lifestyle. This is not to underestimate socio-environmental factors, whose importance has lessened due to modernism. The approach of Oriental medicine gives more weight to individuals' lifestyle and health care through self-cultivation. This matches the new model of advanced health business. Oriental medicine is less systematized than Western medicine, but it can provide ample content that enhances health. If we design a health-improvement program based on the advantages of these two medical systems, it will benefit workers' health. Also, a health program needs to define the factors that determine individual lives, and to provide information and technologies essential to our lives. The Oriental medicine approach puts more stress on a subject's capabilities than on the effect of the surrounding environment. This needs to be supported theoretically by not only defining the relations between an individual's health state and lifestyle, but also identifying the degree to which individuals in the industrial workplace practice a health-improvement lifestyle. This is the first step toward initiating health-improvement business. To do this, this researcher conducted a survey by random sampling of workers and drew the following conclusions.
1. The sampled group is categorized, by gender, into female 6.6% and male 93.4%, with males dominant; by marital status, into unmarried 43.9% and married 55.6%, in similar proportions; and, by age, into below 30, 48.4%; between 30 and 39, 27.4%; between 40 and 49, 18.2%; and over 50, 6.0%. The group is further categorized, by education, into middle school or under 1.7%, high school 30.5%, and junior college or higher 65.8%, with high school and higher dominant; and, by income, into below 1.7 million won 24.2%, below 2.4 million won 14.8%, and above 2.4 million won 6.3%. By job, the group is categorized into collegians 23.9%, office workers 10.3%, and professionals 65.8%; this group consists mostly of office workers and does not include the production workers needed for this research. 2. The subjects selected for this survey show a degree of practicing a health-improvement lifestyle averaging 2.63, with health management pattern at 2.64 and health-related awareness at 2.62. The sub-divisions of health-improvement lifestyle score, from highest to lowest, social emotion (2.87), food (2.66), favorite food (2.59), and leisure activities (2.52). Health awareness (2.47) and safety awareness (2.40) score lower than the sub-divisions of health management pattern. 3. In the area of using leisure time for health improvement, males, older people, married people, and people with higher income earn higher marks. In the area of food management, older and married people earn higher marks. In the area of favorite food management, females, the lower-income bracket, and the less educated show a higher degree of practice, while in the area of social emotion management, older people, married people, and the higher-income bracket show higher marks. In addition, in the area of health awareness, older people, married people, and people with higher income show a higher degree of practice. 4.
To examine correlations among the overall and divisional degrees of health-improvement practice, this researcher analyzed the data using Pearson's correlation coefficient. The lifestyle shows significant correlation with its six sub-divisions, and use of leisure time, food, and health awareness all show significant correlation with their sub-divisions. In addition, social emotion and safety awareness show significant correlation with all sub-divisions except favorite food management.

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales sites, and SNS, and dataset characteristics are also diverse. To secure competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most companies do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm is appropriate for the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features reflecting the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally a market concentration measurement index, in the meta-features to replace IR (Imbalanced Ratio). Also, we developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross validation.
Oversampling of 10% to 100% was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The meta-feature HHI proposed in this study was significant for classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, the effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were also consistent. In the regression analysis by classification algorithm, the number of variables does not have a significant effect for the Naïve Bayes algorithm, unlike the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI, Reverse ReLU Silhouette Score) were shown to be significant; (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be utilized in developing a classification algorithm recommendation system according to the characteristics of datasets.
(2) Because data characteristics differ, many data scientists search for the optimal algorithm for a given situation by repeatedly adjusting algorithm parameters. This process wastes hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. This study consists of an introduction, related research, research model, experiment, and conclusion and discussion.
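As a brief illustration of how the Herfindahl-Hirschman Index can serve as a class-imbalance meta-feature, the sketch below computes it as the sum of squared class shares. The function name is hypothetical, and the scaling is an assumption: shares are taken as fractions, so the value ranges from 1/k for k perfectly balanced classes up to 1.0 for a single class, whereas the paper may use a different scaling (HHI is sometimes reported on percentage shares).

```python
from collections import Counter

def class_hhi(labels):
    """Herfindahl-Hirschman Index of a label list: sum of squared class shares."""
    total = len(labels)
    return sum((n / total) ** 2 for n in Counter(labels).values())

# Hypothetical label vectors: one balanced, one dominated by a single class.
balanced = ["a", "b", "c", "a", "b", "c"]
skewed = ["a", "a", "a", "b"]
hhi_balanced = class_hhi(balanced)   # equals 1/3 for three equal classes
hhi_skewed = class_hhi(skewed)       # 0.75**2 + 0.25**2
```

Unlike a ratio of the largest to the smallest class, this measure uses every class share, which is one reason it can be attractive for multi-class datasets.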