The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)
Journal of Intelligence and Information Systems, v.26 no.1, pp.23-45, 2020
Big data is being generated in a wide variety of fields, such as medical care, manufacturing, logistics, sales sites, and SNS, and the characteristics of the resulting datasets are equally diverse. To stay competitive, companies need to improve their decision-making capacity by using classification algorithms. However, most of them lack sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a given dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, the meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among those, we included the Herfindahl-Hirschman Index (HHI), originally an index of market concentration, to replace the Imbalance Ratio (IR). We also developed a new index, the Reverse ReLU Silhouette Score, and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), and Contraceptive Method Choice) were selected. The classes of each dataset were predicted using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross-validation.
Oversampling at rates from 10% to 100% was applied to each fold, and the meta-features of the resulting dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. The F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The HHI meta-feature proposed in this study was significant for classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, its effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent with these findings. In the regression analysis by classification algorithm, the number of variables does not have a significant effect on the performance of the Naïve Bayes algorithm, unlike the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be used to develop a classification algorithm recommendation system that accounts for the characteristics of datasets.
(2) Because the characteristics of data differ, data scientists often search for the optimal algorithm for a given situation by repeatedly testing and adjusting algorithm parameters. This process wastes considerable resources in hardware, cost, time, and manpower. This study is expected to be useful to machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The paper consists of an introduction, related research, the research model, experiments, and conclusion and discussion.
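As a rough illustration of the class-distribution meta-features named in this abstract, the following sketch computes HHI, the imbalance ratio, and class entropy from a list of labels. It assumes that HHI is the sum of squared class shares (the standard market-concentration form) and that Entropy means the Shannon entropy of the class distribution; the abstract does not give the exact formulas, and the function names and sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def class_shares(labels):
    """Return the share of each class in the label list (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def hhi(labels):
    """Herfindahl-Hirschman Index: sum of squared class shares (assumed form).

    Ranges from 1/k for k perfectly balanced classes up to 1.0 for one class.
    """
    return sum(share * share for share in class_shares(labels))


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution (assumed form)."""
    return -sum(share * log2(share) for share in class_shares(labels))


if __name__ == "__main__":
    # Hypothetical three-class label vector with a 6:3:1 class distribution.
    sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
    print("HHI:", hhi(sample))
    print("IR:", imbalance_ratio(sample))
    print("Entropy:", class_entropy(sample))
```

Unlike the imbalance ratio, which looks only at the largest and smallest classes, this form of HHI uses every class share, which may be one reason to prefer it for multi-class data.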
Malnutrition in hospitalized patients can adversely affect clinical outcomes and cost. Several nutritional screening tools have been developed to identify patients at risk of malnutrition. However, many of them have practical drawbacks: they require much time and labor to administer and may not be highly applicable to a Korean population. This study sought to develop and evaluate a Nutrition Risk Screening Tool (NRST) that is simple and quick to administer and widely applicable to Korean hospitalized patients with various diseases. The study was also designed to produce a screening tool that can predict various clinical outcomes and to validate it against the Nutritional Risk Screening 2002 (NRS 2002). Electronic medical records of 424 patients hospitalized at a general hospital in Seoul during a 14-month period were abstracted for anthropometric, medical, biochemical, and clinical outcome variables. The study employed a 4-step process: selecting NRST components, searching for a scoring scheme, validating against a reference tool, and confirming clinical outcome predictability. NRST components were selected by stepwise multiple regression analysis of each clinical outcome (i.e., hospitalization period, complication, disease progress, and death) on several readily available patient characteristics. Age and serum levels of albumin, hematocrit (Hct), and total lymphocyte count (TLC), which remained in the final model for any of the 4 dependent variables, were chosen as NRST components. Odds ratios of malnutrition risk based on NRS 2002, by level of each selected component, were used to frame the NRST scoring scheme. An NRST score higher than 3.5 was set as the cut-off for malnutrition risk, based on sensitivity and specificity against NRS 2002. Lastly, differences in clinical outcomes by patients' NRST results were examined. The results showed that the NRST can significantly predict in-hospital clinical outcomes.
It is concluded that the NRST can be used to screen patients at high nutritional risk simply and quickly in relation to prospective clinical outcomes.
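The cut-off validation step described above can be sketched as follows: a minimal example that computes sensitivity and specificity of a score against a reference classification at a given cut-off. The scores and reference labels are made-up illustration data, not values from the study, and the function name is hypothetical; only the rule "score higher than 3.5 means at risk" comes from the abstract.

```python
def sensitivity_specificity(scores, reference_at_risk, cutoff):
    """Compare 'score > cutoff' against a reference at-risk classification.

    Returns (sensitivity, specificity). Assumes the reference contains at
    least one at-risk and one not-at-risk patient; otherwise a division by
    zero occurs.
    """
    tp = fn = tn = fp = 0
    for score, at_risk in zip(scores, reference_at_risk):
        flagged = score > cutoff  # screening result at this cut-off
        if at_risk and flagged:
            tp += 1
        elif at_risk:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical screening scores and reference classifications.
    scores = [1, 2, 3, 4, 5, 6]
    reference = [False, False, True, False, True, True]
    sens, spec = sensitivity_specificity(scores, reference, cutoff=3.5)
    print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Repeating the calculation over a range of candidate cut-offs and comparing the resulting pairs is one common way to choose a threshold such as 3.5.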
The purpose of this study is to identify and evaluate the competitiveness of ports in ASEAN (Association of Southeast Asian Nations), a region that plays a leading role as a hub base in international logistics strategies responding to changes in the international logistics environment. At the onset of the 21st century, this region shows the most severe competition among mega hub ports in the world in terms of container cargo throughput. The research method accounts for overlap between attributes and introduces the HFP method, which can perform mathematical operations. The scope of this study was strictly confined to those ASEAN ports that rank in the top 100 of the 350 container ports listed in the Containerization International Yearbook 2002 by container throughput. The results of this study place Singapore in the number one position. Even when compared with major ports in Korea (after obtaining comparative ratings and applying the same data and evaluation structure), the number one position still goes to Singapore, then Busan (2) and Manila (2), followed by Port Klang (4), Tanjung Priok (5), Tanjung Perak (6), Bangkok (7), Inchon (8), Laem Chabang (9), and Penang (9). As for its main contribution, this is the first empirical study to apply a combination of detailed and representative attributes to the advanced HFP model, enhanced by the KJ method, to evaluate port competitiveness in ASEAN. To date, no study has comprehensively applied a sophisticated port evaluation methodology that takes into account the variety of changes in port development and the terminal transfers of major shipping lines. Moreover, the comparative evaluation of major ports in Korea and ASEAN, which presents the comparative competitiveness of Korean ports, is a notable achievement of this study.
To reinforce this study, further complementary research is needed, including cost factors, which could not be applied in modeling the subject ports because of a lack of consistent, qualified data in ASEAN.
Transportation project appraisal should be precise in order to increase social welfare and efficiency, yet it has been carried out with only a single-criterion analysis such as benefit/cost analysis. However, this method cannot assess some qualitative items and cannot reach a proper solution when interests clash among various groups. Multi-criteria analysis, which can handle these problems, is therefore needed, and Saaty developed one such method, the AHP (Analytic Hierarchy Process). In AHP, a project is evaluated through weighted scores of the criteria and the alternatives, which are obtained from a questionnaire survey of specialists. It rests on several strict assumptions, such as reciprocal comparison, homogeneity, expectation, and independence among the criteria, but assuming that each criterion is independent of the others is difficult for two reasons. First, in real situations, criteria cannot be perfectly independent of one another. Second, individuals, even specialists in the area, do not perceive the degree of independence in the same way as others. This paper develops a modified AHP method to address this dependence among criteria. In this method, the degree of dependence among criteria as perceived by the specialist is surveyed and included in the weighted scores of the criteria. This study proposes three methods to implement this idea. The first model applies the degree of dependence in the first step of calculating the weighted scores, and the other two adjust the weighted scores obtained from the basic AHP method according to the dependence. Of these two, one distributes the cross-weighted score to each criterion by a constant ratio, and the other splits it using fuzzy measures such as Bel and Pl.
Finally, to validate these methods, this paper applies them to evaluate alternatives for controlling public resentment against Korean railway lines passing through urban areas.
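For readers unfamiliar with the basic AHP step that the modified methods build on, the following is a minimal sketch of deriving criterion weights from a reciprocal pairwise comparison matrix. It uses the row geometric-mean approximation rather than the principal eigenvector, and it does not implement the dependence adjustments proposed in the paper; the matrix values and function names are hypothetical.

```python
from math import prod


def is_reciprocal(matrix, tol=1e-9):
    """Check the AHP reciprocity assumption: a[i][j] * a[j][i] == 1."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] * matrix[j][i] - 1.0) <= tol
        for i in range(n)
        for j in range(n)
    )


def ahp_weights(matrix):
    """Approximate AHP priority weights by normalized row geometric means."""
    n = len(matrix)
    geo_means = [prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


if __name__ == "__main__":
    # Hypothetical comparison of three criteria; a[i][j] states how much
    # more important criterion i is judged to be than criterion j.
    comparisons = [
        [1.0, 2.0, 6.0],
        [1.0 / 2.0, 1.0, 3.0],
        [1.0 / 6.0, 1.0 / 3.0, 1.0],
    ]
    assert is_reciprocal(comparisons)
    print(ahp_weights(comparisons))
```

For a perfectly consistent matrix such as this one, the geometric-mean weights coincide with the eigenvector weights; for inconsistent judgments the two can differ slightly.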
We identified the current problems of the restoration deposit-estimation system stipulated by the Mountainous Districts Management Act using the Delphi technique, and proposed standard models for forest land restoration to derive a reasonable deposit-estimation system. The Delphi survey showed that the land-use type and slope gradient classifications were inappropriate and that the insufficiency of standard works was a significant problem in the current system. To solve these problems, we reorganized the current land-use types by the subject of the site. The specific subjects included the following: (i) to permit or report forest land-use change and temporary use of forest land, (ii) to report temporary use of forest land, (iii) to permit stone collection or sale for mineral mining, and (iv) to allow sediment collection. Subdividing the current slope gradient classes into (a) θ<10°, (b) 10°≦θ<15°, (c) 15°≦θ<20°, (d) 20°≦θ<25°, (e) 25°≦θ<30°, and (f) θ≧30° and reorganizing the 17 standard works into 22 standard works, along with seven additional works, were adopted as solutions. We developed 24 standard models for the forest land restoration project based on these results. The deposits estimated by these models ranged from 34,185,000 Korean won to 607,403,000 won. If additional works, premiums, discounts, and supervision fees are added to the models, the deposit increases to an estimated 668,143,000 won for sites subject to permission for stone collection or sale and mineral mining. In the Delphi survey, experts showed a high level of agreement on the distribution of the restoration deposits estimated by these models. Our findings are expected to help secure an appropriate level of restoration cost deposits for the smooth performance of vicariously executed restoration projects.
With the development of fourth industrial revolution technologies, efforts are being made to improve areas that humans cannot handle by using artificial intelligence techniques such as machine learning. On-demand production companies also want to reduce corporate risks such as delivery delays by predicting the total production time of orders, but they have difficulty doing so because the total production time differs for each order. The Theory of Constraints (TOC) was developed to find the least efficient areas in order to increase order throughput and reduce total order cost, but it does not provide a forecast of total production time. Because production varies from order to order with diverse customer needs, the total production time of an individual order can be measured after the fact but is difficult to predict in advance. The measured total production times of existing orders also differ from one another, so they cannot be used as standard times. As a result, experienced managers rely on intuition rather than on the system, while inexperienced managers use simple management indicators (e.g., 60 days total production time for raw materials, 90 days total production time for steel plates, etc.). Work instructions issued too early on the basis of imperfect intuition or indicators cause congestion, which degrades productivity, while instructions issued too late increase production costs or cause missed delivery dates because of emergency processing. Missing a delivery date results in compensation for the delay or adversely affects sales and collections. To address these problems, this study seeks a machine learning model that estimates the total production time of new orders for a company operating an order production system. Order, production, and process performance data are used as the training material for machine learning.
We compared and analyzed OLS, GLM Gamma, Extra Trees, and Random Forest as candidate algorithms for estimating total production time and present the results.
The software industry is a high value-added industry in the knowledge information age, and its importance is growing as it not only plays a key role in knowledge creation and utilization but also helps secure global competitiveness. Among the various software (SW) available in today's business environment, Open Source Software (OSS) is rapidly expanding its area of activity by not only leading software development but also integrating with new information technology. The purpose of this research is therefore to empirically examine the factors that affect switching behavior to OSS. To this end, we propose a research model based on the "Push-Pull-Mooring" framework. This study empirically examines two categories of antecedents of switching behavior toward OSS. A survey was conducted of employees at various firms that had already switched to OSS. A total of 268 responses were collected and analyzed using structural equation modeling. The results are as follows. First, continuous maintenance cost, vendor dependency, functional indifference, and SW resource inefficiency are significantly related to switching to OSS. Second, network-oriented support, testability, and strategic flexibility are significantly related to switching to OSS. Finally, willingness to secure SW competitiveness has a moderating effect on the relationships between the push and pull factors, with the exception of improved knowledge, and switching to OSS. The results of this study will contribute to OSS-related fields both theoretically and practically.
As the pace of competition dramatically accelerates and the complexity of change grows, a variety of research has been conducted to improve firms' short-term performance and to enhance firms' long-term survival. In particular, researchers and practitioners have paid attention to identifying promising technologies that give a firm competitive advantage. The discovery of promising technologies depends on how a firm evaluates the value of technologies, and many evaluation methods have therefore been proposed. Approaches based on experts' opinions have been widely accepted for predicting the value of technologies. While this approach provides in-depth analysis and ensures the validity of analysis results, it is usually costly and time-consuming and is limited to qualitative evaluation. A considerable number of studies attempt to forecast the value of technology using patent information to overcome the limitations of the expert-opinion approach. Patent-based technology evaluation has served as a valuable assessment approach in technological forecasting because patents contain a full and practical description of technology in a uniform structure. Furthermore, they provide information that is not divulged in any other source. Although the patent-information approach has contributed to our understanding of how to predict promising technologies, it has some limitations because predictions are made from past patent information and the interpretations of patent analyses are not consistent. To fill this gap, this study proposes a technology forecasting methodology that integrates the patent-information approach with an artificial intelligence method. The methodology consists of three modules: evaluation of the promise of technologies, implementation of a technology value prediction model, and recommendation of promising technologies. In the first module, the promise of technologies is evaluated from three different and complementary dimensions: impact, fusion, and diffusion.
The impact of technologies refers to their influence on future technology development and improvement, and is also clearly associated with their monetary value. The fusion of technologies denotes the extent to which a technology fuses different technologies, and represents the breadth of search underlying the technology. Fusion can be calculated per technology or per patent, so this study measures two types of fusion index: fusion index per technology and fusion index per patent. Finally, the diffusion of technologies denotes their degree of applicability across scientific and technological fields. In the same vein, a diffusion index per technology and a diffusion index per patent are considered. In the second module, the technology value prediction model is implemented using an artificial intelligence method. This study uses the values of five indexes (i.e., impact index, fusion index per technology, fusion index per patent, diffusion index per technology, and diffusion index per patent) at different times (e.g., t-n, t-n-1, t-n-2,
The wall shear stress in the vicinity of end-to-end anastomoses under steady flow conditions was measured using a flush-mounted hot-film anemometer (FMHFA) probe. The experimental measurements were in good agreement with numerical results except in flow with low Reynolds numbers. The wall shear stress increased proximal to the anastomosis in flow from the Penrose tubing (simulating an artery) to the PTFE graft. In flow from the PTFE graft to the Penrose tubing, low wall shear stress was observed distal to the anastomosis. Abnormal distributions of wall shear stress in the vicinity of the anastomosis, resulting from the compliance mismatch between the graft and the host artery, might be an important factor in the formation of anastomotic neointimal fibrous hyperplasia (ANFH) and in graft failure. The present study suggests a correlation between regions of low wall shear stress and the development of ANFH in end-to-end anastomoses.
Air pressure decay (APD) rate and ultrafiltration rate (UFR) tests were performed on new and saline-rinsed dialyzers as well as those reused in patients several times. C-DAK 4000 (Cordis Dow) and CF IS-11 (Baxter Travenol) reused dialyzers obtained from the dialysis clinic were used in the present study. The new dialyzers exhibited a relatively flat APD, whereas saline-rinsed and reused dialyzers showed a considerable amount of decay. C-DAK dialyzers had a larger APD(11.70