• Title/Summary/Keyword: CHAID Analysis

Selecting the optimal threshold based on impurity index in imbalanced classification (불균형 자료에서 불순도 지수를 활용한 분류 임계값 선택)

  • Jang, Shuin;Yeo, In-Kwon
    • The Korean Journal of Applied Statistics / v.34 no.5 / pp.711-721 / 2021
  • In this paper, we propose a method for adjusting thresholds using impurity indices in classification analysis of imbalanced data. Suppose that, for imbalanced binary data, the minority category is Positive and the majority category is Negative. When categories are assigned using the conventional threshold of 0.5, specificity tends to be high in imbalanced data while sensitivity is relatively low. Increasing sensitivity matters when correctly classifying objects in the minority category is relatively important. We explore how to increase sensitivity by adjusting thresholds. Existing studies have adjusted thresholds based on measures such as G-Mean and F1-score; in this paper, we propose a method that selects optimal thresholds using the chi-square statistic of CHAID, the Gini index of CART, and the entropy of C4.5. We also describe how to obtain a unique value when multiple optimal thresholds are found. An empirical analysis uses classification performance metrics to show the improvement over results based on the 0.5 threshold.
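
The idea of choosing a threshold by a split criterion can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes each candidate threshold splits the cases into "predicted Positive" and "predicted Negative" groups, then scores that split with the chi-square statistic of the 2x2 table, the weighted Gini index, or the weighted entropy. The function names and the toy data are hypothetical.

```python
import math

def confusion(scores, labels, t):
    """2x2 counts for the rule 'predict Positive (1) when score >= t'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
    return tp, fp, fn, tn

def gini(pos, neg):
    """Gini index of one group (CART-style impurity)."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def entropy(pos, neg):
    """Shannon entropy in bits of one group (C4.5-style impurity)."""
    n = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            h -= (c / n) * math.log2(c / n)
    return h

def weighted_impurity(tp, fp, fn, tn, impurity):
    """Size-weighted impurity of the two groups created by the threshold."""
    n = tp + fp + fn + tn
    return ((tp + fp) / n) * impurity(tp, fp) + ((fn + tn) / n) * impurity(fn, tn)

def chi_square(tp, fp, fn, tn):
    """Pearson chi-square statistic of the 2x2 table (CHAID-style criterion)."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:  # one group is empty: no association can be measured
        return 0.0
    return n * (tp * tn - fp * fn) ** 2 / denom

def best_thresholds(scores, labels):
    """Assumed search: every observed score is a candidate threshold."""
    candidates = sorted(set(scores))
    tables = {t: confusion(scores, labels, t) for t in candidates}
    return {
        "chi_square": max(candidates, key=lambda t: chi_square(*tables[t])),
        "gini": min(candidates, key=lambda t: weighted_impurity(*tables[t], gini)),
        "entropy": min(candidates, key=lambda t: weighted_impurity(*tables[t], entropy)),
    }

# Hypothetical imbalanced data: 3 Positive cases out of 10.
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.6]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(best_thresholds(scores, labels))
```

On this toy data all three criteria select 0.4, whereas the 0.5 rule would classify two of the three Positive cases as Negative. With real data the criteria can disagree or tie, which is where a rule for picking a unique threshold becomes necessary.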

Evaluation on Performance of Accuracy for Analysis and Classification of Data Related to Industrial Accidents (산업재해 데이터의 분석 및 분류를 위한 정확도 성능 평가)

  • Leem Young-Moon;Ryu Chang-Hyun
    • Proceedings of the Safety Management and Science Conference / 2006.04a / pp.51-56 / 2006
  • Recently, data mining techniques have been used for the analysis and classification of data related to industrial accidents. The main objective of this study is to compare the performance of algorithms for analyzing industrial accident data; this paper provides a comparative analysis of five algorithms, CHAID, CART, C4.5, LR (Logistic Regression) and NN (Neural Network), using ROC charts, lift charts and response thresholds. In this study, data on 67,278 accidents were analyzed to create risk groups for a number of complications, including the risk of disease and accident. The sample for this work was chosen from data related to manufacturing industries in Korea over three years $(2002\sim2004)$. According to the results, NN shows excellent performance for the analysis and classification of industrial accident data.

Classification Tree Analysis to Assess Contributing Factors Influencing Biosecurity Level on Farrow-to-Finish Pig Farms in Korea (분류 트리 기법을 이용한 국내 일괄사육 양돈장의 차단방역 수준에 영향을 미치는 기여 요인 평가)

  • Kim, Kyu-Wook;Pak, Son-Il
    • Journal of Veterinary Clinics / v.33 no.2 / pp.107-112 / 2016
  • The objective of this study was to determine potential contributing factors associated with the biosecurity level of farrow-to-finish pig farms and to develop a classification tree model to explore how these factors relate to each other. To this end, the authors analyzed data (n = 193) extracted from a cross-sectional study of 344 farrow-to-finish farms, conducted between March and September 2014 to explore swine disease status at the farm level. Standardized questionnaires covering basic demographic data and management practices were completed at each farm during on-site visits by trained veterinarians. With biosecurity level as the dependent variable, the Chi-squared Automatic Interaction Detection (CHAID) algorithm was applied to build the classification tree. The misclassification risk statistic was used to evaluate the fit of the model in terms of prediction results. Categorical multivariate input data (40 variables) were used to construct the classification tree, and the target variable was biosecurity level dichotomized into low versus high. In general, the level of biosecurity was low in the majority of farms studied, mainly due to the limited implementation of basic on-farm biosecurity measures aimed at controlling the potential introduction and transmission of swine diseases. The CHAID model illustrated the relative importance of significant predictors in explaining the level of biosecurity: maintenance of medical records of treatment and vaccination, use of dedicated clothing to enter the farm, installation of a fence around the farm perimeter, and periodic monitoring of the herd using a written biosecurity plan. The misclassification risk estimate of the prediction model was 0.145 with a standard error of 0.025, indicating that 85.5% of the cases could be classified correctly by the decision rule based on the current tree. Although the CHAID approach can provide detailed information and insight about interactions among factors associated with biosecurity level, further evaluation of potential bias introduced in the course of data collection should be included in future studies. In addition, the findings still need to be validated on an external dataset with a larger sample size to improve the external validity of the current model.
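
The reported risk figures can be checked with a short calculation. The sketch below assumes the standard error is the usual binomial one, sqrt(risk × (1 − risk) / n), and that all n = 193 analyzed farms entered the tree; the abstract states neither, so treat both as assumptions. The function name is hypothetical.

```python
import math

def risk_standard_error(risk, n):
    """Binomial standard error of a misclassification proportion (assumed formula)."""
    return math.sqrt(risk * (1.0 - risk) / n)

risk, n = 0.145, 193          # values taken from the abstract
se = risk_standard_error(risk, n)
accuracy = 1.0 - risk         # share of cases classified correctly

print(f"standard error = {se:.3f}, accuracy = {accuracy:.1%}")
```

Under these assumptions the calculation gives a standard error of about 0.025 and an accuracy of 85.5%, in line with the figures quoted above.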

A Study on the Forecasting Trend of Apartment Prices: Focusing on Government Policy, Economy, Supply and Demand Characteristics (아파트 매매가 추이 예측에 관한 연구: 정부 정책, 경제, 수요·공급 속성을 중심으로)

  • Lee, Jung-Mok;Choi, Su An;Yu, Su-Han;Kim, Seonghun;Kim, Tae-Jun;Yu, Jong-Pil
    • The Journal of Bigdata / v.6 no.1 / pp.91-113 / 2021
  • Despite the weight of real estate in the Korean asset market, market trends are not easy to predict; apartments in particular are hard to predict because they serve both as residential spaces and as investment assets. The factors affecting apartment prices vary, and regional characteristics should also be considered. This study compares the factors and characteristics that affect apartment prices in Seoul as a whole, in the three Gangnam districts, and in the Nowon, Dobong, Gangbuk, Geumcheon, Gwanak and Guro districts, and assesses the feasibility of price prediction on that basis. The analysis used machine learning algorithms such as neural networks, CHAID, linear regression, and random forests. The most important factor affecting the average selling price of all apartments in Seoul was government policy, and easing policies such as relaxed transaction regulations and relaxed financial regulations were highly influential. In the three Gangnam districts, the influence of policy was low, and in Gangnam-gu, housing supply was the most important factor. By contrast, in the six mid-to-lower-tier districts, government policies acted as important variables, and these districts were commonly influenced by financial regulatory policies.

The Prediction Model for Self-Reported Voice Problem Using a Decision Tree Model (의사결정나무 모형을 이용한 주관적 음성장애 예측모형)

  • Byeon, Haewon
    • Journal of the Korea Academia-Industrial cooperation Society / v.14 no.7 / pp.3368-3373 / 2013
  • The purpose of this study was to analyze the risk factors for self-reported voice problems. Data were drawn from the Korea National Health and Nutritional Examination Survey 2008. Subjects were 3,600 persons (1,501 men, 2,099 women) aged 19 years and older. A prediction model was developed using the exhaustive CHAID (Chi-squared Automatic Interaction Detection) algorithm of the decision tree model. In the decision tree analysis, pain and discomfort during the last 2 weeks, age, the longest-held occupation and thyroid disorders were significantly associated with self-reported voice problems. The associated factors identified suggest potential ways of targeting counseling and prevention efforts to control self-reported voice problems.

Market Segmentation of Outpatient Services on the based of Consumption Values in Hospitals (소비가치에 의한 외래서비스 이용환자의 시장세분화에 관한 연구)

  • Kwon, Chin;Lee, Sun-Hee;Sohn, Myong-Sei
    • Korea Journal of Hospital Management / v.2 no.1 / pp.96-113 / 1997
  • This study analyzed the market segmentation of outpatient services on the basis of consumption values. Self-reported questionnaires from 600 outpatients at six hospitals were analyzed across six categories of consumption values: functional values, social values, emotional values, rarity values, situational values, and health-related values. The main results of this research are as follows. 1. Consumption values differed significantly by sociodemographic characteristics. In particular, older respondents, farmers and married people preferred functional, social, emotional and rarity values more than younger and unmarried respondents did. Situational value, however, was rated more positively by younger people and white-collar workers. In addition, housewives, married people and women gave more positive ratings than white-collar workers, unmarried people and men. 2. In the CHAID analysis, the general hospital market was divided into 9 segments, and the major segment consisted of groups who were unconcerned about newness/classiness and preferred proximity to their residence. The university hospital market was divided into 8 segments, and the major segment consisted of groups who considered reliability/social reputation important. The corporate hospital market was divided into 8 segments, and the major segment consisted of groups who considered classiness/newness important. These results show that the health care market can be divided into various segments by demand and that market segmentation is very important for marketing strategy.

A Study on Sensor Data Analysis and Product Defect Improvement for Smart Factory (스마트 팩토리를 위한 센서 데이터 분석과 제품 불량 개선 연구)

  • Hwang, Sewong;Kim, Jonghyuk;Hwangbo, Hyunwoo
    • The Journal of Bigdata / v.3 no.1 / pp.95-103 / 2018
  • In recent years, with the development of ICT, many people in the manufacturing field have been working to increase efficiency by analyzing the data generated in manufacturing processes. In this study, we propose a data-mining-based manufacturing process that uses a decision tree algorithm (CHAID) as part of a smart factory. We used 432 sensor data from an actual manufacturing plant, collected over about 5 months, to find the variables that differ significantly between the stable process period with a low defect rate and the unstable process period with a high defect rate. We set a stable value range for each selected final variable to determine whether it actually affects the defect rate. In addition, we measured the improvement in the defect rate by adjusting the process set-point so that the sensor did not deviate from the stable value range during a 14-day process. Through this, we expect to provide empirical guidelines for improving the defect rate by using and analyzing the process sensor data generated in the manufacturing industry.

A Study on Segmentation of Preferred Characteristics of Rural Tourists after COVID-19 Using Decision Tree Analysis (의사결정나무분석을 활용한 코로나19 이후 농촌관광객의 선호 특성 세분화 연구)

  • Seung-Hun Lee
    • Asia-Pacific Journal of Business / v.14 no.1 / pp.411-426 / 2023
  • Purpose - The purpose of this study was to explore and diagnose the characteristics and behavioural patterns of rural tourists after COVID-19 using decision tree analysis to classify and identify key segmentation groups. Design/methodology/approach - The CHAID algorithm was used as the analysis technique for the decision tree. The explanatory variables used in the analysis of each decision tree model were demographic variables and rural tourism usage behaviour and perception variables, and the target variables were the preferences of rural tourists' activities after COVID-19. From the Rural Tourism 2020 survey data, 614 samples with rural tourism experience were extracted and used in the analysis. Findings - The variables that significantly explained the preference for each type of rural tourism activity after COVID-19 were rural tourism safety perception, repeated visits to the region, rural tourism priority activity, rural tourism accommodation experience, gender, age group, marital status, occupation, and education level. Among them, rural tourism safety perception was the most important explanatory variable in each analysis model. Research implications or Originality - Overall, to promote rural tourism, it is necessary to enhance the safety image of rural tourism, strengthen loyalty programs for repeat visitors, and develop customized products that reflect the preferred trends of rural tourism.

Convergence Analysis of Risk factors for Readmission in Cardiovascular Disease: A Machine Learning Approach (의사결정나무분석을 이용한 심혈관질환자의 재입원 위험 요인에 대한 융합적 분석)

  • Kim, Hyun-Su
    • Journal of Convergence for Information Technology / v.9 no.12 / pp.115-123 / 2019
  • This is a descriptive study based on a secondary analysis of KNHANES IV-VI data on risk factors for readmission among patients with cardiovascular disease. Among a total of 65,973 adults, 1,037 with angina or myocardial infarction were analyzed. The analysis was conducted using SPSS for Windows 21, and a CHAID decision tree was used for the classification analysis. The root node was economic activity (χ²=12.063, p=.001); the child nodes were personal income (χ²=6.575, p=.031), weight change (χ²=12.758, p=.001), residential area (χ²=4.025, p=.045), direct smoking (χ²=3.884, p=.049), and level of education (χ²=9.630, p=.024). The terminal nodes were hypertension (χ²=3.854, p=.050), diabetes mellitus (χ²=6.056, p=.014), and occupation type (χ²=7.799, p=.037). We suggest that developing and operating programs that take an integrated approach to these various factors is necessary for managing readmission in cardiovascular patients.
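
The link between a chi-square statistic and its p-value can be illustrated for the simplest case. The sketch below assumes a test with one degree of freedom (a two-way split on a binary outcome) and no multiplicity adjustment; the abstract does not state the degrees of freedom or whether adjusted p-values are reported, so this applies only to nodes where those assumptions hold. The function name is hypothetical.

```python
import math

def chi2_pvalue_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

# Statistics quoted in the abstract; df = 1 is an assumption, not a reported fact.
reported = [
    ("residential area", 4.025),
    ("hypertension", 3.854),
    ("diabetes mellitus", 6.056),
]
for name, stat in reported:
    print(f"{name}: chi2 = {stat}, p = {chi2_pvalue_df1(stat):.3f}")
```

For these three nodes the one-degree-of-freedom calculation gives .045, .050 and .014, matching the reported values. Nodes such as personal income do not match under this assumption, which would be expected if they involve more categories or an adjusted p-value.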

Selection of an Optimal Algorithm among Decision Tree Techniques for Feature Analysis of Industrial Accidents in Construction Industries (건설업의 산업재해 특성분석을 위한 의사결정나무 기법의 상용 최적 알고리즘 선정)

  • Leem Young-Moon;Choi Yo-Han
    • Journal of the Korea Safety Management & Science / v.7 no.5 / pp.1-8 / 2005
  • Rapid industrial advancement, diversified types of business and unexpected industrial accidents have caused a great deal of human and material damage to many unspecified persons. Although various previous studies have been conducted to prevent industrial accidents, these studies only provide managerial and educational policies using frequency analysis and comparative analysis based on data from past industrial accidents. The main objective of this study is to find an optimal algorithm for analyzing industrial accident data, and this paper provides a comparative analysis of four algorithms: CHAID, CART, C4.5, and QUEST. The decision tree algorithm, a typical data mining technique, is used to predict results from objective and quantified data. Enterprise Miner of SAS and AnswerTree of SPSS are used to evaluate the validity of the results of the four algorithms. The sample for this work was chosen from 19,574 data records related to construction industries in Korea over three years ($2002\sim2004$).