• Title/Summary/Keyword: classification trees


Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm (SVM과 meta-learning algorithm을 이용한 고지혈증 유병 예측모형 개발과 활용)

  • Lee, Seulki;Shin, Taeksoo
    • Journal of Intelligence and Information Systems, v.24 no.2, pp.111-124, 2018
  • This study aims to develop a classification model for predicting the occurrence of hyperlipidemia, a chronic disease. Prior studies that applied data mining techniques to disease prediction fall into two groups: studies designing models to predict cardiovascular disease, and studies comparing the results of disease prediction research. In the international literature, studies predicting cardiovascular disease predominate. Domestic studies do not differ much from international ones, but they have mainly focused on hypertension and diabetes. Because hyperlipidemia is a chronic disease as important as hypertension and diabetes, this study selected hyperlipidemia as the disease to be analyzed. We developed a model for predicting hyperlipidemia using SVM and meta-learning algorithms, which are known to have excellent predictive power. We used the 2012 data set of the Korea Health Panel, which has conducted an annual survey since 2008 and produces basic data on health expenditure, health status, and health behavior. In this study, 1,088 patients with hyperlipidemia were randomly selected from the hospitalization, outpatient, emergency, and chronic disease data of the 2012 Korea Health Panel, and 1,088 non-patients were also randomly extracted, for a total of 2,176 subjects. Three methods were used to select input variables for predicting hyperlipidemia. First, a stepwise method was performed using logistic regression. Among the 17 variables, the categorical variables (except for length of smoking) were expressed as dummy variables, each treated as a separate variable relative to its reference group, and these variables were analyzed. 
Six variables (age, BMI, education level, marital status, smoking status, gender), excluding income level and smoking period, were selected at a significance level of 0.1. Second, C4.5, a decision tree algorithm, was used; the significant input variables were age, smoking status, and education level. Finally, a genetic algorithm was used. For SVM, the genetic algorithm selected six input variables (age, marital status, education level, economic activity, smoking period, and physical activity status); for the artificial neural network, it selected three (age, marital status, and education level). Based on the selected variables, we compared SVM, meta-learning algorithms, and other prediction models for hyperlipidemia, and compared their classification performance using TP rate and precision. The main results of the analysis are as follows. First, the accuracy of the SVM was 88.4% and the accuracy of the artificial neural network was 86.7%. Second, the accuracy of classification models using the input variables selected through the stepwise method was slightly higher than that of classification models using all variables. Third, the precision of the artificial neural network was higher than that of SVM when only the three input variables selected by decision trees were used. With the input variables selected through the genetic algorithm, the classification accuracy of SVM was 88.5% and that of the artificial neural network was 87.9%. Finally, stacking, the meta-learning algorithm proposed in this study, performed best when it used the predicted outputs of SVM and MLP as the input variables of an SVM meta-classifier. The purpose of this study was to predict hyperlipidemia, one of the representative chronic diseases. 
To do this, we used SVM and meta-learning algorithms, which are known to have high accuracy. As a result, the classification accuracy for hyperlipidemia of stacking as a meta-learner was higher than that of other meta-learning algorithms. However, the predictive performance of the meta-learning algorithm proposed in this study is the same as that of SVM, the best-performing single model (88.6%). The limitations of this study are as follows. First, various variable selection methods were tried, but most variables used in the study were categorical dummy variables. With a large number of categorical variables, models well suited to categorical variables, such as decision trees, may fit better than general models such as neural networks, so the results may differ if continuous variables are used. Despite these limitations, this study is significant in predicting hyperlipidemia with hybrid models such as meta-learning algorithms, which have not been studied previously. The improvement in model accuracy obtained by applying various variable selection techniques is also meaningful. In addition, we expect the proposed model to be effective for the prevention and management of hyperlipidemia.
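The abstract above compares classifiers by TP rate and precision. As a minimal sketch of how these two metrics are usually computed for a binary classifier, the following uses only the Python standard library. The function name and the label vectors are illustrative assumptions, not data or code from the study.

```python
def tp_rate_and_precision(y_true, y_pred, positive=1):
    """Return (TP rate, precision) for the class labeled `positive`.

    TP rate (recall) = TP / (TP + FN); precision = TP / (TP + FP).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tp_rate, precision


# Made-up labels: 1 = patient, 0 = non-patient (not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tp_rate, precision = tp_rate_and_precision(y_true, y_pred)
print(tp_rate, precision)
```

With a balanced sample such as the 1,088 patients and 1,088 non-patients described above, accuracy, TP rate, and precision can still diverge, which is one reason to report more than one of them.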

Classification of Urban Green Space Using Airborne LiDAR and RGB Ortho Imagery Based on Deep Learning (항공 LiDAR 및 RGB 정사 영상을 이용한 딥러닝 기반의 도시녹지 분류)

  • SON, Bokyung;LEE, Yeonsu;IM, Jungho
    • Journal of the Korean Association of Geographic Information Studies, v.24 no.3, pp.83-98, 2021
  • Urban green space is an important component for enhancing urban ecosystem health, so identifying its spatial structure is required to manage a healthy urban ecosystem. The Ministry of Environment has provided the level 3 land cover map (the map with the highest spatial resolution, 1 m) with a total of 41 classes since 2010. However, specific urban green features such as street trees are identified in the map merely as grassland, or are not classified as vegetated area at all. Therefore, this study classified detailed urban green information (i.e., tree, shrub, and grass) not included in the existing level 3 land cover map, using two types of high-resolution (<1 m) remote sensing data (i.e., airborne LiDAR and RGB ortho imagery) in Suwon, South Korea. U-Net, a deep learning approach to image segmentation, was adopted to classify detailed urban green space. Three classification models (LRGB10, LRGB5, and RGB5) were proposed, depending on the target number of classes and the types of input data. The average overall accuracies for the test sites were 83.40% (LRGB10), 89.44% (LRGB5), and 74.76% (RGB5). Among the three models, LRGB5, which uses both airborne LiDAR and RGB ortho imagery with 5 target classes (i.e., tree, shrub, grass, building, and others), performed best. The area ratio of total urban green space (based on tree, shrub, and grass information) for the whole of Suwon was 45.61% (LRGB10), 43.47% (LRGB5), and 44.22% (RGB5). On average, the models provided an additional 13.40% of urban tree information compared to the existing level 3 land cover map. Moreover, these urban green classification results are expected to be useful in various urban green studies and decision-making processes, as they provide detailed information on urban green space.

Characterizing Patterns of Experience of Harmful Shops among Adolescents Using Decision Tree Models (데이터마이닝을 이용한 청소년 유해업소 출입경험에 영향을 주는 요인)

  • Sohn, Aeree
    • Korean Journal of Health Education and Promotion, v.31 no.3, pp.15-26, 2014
  • Objective: This study was conducted to explore a predictive model of the experience of harmful shops among middle and high school students. Methods: The survey was conducted online with a self-administered questionnaire via the homepage of the education ministry's student health information center. Participants were 1,888 middle school students and 1,563 high school students from 107 schools in Korea. The collected data were processed using SPSS Classification Trees 18.0 and examined with a data mining decision tree model. Results: In this study, 6.9% of all subjects had been to harmful sex-industry shops and 81.8% to harmful game shops. Smoking, living with parents, and school grade were significant predictors of experience of harmful sex-industry shops. Perceived disruption of study, drinking, living with parents, stress, and satisfaction with school life were significant predictors of experience of harmful game shops. Conclusions: These results suggest that educational approaches tailored to these conditions should be developed to prevent access to harmful shops.

A comparison of three design tree based search algorithms for the detection of engineering parts constructed with CATIA V5 in large databases

  • Roj, Robin
    • Journal of Computational Design and Engineering, v.1 no.3, pp.161-172, 2014
  • This paper presents three different search engines for the detection of CAD parts in large databases. The contained information is analyzed by exporting the data stored in the structure trees of the CAD models. A preparation program generates one XML file for every model, which includes the data of the structure tree as well as certain physical properties of each part. The first search engine specializes in the discovery of standard parts, such as screws or washers. The second program uses user input as search parameters and can therefore perform personalized queries. The third compares a given reference part with all parts in the database and locates files that are identical or similar to the reference part. All approaches run automatically and have the analysis of the structure tree in common. Files constructed with CATIA V5 and search engines written in Python were used for the implementation. The paper also includes a short comparison of the advantages and disadvantages of each program, as well as a performance test.
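The third search engine described above compares a reference part with the XML exports of other parts. The abstract does not give the XML schema or the similarity measure, so the sketch below is only one plausible way to do it: the `<part>`/`<feature type="...">` layout is a hypothetical schema, and the multiset Jaccard score is a swapped-in measure, not the paper's method.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML exports of two structure trees (schema is an assumption).
REFERENCE_XML = """
<part name="reference">
  <feature type="Pad"/>
  <feature type="Hole"/>
  <feature type="Hole"/>
  <feature type="Chamfer"/>
</part>
"""

CANDIDATE_XML = """
<part name="candidate">
  <feature type="Pad"/>
  <feature type="Hole"/>
  <feature type="Fillet"/>
</part>
"""


def feature_counts(xml_text):
    """Count how often each feature type occurs in one structure tree."""
    root = ET.fromstring(xml_text.strip())
    return Counter(node.get("type") for node in root.iter("feature"))


def similarity(a, b):
    """Multiset Jaccard similarity: shared features / all features."""
    shared = sum((a & b).values())  # element-wise minimum of the counts
    total = sum((a | b).values())   # element-wise maximum of the counts
    return shared / total if total else 1.0


ref = feature_counts(REFERENCE_XML)
cand = feature_counts(CANDIDATE_XML)
score = similarity(ref, cand)
print(f"similarity to reference: {score:.2f}")
```

A score of 1.0 means the two trees contain the same features in the same quantities; a real comparison would probably also weigh the physical properties stored in each XML file.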

Multispectral Image Data Compression Using Classified Prediction and KLT in Wavelet Transform Domain

  • Kim, Tae-Su;Kim, Seung-Jin;Kim, Byung-Ju;Lee, Jong-Won;Kwon, Seong-Geun;Lee, Kuhn-Il
    • Proceedings of the IEEK Conference, 2002.07a, pp.204-207, 2002
  • The current paper proposes a new multispectral image data compression algorithm that can efficiently reduce spatial and spectral redundancies by applying classified prediction, a Karhunen-Loeve transform (KLT), and the three-dimensional set partitioning in hierarchical trees (3-D SPIHT) algorithm in the wavelet transform (WT) domain. The classification is performed in the WT domain to exploit the interband classified dependency, while the resulting class information is used for the interband prediction. The residual image data on the prediction errors between the original image data and the predicted image data are decorrelated by a KLT. Finally, the 3-D SPIHT algorithm is used to encode the transformed coefficients listed in a descending order spatially and spectrally as a result of the WT and KLT. Simulation results showed that images reconstructed with the proposed algorithm exhibited better quality and a higher compression ratio than those reconstructed with conventional algorithms.


An Empirical Analysis of Boosting of Neural Networks for Bankruptcy Prediction (부스팅 인공신경망학습의 기업부실예측 성과비교)

  • Kim, Myoung-Jong;Kang, Dae-Ki
    • Journal of the Korea Institute of Information and Communication Engineering, v.14 no.1, pp.63-69, 2010
  • Ensemble learning is one of the most widely used methods for improving the performance of classification and prediction models. Two popular ensemble methods, Bagging and Boosting, have been applied with great success to various machine learning problems, mostly using decision trees as base classifiers. This paper performs an empirical comparison of boosted neural networks and traditional neural networks on bankruptcy prediction tasks. Experimental results on Korean firms indicated that the boosted neural networks showed improved performance over traditional neural networks.
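To illustrate the boosting idea the abstract above refers to, here is a minimal AdaBoost sketch in plain Python. It is an assumption-laden illustration: it uses one-dimensional decision stumps as base classifiers in place of the neural networks studied in the paper, and the toy data are made up.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: `polarity` if x >= threshold, otherwise -polarity."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train AdaBoost; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Made-up data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
print(predictions)
```

No single threshold separates this toy set, but the weighted vote of three stumps does, which is the effect boosting aims for with any weak base learner.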

Data Mining for High Dimensional Data in Drug Discovery and Development

  • Lee, Kwan R.;Park, Daniel C.;Lin, Xiwu;Eslava, Sergio
    • Genomics & Informatics, v.1 no.2, pp.65-74, 2003
  • Data mining differs from traditional data analysis primarily on one important dimension, namely the scale of the data. That is why computer science principles as well as statistical ones are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, its scalability, its problem-driven nature, and its analysis of retrospective or observational data in contrast to experimentally designed data. At a high level one can identify three types of problems for which data mining is useful: description, prediction, and search. The brief review of data mining algorithms includes decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. Application areas covered are discovery compound libraries, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.

Data Base on Resources of Mushrooms in Korea

  • Cho, Duck-Hyun;Cho, Won-Kyung
    • Plant Resources, v.4 no.3, pp.153-156, 2001
  • Today, information is important to people and to every field, and science is no exception. In the current information age, information is useful to people and to industry as a whole. Broad and detailed bioinformation is therefore necessary for biodiversity, and it is important for the utilization of living things. Among living things, mushrooms (higher fungi) are an important part of the ecosystem as decomposers responsible for recycling materials. Many living things today, however, are endangered by environmental pollution and ecological destruction, and the higher fungi are no exception. Mushrooms have been used as food, medicine, and forest resources since ancient times. A mushroom database is therefore very necessary for universities, institutes, and industry. This DB contains four items on native mushrooms (higher fungi) of Korea. The first item contains the species, genus, family, order, class, and division according to the classification. The second item contains applications: pharmaceutical use, food source, cultivation, toxicity, and anti-cancer activity. The third item contains ecological resources: symbiosis and rotten trees. The fourth item contains geographical distribution and illustrated literature. The information system is also searchable on the web using KRISTAL II at http://ruby.kisti.re.kr/~mushroom.


Analysis of Some Desert Ecosystems Vegetation in Abu Dhabi Emirate, United Arab Emirates. Effect of Land Use

  • Mousa, Mohamed Taher;Ksiksi, Taoufik Salah
    • Journal of Forest and Environmental Science, v.25 no.1, pp.49-55, 2009
  • The present study analyses the effect of land use on the vegetation of some desert ecosystems in Abu Dhabi, United Arab Emirates (UAE). Three sites were selected to represent different types of land use: inside Umm Al-Banadeq forest, outside the forest, and along Abu Dhabi-Al Ain Trucks Road. In total, fifty-two stands were examined, giving a matrix of 14 species × 52 stands. Based on species cover data, stands were classified using TWINSPAN and ordinated using DCA. Four vegetation groups were generated at level three of the classification. Zygophyllum mandavillei was dominant in most vegetation groups; Heliotropium bacciferum dominated the vegetation groups inhabiting the forest. Species richness, species turnover, relative evenness, and relative concentration of dominance of the forest vegetation groups were 2.8, 5.7, 0.7, and 2.0, respectively. The differences were attributed to both natural variability and forestry-induced changes, including change in land use, drainage and ploughing, and shading by trees. The vegetation groups inhabiting Abu Dhabi-Al Ain Trucks Road, which were dominated by Haloxylon salicornicum and Zygophyllum mandavillei, had high total cover (8.8 m per $m^{-1}$). Most community and vegetation attributes were significantly higher inside the forest than outside. Human interventions and environmental factors affected the species diversity and abundance of these communities.


Machine Learning Approaches to Corn Yield Estimation Using Satellite Images and Climate Data: A Case of Iowa State

  • Kim, Nari;Lee, Yang-Won
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography, v.34 no.4, pp.383-390, 2016
  • Remote sensing data have been widely used in the estimation of crop yields by employing statistical methods such as regression models. Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. This paper describes corn yield estimation in the state of Iowa using four machine learning approaches: SVM (Support Vector Machine), RF (Random Forest), ERT (Extremely Randomized Trees), and DL (Deep Learning). Comparisons of their validation statistics are also presented. To examine the seasonal sensitivities of the corn yields, three period groups were set up: (1) MJJAS (May to September), (2) JA (July and August), and (3) OC (optimal combination of months). Overall, the DL method showed the highest accuracies in terms of the correlation coefficient for the three period groups. The accuracies were relatively favorable in the OC group, which indicates that the optimal combination of months can be significant in statistical modeling of crop yields. The differences between our predictions and USDA (United States Department of Agriculture) statistics were about 6-8%, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods.
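The abstract above evaluates models by the correlation coefficient and by the percentage difference from USDA statistics. A minimal sketch of both validation statistics follows, assuming Pearson's r and a mean absolute percentage difference; the abstract does not state the exact formulas, and the yield values below are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_abs_pct_diff(predicted, reference):
    """Mean absolute difference as a percentage of the reference values."""
    ratios = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up yields in arbitrary units (not values from the paper or USDA).
predicted = [150.0, 160.0, 170.0, 180.0]
reported = [140.0, 165.0, 175.0, 170.0]

r = pearson_r(predicted, reported)
diff = mean_abs_pct_diff(predicted, reported)
print(f"r = {r:.3f}, mean abs. difference = {diff:.1f}%")
```

The two statistics answer different questions: r measures how well the predictions track the year-to-year or county-to-county pattern, while the percentage difference measures how far off they are in magnitude.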