• Title/Abstract/Keywords: classification trees

Search results: 313 items (processing time: 0.027 s)

Development and application of prediction model of hyperlipidemia using SVM and meta-learning algorithm

  • 이슬기;신택수
    • 지능정보연구 / Vol. 24, No. 2 / pp. 111-124 / 2018
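  • This study aims to develop a classification model that predicts the prevalence of hyperlipidemia, one of the chronic diseases. To this end, SVM and meta-learning algorithms were applied and their performance was compared. To improve the performance of each algorithm, only the significant variables chosen by variable selection methods were used as inputs, and these results were compared as well. The study used the 2012 data of the Korea Health Panel, and three methods were used for variable selection. First, stepwise regression was performed. Second, a decision tree algorithm was used. Finally, a genetic algorithm was used to select variables. Using the variables selected in this way, hyperlipidemia patient classification models built with SVM, meta-learning algorithms, and others were compared, and classification performance was analyzed using TP rate, precision, and related measures. The results are as follows. First, when all variables were used, the accuracy of SVM was 88.4% and that of the artificial neural network was 86.7%, so SVM was somewhat more accurate. Second, when only the variables selected by stepwise regression were used, the accuracy of each model was slightly higher than with all variables. Third, when only the three variables selected by the decision tree were used, the artificial neural network was more accurate than SVM. When the variables selected by the genetic algorithm were used, SVM reached a classification accuracy of 88.5% and the artificial neural network 87.9%. Finally, applying stacking, the meta-learning algorithm proposed in this study, in which the predictions of SVM and MLP serve as input variables to an SVM meta-classifier, gave the highest hyperlipidemia classification accuracy among the meta-learning algorithms.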
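The abstract above compares models by accuracy, TP rate, and precision. As a minimal sketch of how these measures follow from a confusion matrix, the Python below computes them for a binary classifier. The function name and the toy labels are illustrative assumptions, not data or code from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, TP rate (recall), and precision for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # TP rate: share of actual positives that were predicted positive.
        "tp_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Precision: share of predicted positives that were actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Made-up labels (1 = has the condition, 0 = does not), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting TP rate and precision alongside accuracy matters when one class is much rarer than the other, since accuracy alone can look high for a model that rarely predicts the positive class.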

Classification of Urban Green Space Using Airborne LiDAR and RGB Ortho Imagery Based on Deep Learning

  • 손보경;이연수;임정호
    • 한국지리정보학회지 / Vol. 24, No. 3 / pp. 83-98 / 2021
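  • Urban green space is an important element for improving the health of urban ecosystems, and maintaining and managing a healthy urban ecosystem requires knowing how urban green space is distributed spatially. Since 2010 the Ministry of Environment has provided a fine-level land cover map at 1 m resolution with 41 classes in total, but detailed high-resolution green space within cities, such as street trees, has been classified as other grassland or omitted. This study therefore used high-resolution remote sensing data finer than 1 m (airborne LiDAR and RGB ortho imagery) over Suwon City to classify detailed high-resolution urban green space (trees, shrubs, and grass) that does not appear in the existing fine-level land cover map. A U-Net model, a deep learning image segmentation method, was used for classification, and three models (LRGB10, LRGB5, and RGB5), which differ in the number of classes and the types of input data, were proposed and evaluated. The mean overall accuracy of the three models over the validation areas was 83.40% (LRGB10), 89.44% (LRGB5), and 74.76% (RGB5); the LRGB5 model, which uses airborne LiDAR together with RGB ortho imagery to classify five classes (trees, shrubs, grass, buildings, and other), performed best. Total green cover in Suwon City, counting trees, shrubs, and grass, was 45.61% (LRGB10), 43.47% (LRGB5), and 44.22% (RGB5), and all three models were found to provide on average 13.40% more urban tree information than the existing fine-level land cover map. In addition, when fused with existing GIS information such as the mid-level land cover map, these classification results can provide further detailed information, such as the proportion of street-tree green space, and are expected to serve as basic data for urban green space research and policy.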
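The two figures reported above, overall accuracy and the share of green cover, are both simple pixel counts over a classified map. The sketch below shows one way to compute them, assuming class labels stored as nested lists; the class names, grids, and function names are hypothetical and not taken from the paper.

```python
# Classes counted as green space in this sketch (assumed names).
GREEN_CLASSES = {"tree", "shrub", "grass"}


def overall_accuracy(predicted, reference):
    """Fraction of pixels whose predicted class matches the reference class."""
    pairs = [
        (p, r)
        for pred_row, ref_row in zip(predicted, reference)
        for p, r in zip(pred_row, ref_row)
    ]
    return sum(p == r for p, r in pairs) / len(pairs)


def green_fraction(label_map):
    """Fraction of pixels assigned to any green class."""
    cells = [c for row in label_map for c in row]
    return sum(c in GREEN_CLASSES for c in cells) / len(cells)


# Made-up 2 x 3 label maps, for illustration only.
reference = [["tree", "grass", "building"], ["shrub", "other", "building"]]
predicted = [["tree", "grass", "building"], ["grass", "other", "other"]]

print(overall_accuracy(predicted, reference))
print(green_fraction(predicted))
```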

Characterizing Patterns of Experience of Harmful Shops among Adolescents Using Decision Tree Models

  • 손애리
    • 보건교육건강증진학회지 / Vol. 31, No. 3 / pp. 15-26 / 2014
  • Objective: This study was conducted to explore a predictive model of the experience of harmful shops among middle and high school students. Methods: The survey was conducted online with a self-administered questionnaire via the homepage of the education ministry's student health information center. Participants were 1,888 middle school students and 1,563 high school students from 107 schools in Korea. The collected data were processed with the SPSS Classification Trees 18.0 program and examined using a data mining decision tree model. Results: In this study, 6.9% of all subjects were found to have visited harmful sex-industry places and 81.8% harmful game places. Smoking, living with parents, and school grade were significant predictors of experience of harmful sex-industry places. The perception that studies were disrupted, drinking, living with parents, stress, and satisfaction with school life were significant predictors of experience of harmful game places. Conclusions: These results suggest that an educational approach tailored to these conditions should be developed to prevent access to harmful shops.
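A decision tree finds predictors like those above by repeatedly choosing the variable that best separates the outcome. SPSS Classification Trees offers several growing methods (CHAID, CRT, QUEST), and the abstract does not say which was used, so the sketch below assumes a CART-style Gini criterion purely for illustration. The rows and variable names are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(rows, feature, target):
    """Impurity reduction from splitting the rows on one categorical feature."""
    parent = gini([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return parent - weighted


def best_feature(rows, features, target):
    """Feature with the largest impurity reduction (the root split)."""
    return max(features, key=lambda f: split_gain(rows, f, target))


# Made-up survey rows, for illustration only.
rows = [
    {"smoking": "yes", "lives_with_parents": "yes", "visited": "yes"},
    {"smoking": "yes", "lives_with_parents": "no", "visited": "yes"},
    {"smoking": "no", "lives_with_parents": "yes", "visited": "no"},
    {"smoking": "no", "lives_with_parents": "no", "visited": "no"},
    {"smoking": "no", "lives_with_parents": "yes", "visited": "no"},
    {"smoking": "yes", "lives_with_parents": "yes", "visited": "no"},
]
root = best_feature(rows, ["smoking", "lives_with_parents"], "visited")
print(root)
```

Applying the same selection recursively to each resulting group would grow the full tree; a CHAID-based tree would instead pick splits with a chi-square test.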

A comparison of three design tree based search algorithms for the detection of engineering parts constructed with CATIA V5 in large databases

  • Roj, Robin
    • Journal of Computational Design and Engineering / Vol. 1, No. 3 / pp. 161-172 / 2014
  • This paper presents three different search engines for the detection of CAD parts in large databases. The contained information is analyzed by exporting the data stored in the structure trees of the CAD models. A preparation program generates one XML file for every model, which, in addition to the data of the structure tree, also contains certain physical properties of each part. The first search engine specializes in the discovery of standard parts, such as screws or washers. The second program uses user input as search parameters and can therefore perform personalized queries. The third compares one given reference part with all parts in the database and locates files that are identical or similar to the reference part. All approaches run automatically and have the analysis of the structure tree in common. Files constructed with CATIA V5 and search engines written in Python were used for the implementation. The paper also includes a short comparison of the advantages and disadvantages of each program, as well as a performance test.
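The third search engine described above compares a reference part with every part in the database using the exported structure trees. The abstract gives neither the XML schema nor the similarity measure, so the sketch below assumes a hypothetical layout (`<feature type="..."/>` elements) and a multiset Jaccard score, only to illustrate the idea in Python.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def feature_counts(xml_text):
    """Count feature types in one exported structure tree (assumed schema)."""
    root = ET.fromstring(xml_text)
    return Counter(node.get("type") for node in root.iter("feature"))


def similarity(a, b):
    """Multiset Jaccard similarity between two feature-type counters."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 1.0


def find_similar(reference_xml, database, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    ref = feature_counts(reference_xml)
    hits = []
    for name, xml_text in database.items():
        score = similarity(ref, feature_counts(xml_text))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical exports, for illustration only.
REFERENCE = '<part><feature type="Pad"/><feature type="Hole"/><feature type="Fillet"/></part>'
DATABASE = {
    "bracket_a": '<part><feature type="Pad"/><feature type="Hole"/><feature type="Fillet"/></part>',
    "bracket_b": '<part><feature type="Pad"/><feature type="Hole"/><feature type="Hole"/></part>',
    "shaft": '<part><feature type="Shaft"/><feature type="Chamfer"/></part>',
}
print(find_similar(REFERENCE, DATABASE))
```

A score of 1.0 marks a part with the same feature inventory as the reference; a real comparison would presumably also use the physical properties the preparation program stores.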

Multispectral Image Data Compression Using Classified Prediction and KLT in Wavelet Transform Domain

  • Kim, Tae-Su;Kim, Seung-Jin;Kim, Byung-Ju;Lee, Jong-Won;Kwon, Seong-Geun;Lee, Kuhn-Il
    • 대한전자공학회:학술대회논문집 / 대한전자공학회 2002 ITC-CSCC -1 / pp. 204-207 / 2002
  • The current paper proposes a new multispectral image data compression algorithm that can efficiently reduce spatial and spectral redundancies by applying classified prediction, a Karhunen-Loeve transform (KLT), and the three-dimensional set partitioning in hierarchical trees (3-D SPIHT) algorithm in the wavelet transform (WT) domain. The classification is performed in the WT domain to exploit the interband classified dependency, while the resulting class information is used for the interband prediction. The residual image data, that is, the prediction errors between the original image data and the predicted image data, are decorrelated by a KLT. Finally, the 3-D SPIHT algorithm is used to encode the transformed coefficients, listed in descending order spatially and spectrally as a result of the WT and KLT. Simulation results showed that images reconstructed with the proposed algorithm exhibited better quality and a higher compression ratio than those obtained with conventional algorithms.

An Empirical Analysis of Boosting of Neural Networks for Bankruptcy Prediction

  • 김명종;강대기
    • 한국정보통신학회논문지 / Vol. 14, No. 1 / pp. 63-69 / 2010
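  • Among the various methods recently proposed in machine learning to improve classifier accuracy, ensemble learning is one of those receiving the most attention. However, while ensemble learning is highly effective at improving the performance of unstable learning algorithms such as decision trees, studies of its effect on stable learning algorithms such as artificial neural networks have reached conflicting conclusions depending on the application domain and the implementation method. This study applied an artificial neural network classifier and a boosting classifier, a representative ensemble learning technique, to the problem of predicting the bankruptcy of Korean companies, and verified that ensemble learning can improve on the performance of a conventional artificial neural network in corporate bankruptcy prediction.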
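Boosting trains base classifiers in sequence, giving more weight to the samples the previous classifier got wrong. The abstract does not state which boosting variant was used, so the sketch below assumes the classic discrete AdaBoost update for labels in {-1, +1}. The base learner (a neural network in the paper) is left out, and only the reweighting step is shown.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step.

    weights: current sample weights, assumed to sum to 1.
    y_true, y_pred: labels and base-classifier predictions in {-1, +1}.
    Returns the classifier weight alpha and the normalized new sample weights.
    """
    # Weighted error of the current base classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clamp to avoid division by zero or log(0) for perfect or useless learners.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for correct predictions and -1 for mistakes.
    raised = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raised)
    return alpha, [w / total for w in raised]


# Made-up example: five equally weighted samples, the last one misclassified.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the misclassified samples together carry half of the total weight, which forces the next base classifier to concentrate on them.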

Data Mining for High Dimensional Data in Drug Discovery and Development

  • Lee, Kwan R.;Park, Daniel C.;Lin, Xiwu;Eslava, Sergio
    • Genomics & Informatics / Vol. 1, No. 2 / pp. 65-74 / 2003
  • Data mining differs from traditional data analysis primarily on one important dimension, namely the scale of the data. That is why not only statistical but also computer science principles are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, scalability, its problem-driven nature, and its analysis of retrospective or observational data in contrast to experimentally designed data. At a high level one can identify three types of problems for which data mining is useful: description, prediction, and search. The brief review of data mining algorithms includes decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. Application areas covered are discovery compound libraries, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.

Data Base on Resources of Mushrooms in Korea

  • Cho, Duck-Hyun;Cho, Won-Kyung
    • Plant Resources / Vol. 4, No. 3 / pp. 153-156 / 2001
  • Information is important today for people and for every field, and science is no exception. In the current information age, information is useful to people and to industry as a whole, so broad and detailed bioinformation on biodiversity is needed. Bioinformation on biodiversity is also important for the utilization of living things. Among them, mushrooms (higher fungi) are an important part of the ecosystem as decomposers responsible for recycling materials. Today, however, many living things are endangered by environmental pollution and ecological destruction, and the higher fungi are no exception. Mushrooms have been used as food sources, medicines, and forest resources since ancient times. A mushroom database is therefore much needed by universities, institutes, and industry. This DB contains four items for native mushrooms (higher fungi) of Korea. The first item contains species, genus, family, order, class, and division according to the classification. The second item contains applications: pharmaceutical purpose, food source, culture, toxicity, and anti-cancer use. The third item contains ecological resources: symbiosis and rotten trees. The fourth item contains geographical distribution and illustrated literature. The information system can also be searched on the web using KRISTAL II at http://ruby.kisti.re.kr/~mushroom.

Analysis of Some Desert Ecosystems Vegetation in Abu Dhabi Emirate, United Arab Emirates. Effect of Land Use

  • Mousa, Mohamed Taher;Ksiksi, Taoufik Salah
    • Journal of Forest and Environmental Science / Vol. 25, No. 1 / pp. 49-55 / 2009
  • The present study analyses the effect of land use on the vegetation of some desert ecosystems in Abu Dhabi, United Arab Emirates (UAE). Three sites were selected to represent different types of land use: inside Umm Al-Banadeq forest, outside the forest, and along Abu Dhabi-Al Ain Trucks Road. In total, fifty-two stands were examined, giving a matrix of 14 species $\times$ 52 stands. Based on species cover data, stands were classified using TWINSPAN and ordinated using DCA. Four vegetation groups were generated at level three of the classification. Zygophyllum mandavillei was dominant in most vegetation groups; Heliotropium bacciferum dominated the vegetation groups inhabiting the forest. Species richness, species turnover, relative evenness, and relative concentration of dominance of the forest vegetation groups were 2.8, 5.7, 0.7, and 2.0, respectively. The differences were attributed to both natural variability and forestry-induced changes, including change in land use, drainage and ploughing, and shading by trees. The vegetation groups along Abu Dhabi-Al Ain Trucks Road, which were dominated by Haloxylon salicornicum and Zygophyllum mandavillei, had high total cover (8.8 m per $m^{-1}$). Most community and vegetation attributes were significantly higher inside the forest than outside. Human interventions and environmental factors affected the species diversity and abundance of these communities.

Machine Learning Approaches to Corn Yield Estimation Using Satellite Images and Climate Data: A Case of Iowa State

  • Kim, Nari;Lee, Yang-Won
    • 한국측량학회지 / Vol. 34, No. 4 / pp. 383-390 / 2016
  • Remote sensing data have been widely used to estimate crop yields with statistical methods such as regression models. Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. This paper describes corn yield estimation for the state of Iowa using four machine learning approaches: SVM (Support Vector Machine), RF (Random Forest), ERT (Extremely Randomized Trees), and DL (Deep Learning). Comparisons of their validation statistics are also presented. To examine the seasonal sensitivity of corn yields, three period groups were set up: (1) MJJAS (May to September), (2) JA (July and August), and (3) OC (optimal combination of months). Overall, the DL method showed the highest accuracy in terms of the correlation coefficient for the three period groups. Accuracy was relatively favorable in the OC group, which indicates that the optimal combination of months can be significant in statistical modeling of crop yields. The differences between the predictions and USDA (United States Department of Agriculture) statistics were about 6-8%, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods.
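The validation above rests on two statistics: the correlation coefficient between predicted and reported yields, and their percentage difference. The abstract does not give the exact formulas, so the sketch below assumes Pearson's r and a mean absolute percentage difference. The function names and yield values are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_abs_pct_diff(predicted, reported):
    """Mean absolute difference as a percentage of the reported values."""
    diffs = [abs(p - r) / r for p, r in zip(predicted, reported)]
    return 100.0 * sum(diffs) / len(diffs)


# Made-up yields in arbitrary units, for illustration only.
reported = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(pearson_r(predicted, reported))
print(mean_abs_pct_diff(predicted, reported))
```

The two measures answer different questions: a high r only says that predictions rise and fall with the reported yields, while the percentage difference captures how far off they are in magnitude.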