• Title/Abstract/Keywords: classification tree

Search Result 930, Processing Time 0.029 seconds

Morphogenetic Identification of Eel's Larva (Leptocephalus) Collected by Set net in Namhae, Korea (남해 정치망에서 채집한 엽상자어(Leptocephalus)의 형태 및 유전학적 특성)

  • Chang-Gi Hong;Kyeong-Ho Han
    • Journal of Marine Life Science
    • /
    • v.8 no.2
    • /
    • pp.128-135
    • /
    • 2023
  • This study attempted to determine whether eel larvae (leptocephali) were closest to the conger (Conger myriaster), the pike conger (Muraenesox cinereus), or one of four species of Anguilla. The specimens were collected by set net in the gulf of enggang, Namhae, Korea, from May to June. Their morphological characteristics were compared with those of adult conger, pike conger, and four species of Anguilla. For genetic classification, DNA was isolated and amplified using 12S rRNA and 16S rRNA primer sets. The PCR products were sequenced directly in both directions, and the nucleotide sequences were analyzed with software. In the morphological measurements of the eel larvae, the ratios of head length and preanal length to total length were similar to those of the conger. The phylogenetic tree based on the nucleotide sequences also showed a close relationship to the conger. The eel larvae caught in Namhae from May to June were therefore identified as conger larvae.

Investigating the Performance of Bayesian-based Feature Selection and Classification Approach to Social Media Sentiment Analysis (소셜미디어 감성분석을 위한 베이지안 속성 선택과 분류에 대한 연구)

  • Chang Min Kang;Kyun Sun Eo;Kun Chang Lee
    • Information Systems Review
    • /
    • v.24 no.1
    • /
    • pp.1-19
    • /
    • 2022
  • Social media-based communication has become a crucial part of our personal and official lives. It is therefore no surprise that social media sentiment analysis has emerged as an important way for all kinds of companies to detect sentiment trends among potential customers. However, social media sentiment analysis suffers from the huge number of sentiment features obtained in the course of the analysis. To address this, this study proposes a novel method based on a Bayesian Network, in which MBFS (Markov Blanket-based Feature Selection) is used to reduce the number of sentiment features. To show the validity of the proposed model, we used online review data from Yelp, a well-known social media platform for evaluating and recommending restaurants, bars, and beauty salons. As benchmarks, we used several feature selection methods, such as correlation-based feature selection, information gain, and gain ratio. Several machine learning classifiers were also used for the validation tasks, such as TAN, NBN, Sons & Spouses BN (Bayesian Network), and Augmented Markov Blanket. Furthermore, we conducted a Bayesian Network-based what-if analysis to see how the knowledge map between the target node and related explanatory nodes could yield a meaningful glimpse into the sentiments underlying the target dataset.
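
Two of the benchmark feature selection methods named above, information gain and gain ratio, can be illustrated with a minimal sketch. The toy features (`has_great`, `noise`) and labels below are hypothetical and are not taken from the Yelp data; the functions follow the standard textbook definitions rather than any specific tool used in the study.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy after partitioning the data by a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: eight reviews with a binary sentiment label.
sentiment = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
has_great = [1, 1, 1, 0, 0, 0, 1, 0]  # separates the two classes perfectly
noise = [1, 1, 0, 1, 1, 0, 0, 0]      # each value covers 2 pos and 2 neg reviews

ig_great = information_gain(has_great, sentiment)
ig_noise = information_gain(noise, sentiment)
gr_great = gain_ratio(has_great, sentiment)
```

A feature selection step would rank features by one of these scores and keep the highest-ranked ones; here `has_great` scores far above `noise`.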

Verification Test of High-Stability SMEs Using Technology Appraisal Items (기술력 평가항목을 이용한 고안정성 중소기업 판별력 검증)

  • Jun-won Lee
    • Information Systems Review
    • /
    • v.20 no.4
    • /
    • pp.79-96
    • /
    • 2018
  • This study focuses on incorporating the technology appraisal model into the credit rating model, using those technology appraisal items that relate to the financial stability of enterprises, in order to increase the discriminative power of the credit rating model for all companies and not only for SMEs. It therefore aims to verify whether the technology appraisal model can be applied to identify high-stability SMEs in advance. We grouped companies by industry (manufacturing vs. non-manufacturing) and by company age (initial vs. non-initial), and defined a high-stability company as one whose average debt ratio over three years was less than half that of its group. The C5.0 algorithm was applied to verify the discriminative power of the model. The analysis showed that the importance of items at the sub-item level differs by industry and company age, but at the mid-item level R&D capability was a key variable for discriminating high-stability SMEs. In the early stage of establishment, funding capacity (diversification of funding methods, capital structure, and capital cost taking profitability into account) is an important variable for financial stability. As company age increases, however, technology development infrastructure, which enables continuous performance, becomes an important variable affecting financial stability. The classification accuracy of the model across company age and industry groups is 71~91%, confirming that high-stability SMEs can be identified using technology appraisal items.

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.143-163
    • /
    • 2016
  • The demographics of Internet users are the most basic and important source for target marketing and personalized advertising on digital marketing channels such as email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are in many cases anonymous. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, slow, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data shows which pages users visited, how long they stayed, how often and when they usually visited, which sites they prefer, what keywords they used to find a site, whether they purchased anything, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics, including search keywords; frequency and intensity by time, day, and month; variety of websites visited; and text information from the web pages visited. The demographic attributes predicted also vary from paper to paper and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, have been used to build prediction models. However, this research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model.
The objective of this study is to choose the clickstream attributes most likely to be correlated with the demographics, based on the results of previous research, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. From the results of previous research, 64 clickstream attributes are used to predict the demographic attributes. The overall process of predictive model building consists of four steps. In the first step, we create user profiles that include 64 clickstream attributes and 5 demographic attributes. The second step reduces the dimensionality of the clickstream variables to address the curse of dimensionality and the overfitting problem, using three approaches based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable using SVM, neural network, and logistic regression. The last step evaluates the alternative models in terms of accuracy and selects the best model. For the experiments, we used clickstream data representing 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for the prediction process, and 5-fold cross validation was conducted to enhance the reliability of the experiments. The experimental results confirm that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best with decision tree based dimension reduction and a neural network, whereas the prediction of gender and marital status is most accurate with SVM without dimension reduction.
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used to predict their demographics and thus be applied to digital marketing.
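
The last two steps of the process described above (building alternative models and selecting the best one by cross-validated accuracy) might look roughly like the following sketch. It is not the authors' SPSS Modeler workflow: the two candidate models (a majority-class baseline and a 1-nearest-neighbour rule) are simple stand-ins for SVM, neural network, and logistic regression, the dimension reduction step is omitted, and the two-feature user profiles are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_majority(X, y):
    """Stand-in model 1: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def fit_1nn(X, y):
    """Stand-in model 2: predict the label of the nearest training profile."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def cross_val_accuracy(fit, X, y, k=5):
    """Mean held-out accuracy over k folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


# Hypothetical profiles: two clickstream features per user, two demographic classes.
X = [(0.1 * (i % 5), 0.1 * (i % 7)) for i in range(50)] + \
    [(4 + 0.1 * (i % 5), 4 + 0.1 * (i % 7)) for i in range(50)]
y = ["A"] * 50 + ["B"] * 50

candidates = {"majority": fit_majority, "1-NN": fit_1nn}
scores = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
best = max(scores, key=scores.get)  # model with the highest 5-fold accuracy
```

Repeating this loop once per demographic attribute would give one selected model per attribute, which is the kind of comparison the abstract reports.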

An Analysis of Vegetation-Environment Relationships of Pinus densiflora for. erecta and Chunyang-type of Pinus densiflora Communities by TWINSPAN and DCCA (TWINSPAN과 DCCA에 의한 금강(金剛)소나무 및 춘양목(春陽木)소나무 군집(群集)과 환경(環境)의 상관관계(相關關係) 분석(分析))

  • Song, Ho Kyung;Kim, Seong Deog;Jang, Kyu Kwan
    • Journal of Korean Society of Forest Science
    • /
    • v.84 no.2
    • /
    • pp.266-274
    • /
    • 1995
  • Vegetation data from 62 quadrats of Pinus densiflora for. erecta and Chunyang-type Pinus densiflora forests were analyzed using two multivariate methods: TWo-way INdicator SPecies ANalysis (TWINSPAN) for classification and Detrended Canonical Correspondence Analysis (DCCA) for ordination. The dominant tree species of the Pinus densiflora for. erecta communities were, in order, Pinus densiflora for. erecta, Quercus mongolica, Quercus variabilis, Lindera obtusiloba, Fraxinus rhynchophylla, and Rhus trichocarpa. The dominant tree species of the Chunyang-type Pinus densiflora communities were Quercus variabilis, Quercus mongolica, Fraxinus sieboldiana, Styrax obassia, and Quercus serrata. According to TWINSPAN, the forest vegetation of Pinus densiflora was classified into Quercus variabilis-Styrax obassia, Quercus variabilis, Quercus variabilis-Quercus mongolica, and Quercus mongolica communities. The Pinus densiflora for. erecta community was distributed in areas rich in total nitrogen, organic matter, $K^+$, $Ca^{++}$, $Mg^{++}$, and cation exchange capacity, while the Chunyang-type Pinus densiflora community was distributed in areas rich in $P_2O_5$. The relationship between the distribution of the dominant communities and soil conditions in Pinus densiflora communities was investigated by analyzing the elevation and soil nutrition gradients. The Quercus mongolica community was distributed in high-elevation areas rich in total nitrogen, organic matter, and cation exchange capacity, while the Quercus variabilis community was distributed in low-elevation areas poor in total nitrogen, organic matter, and cation exchange capacity. The Quercus variabilis-Styrax obassia and Quercus variabilis-Quercus mongolica communities were distributed in areas of medium elevation and medium nutrition.


Optimal Selection of Classifier Ensemble Using Genetic Algorithms (유전자 알고리즘을 이용한 분류자 앙상블의 최적 선택)

  • Kim, Myung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.16 no.4
    • /
    • pp.99-112
    • /
    • 2010
  • Ensemble learning is a method for improving the performance of classification and prediction algorithms. It finds a highly accurate classifier on the training set by constructing and combining an ensemble of weak classifiers, each of which needs only to be moderately accurate on the training set. Ensemble learning has received considerable attention in the machine learning and artificial intelligence fields because of its remarkable performance improvement and its flexible integration with traditional learning algorithms such as decision tree (DT), neural networks (NN), and SVM. In that research, DT ensemble studies have consistently demonstrated impressive improvements in the generalization behavior of DT, while NN and SVM ensemble studies have not shown improvements as remarkable as those of DT ensembles. Recently, several works have reported that the performance of an ensemble can degrade when its classifiers are highly correlated with one another, which results in a multicollinearity problem. They have also proposed differentiated learning strategies to cope with this performance degradation. Hansen and Salamon (1990) argued that a necessary and sufficient condition for improving the performance of an ensemble is that it contain diverse classifiers. Breiman (1996) showed that ensemble learning can increase the performance of unstable learning algorithms but does not bring remarkable improvement to stable learning algorithms. Unstable learning algorithms such as decision tree learners are sensitive to changes in the training data, so small changes in the training data can yield large changes in the generated classifiers. Therefore, an ensemble of unstable learning algorithms can guarantee some diversity among the classifiers.
In contrast, stable learning algorithms such as NN and SVM generate similar classifiers despite small changes in the training data, so the correlation among the resulting classifiers is very high. This high correlation results in a multicollinearity problem, which degrades the performance of the ensemble. Kim's work (2009) compared the performance of traditional prediction algorithms such as NN, DT, and SVM in bankruptcy prediction for Korean firms. It reports that stable learning algorithms such as NN and SVM have higher predictability than the unstable DT, whereas with ensemble learning the DT ensemble improves more than the NN and SVM ensembles. Further analysis using the variance inflation factor (VIF) empirically shows that the performance degradation of the ensemble is due to the multicollinearity problem, and the work proposes that optimization of the ensemble is needed to cope with it. This paper proposes a hybrid system for coverage optimization of NN ensemble (CO-NN) in order to improve the performance of NN ensembles. Coverage optimization is a technique for choosing a sub-ensemble from an original ensemble so as to guarantee the diversity of its classifiers. CO-NN uses GA, which has been widely used for various optimization problems, to deal with the coverage optimization problem. The GA chromosomes for coverage optimization are encoded as binary strings, each bit of which indicates an individual classifier. The fitness function is defined as maximization of error reduction, and a constraint on the variance inflation factor (VIF), one of the most commonly used measures of multicollinearity, is added to ensure the diversity of classifiers by removing high correlation among them. We use Microsoft Excel and the GA software package Evolver.
Experiments on company failure prediction show that CO-NN is effective in stably enhancing the performance of NN ensembles by choosing classifiers with the correlations within the ensemble taken into account. Classifiers with a potential multicollinearity problem are removed by the coverage optimization process of CO-NN, and CO-NN thereby shows higher performance than a single NN classifier and the NN ensemble at the 1% significance level, and than the DT ensemble at the 5% significance level. However, further research issues remain. First, a decision optimization process for finding the optimal combination function should be considered. Second, various learning strategies for dealing with data noise should be introduced.
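
The binary-chromosome encoding described above can be sketched as follows. This is an illustrative toy, not the CO-NN system or the Evolver setup: the correctness vectors in `CORRECT` are hypothetical, the threshold `VIF_LIMIT` is an assumed value the abstract does not state, fitness is taken to be majority-vote accuracy as a stand-in for "error reduction", and the full regression-based VIF is replaced by the simpler two-variable form 1 / (1 - r^2) evaluated over pairs of selected classifiers.

```python
import random

# Hypothetical 0/1 correctness vectors for five classifiers on ten validation cases
# (1 = the classifier predicted that case correctly). Classifier 1 duplicates classifier 0.
CORRECT = [
    [1, 1, 1, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
]
VIF_LIMIT = 5.0  # assumed threshold, not taken from the source


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def max_pairwise_vif(members):
    """Largest two-variable VIF, 1 / (1 - r^2), over all pairs of selected classifiers."""
    worst = 1.0
    for i in range(len(members)):
        for j in range(i + 1, len(members)):
            r2 = pearson(CORRECT[members[i]], CORRECT[members[j]]) ** 2
            worst = max(worst, float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return worst


def fitness(chromosome):
    """Majority-vote accuracy of the selected sub-ensemble; 0 if empty or too collinear."""
    members = [i for i, bit in enumerate(chromosome) if bit]
    if not members or max_pairwise_vif(members) > VIF_LIMIT:
        return 0.0
    n_cases = len(CORRECT[0])
    hits = sum(sum(CORRECT[m][c] for m in members) * 2 > len(members) for c in range(n_cases))
    return hits / n_cases


def evolve(n_bits, generations=30, pop_size=12, seed=1):
    """Tiny GA: elitism, one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]  # carry the two best chromosomes over unchanged
        while len(next_pop) < pop_size:
            p1, p2 = rng.sample(pop[: pop_size // 2], 2)  # parents from the fitter half
            cut = rng.randrange(1, n_bits)                # one-point crossover position
            child = p1[:cut] + p2[cut:]
            child = [bit ^ (rng.random() < 0.1) for bit in child]  # flip each bit with p = 0.1
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


best = evolve(len(CORRECT))  # bit i set to 1 means classifier i is kept
```

In this toy data, the chromosome `[1, 0, 1, 1, 0]` selects three weakly correlated classifiers that are each 70% accurate but vote their way to 100%, while `[1, 1, 0, 0, 0]` is rejected because its two members are perfectly correlated.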

Vegetative Propagation and Morphological Characteristics of Amelanchier spp. with High Value as Fruit Tree for Landscaping (정원용 유실수로서 가치가 높은 채진목속(Amelanchier spp.)의 형태적 특성 및 영양번식방법)

  • Kang, Ho Chul;Hwang, Dae Yul;Ha, Yoo Mi
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.46 no.6
    • /
    • pp.111-119
    • /
    • 2018
  • This study was carried out to investigate the growth characteristics and propagation methods of the Korean native Amelanchier asiatica, A. arborea, and A. alnifolia as fruit trees for gardens. Due to the lack of recent research on Amelanchier spp., their superficial classification is still unclear and the names are used interchangeably. The results were as follows: A. arborea and A. alnifolia were globular, multi-stemmed shrubs. A 20-year-old tree of A. asiatica was 7.8m in height, with a 5.2m crown width and a single trunk. As for the morphological characteristics, leaves of A. asiatica were oblong, with an acuminate of, 6.1cm and 3.6cm width, but A. arborea and A. alnifolia had acute obovate leaves. The leaf size of A. alnifolia was the largest among the three species. The flowers of A. asiatica were bigger than those of A. arborea and A. alnifolia, and its petals and flower clusters were also the largest among the three species. A. asiatica began flowering on April 21 and bloomed for 24 days in Osan, while A. arborea and A. alnifolia began flowering on April 12 and bloomed for 22 days at the same location. The fruits of A. arborea and A. alnifolia were green on May 10~12, turned purplish red on May 24~26, and matured on June 1~3. The fruits of A. arborea and A. alnifolia persisted for 48~50 days. In contrast, A. asiatica showed greenish fruit on May 20, which became red on September 4 and had fallen by October 3. Fruit size was largest in A. arborea, at 1.03cm in height and 1.12cm in diameter, followed by A. alnifolia, and smallest in the native A. asiatica. Hardwood cuttings of A. arborea were difficult to root, with a rooting rate of 40%. In softwood cuttings, the rooting rate of A. arborea was increased by treatment with concentrated IBA, especially at 5,000 and 7,000ppm.
The optimum date for cutting was June 27, when the rooting rate was more than 80%. The most effective method for rooting A. arborea was rootone or 7,000 ppm IBA treatment of softwood cuttings taken on June 27, which gave a rooting rate of over 80%.

Spectral Band Selection for Detecting Fire Blight Disease in Pear Trees by Narrowband Hyperspectral Imagery (초분광 이미지를 이용한 배나무 화상병에 대한 최적 분광 밴드 선정)

  • Kang, Ye-Seong;Park, Jun-Woo;Jang, Si-Hyeong;Song, Hye-Young;Kang, Kyung-Suk;Ryu, Chan-Seok;Kim, Seong-Heon;Jun, Sae-Rom;Kang, Tae-Hwan;Kim, Gul-Hwan
    • Korean Journal of Agricultural and Forest Meteorology
    • /
    • v.23 no.1
    • /
    • pp.15-33
    • /
    • 2021
  • In this study, the possibility of discriminating Fire blight (FB) infection was tested using hyperspectral imagery. The reflectance of healthy and infected leaves and branches was acquired with 5 nm of full width at half maximum (FWHM) and then standardized to 10 nm, 25 nm, 50 nm, and 80 nm of FWHM. The standardized samples were divided into training and test sets at ratios of 7:3, 5:5, and 3:7 to find the optimal bands of FWHM by decision tree analysis. Classification accuracy was evaluated using overall accuracy (OA) and kappa coefficient (KC). The hyperspectral reflectance of infected leaves and branches was significantly lower than that of healthy ones in the green, red-edge (RE), and near-infrared (NIR) regions. The bands selected for the first node were generally 750 and 800 nm; these were used to identify the infection of leaves and branches, respectively. The accuracy of the classifier was higher at the 7:3 ratio. Four bands with 50 nm of FWHM (450, 650, 750, and 950 nm) might be reasonable because the difference in the recalculated accuracy between 8 bands with 10 nm of FWHM (440, 580, 640, 660, 680, 710, 730, and 740 nm) and the 4 bands was only 1.8% for OA and 4.1% for KC. Finally, adding two bands (550 nm and 800 nm with 25 nm of FWHM) to the four bands with 50 nm of FWHM is proposed to improve the usability of multispectral image sensors, which could then serve various roles in agriculture in addition to detecting FB through other combinations of spectral bands.
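
The two accuracy measures used above, overall accuracy (OA) and the kappa coefficient (KC), can both be computed from a confusion matrix. The sketch below uses the standard definitions (Cohen's kappa for KC); the 2x2 matrix of healthy versus infected samples is hypothetical and is not taken from the study.

```python
def overall_accuracy(matrix):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in matrix)
    p_observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical test-set results: rows = actual class, columns = predicted class,
# in the order (healthy, infected).
confusion = [
    [45, 5],
    [10, 40],
]

oa = overall_accuracy(confusion)  # 85 of 100 samples classified correctly
kc = cohen_kappa(confusion)       # chance agreement here is 0.5
```

Kappa is lower than OA because it discounts the agreement expected by chance, which is why comparisons between band sets can show a larger gap in KC than in OA.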

A prediction model for adolescents' skipping breakfast using the CART algorithm for decision trees: 7th (2016-2018) Korea National Health and Nutrition Examination Survey (의사결정나무 CART 알고리즘을 이용한 청소년 아침결식 예측 모형: 제7기 (2016-2018년) 국민건강영양조사 자료분석)

  • Sun A Choi;Sung Suk Chung;Jeong Ok Rho
    • Journal of Nutrition and Health
    • /
    • v.56 no.3
    • /
    • pp.300-314
    • /
    • 2023
  • Purpose: This study sought to predict the reasons for skipping breakfast by adolescents aged 13-18 years using the 7th Korea National Health and Nutrition Examination Survey (KNHANES). Methods: The participants included 1,024 adolescents. The data were analyzed using a complex-sample t-test, the Rao Scott χ2-test, and the classification and regression tree (CART) algorithm for decision tree analysis with SPSS v. 27.0. The participants were divided into two groups, one regularly eating breakfast and the other skipping it. Results: A total of 579 and 445 study participants were found to be breakfast consumers and breakfast skippers respectively. Breakfast consumers were significantly younger than those who skipped breakfast. In addition, breakfast consumers had a significantly higher frequency of eating dinner, had been taught about nutrition, and had a lower frequency of eating out. The breakfast skippers did so to lose weight. Children who skipped breakfast consumed less energy, carbohydrates, proteins, fats, fiber, cholesterol, vitamin C, vitamin A, calcium, vitamin B1, vitamin B2, phosphorus, sodium, iron, potassium, and niacin than those who consumed breakfast. The best predictor of skipping breakfast was identifying adolescents who sought to control their weight by not eating meals. Other participants who had low and middle-low household incomes, ate dinner 3-4 times a week, were more than 14.5 years old, and ate out once a day showed a higher frequency of skipping breakfast. Conclusion: Based on these results, nutrition education targeted at losing weight correctly and emphasizing the importance of breakfast, especially for adolescents, is required. Moreover, nutrition educators should consider designing and implementing specific action plans to encourage adolescents to improve their breakfast-eating practices by also eating dinner regularly and reducing eating out.
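
A cut-off such as "more than 14.5 years old" is typical of how CART handles a numeric predictor: candidate thresholds are midpoints between adjacent observed values, and the one with the lowest weighted Gini impurity is chosen. The following is a minimal sketch of that single-variable search, not the SPSS procedure used in the study; the ages and skip labels are hypothetical and are not KNHANES data.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one numeric variable."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical sample: age in whole years and whether breakfast is skipped (1 = skips).
ages = [13, 13, 14, 14, 15, 15, 16, 17, 18, 18]
skipped = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]

threshold, impurity = best_threshold(ages, skipped)
```

With ages recorded in whole years, the split between 14 and 15 is reported as 14.5, which is how a half-year threshold can appear in a tree built on integer ages.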

Exploring Branch Structure across Branch Orders and Species Using Terrestrial Laser Scanning and Quantitative Structure Model (지상형 라이다와 정량적 구조 모델을 이용한 분기별, 종별 나무의 가지 구조 탐구)

  • Seongwoo Jo;Tackang Yang
    • Korean Journal of Agricultural and Forest Meteorology
    • /
    • v.26 no.1
    • /
    • pp.31-52
    • /
    • 2024
  • Considering the significant relationship between a tree's branch structure and physiology, understanding the detailed branch structure is crucial for fields such as species classification and 3D tree modelling. Recently, terrestrial laser scanning (TLS) and quantitative structure model (QSM) have enhanced the understanding of branch structures by capturing the radius, length, and branching angle of branches. Previous studies examining branch structure with TLS and QSM often relied on mean or median of branch structure parameters, such as the radius ratio and length ratio in parent-child relationships, as representative values. Additionally, these studies have typically focused on the relationship between trunk and the first order branches. This study aims to explore the distribution of branch structure parameters up to the third order in Aesculus hippocastanum, Ginkgo biloba, and Prunus yedoensis. The gamma distribution best represented the distributions of branch structure parameters, as evidenced by the average of Kolmogorov-Smirnov statistics (radius = 0.048; length = 0.061; angle = 0.050). Comparisons of the mode, mean, and median were conducted to determine the most representative measure indicating the central tendency of branch structure parameters. The estimated distributions showed differences between the mode and mean (average of normalized differences for radius ratio = 11.2%; length ratio = 17.0%; branching angle = 8.2%), and between the mode and median (radius ratio = 7.5%; length ratio = 11.5%; branching angle = 5.5%). Comparisons of the estimated distributions across branch orders and species were conducted, showing variations across branch orders and species. This study suggests that examining the estimated distribution of the branch structure parameter offers a more detailed description of branch structure, capturing the central tendencies of branch structure parameters.
We also emphasize the importance of examining higher branch orders to gain a comprehensive understanding of branch structure, highlighting the differences across branch orders.
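
Why the mode, median, and mean of a gamma-distributed parameter differ can be seen in a short sketch. The shape and scale values below are hypothetical, and the normalization used for the difference (relative to the mean) is an assumption, since the abstract does not define it. The mean and mode have closed forms; the median does not, so it is estimated here by sampling.

```python
import random
import statistics


def gamma_mean(shape, scale):
    """Mean of a gamma distribution: shape * scale."""
    return shape * scale


def gamma_mode(shape, scale):
    """Mode of a gamma distribution: (shape - 1) * scale for shape >= 1, otherwise 0."""
    return (shape - 1) * scale if shape >= 1 else 0.0


# Hypothetical fitted parameters for a branch radius ratio.
shape, scale = 3.0, 0.2

mean = gamma_mean(shape, scale)
mode = gamma_mode(shape, scale)

# The gamma median has no closed form; estimate it from seeded random draws.
rng = random.Random(0)
samples = [rng.gammavariate(shape, scale) for _ in range(20000)]
median = statistics.median(samples)

# Assumed normalization: difference relative to the mean.
mode_vs_mean = (mean - mode) / mean
mode_vs_median = (median - mode) / mean
```

Because the gamma distribution is right-skewed, the mode sits below the median, which sits below the mean, so a summary based only on the mean or median overstates the most typical value.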