• Title/Summary/Keyword: Classification and regression tree

A Study on the Effect of the Document Summarization Technique on the Fake News Detection Model (문서 요약 기법이 가짜 뉴스 탐지 모형에 미치는 영향에 관한 연구)

  • Shim, Jae-Seung;Won, Ha-Ram;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems, v.25 no.3, pp.201-220, 2019
  • Fake news has emerged as a significant issue over the last few years, igniting discussions and research on how to solve this problem. In particular, studies on automated fact-checking and fake news detection using artificial intelligence and text analysis techniques have drawn attention. Fake news detection research entails a form of document classification; thus, document classification techniques have been widely used in this type of research. However, document summarization techniques have received little attention in this field. At the same time, automatic news summarization services have become popular, and a recent study found that using news summarized through abstractive summarization strengthened the predictive performance of fake news detection models. Therefore, the need to study the integration of document summarization technology in the domestic news data environment has become evident. To examine the effect of extractive summarization on the fake news detection model, we first summarized news articles through extractive summarization. Second, we created a summarized news-based detection model. Finally, we compared our model with the full-text-based detection model. The study found that BPN (Back Propagation Neural Network) and SVM (Support Vector Machine) did not exhibit a large difference in performance; however, for DT (Decision Tree), the full-text-based model demonstrated somewhat better performance. In the case of LR (Logistic Regression), our model exhibited superior performance. Nonetheless, the results did not show a statistically significant difference between our model and the full-text-based model. Therefore, when the summary is applied, at least the core information of the fake news is preserved, and the LR-based model suggests a possibility of performance improvement.
This study features an experimental application of extractive summarization in fake news detection research by employing various machine-learning algorithms. The study's limitations are, essentially, the relatively small amount of data and the lack of comparison between various summarization technologies. Therefore, an in-depth analysis that applies various analytical techniques to a larger data volume would be helpful in the future.

A Study on the Effect of Network Centralities on Recommendation Performance (네트워크 중심성 척도가 추천 성능에 미치는 영향에 대한 연구)

  • Lee, Dongwon
    • Journal of Intelligence and Information Systems, v.27 no.1, pp.23-46, 2021
  • Collaborative filtering, which is often used in personalized recommendation, is recognized as a very useful technique for finding similar customers and recommending products to them based on their purchase history. However, traditional collaborative filtering has difficulty calculating similarity for new customers or products because it calculates similarities based on direct connections and common features among customers. For this reason, hybrid techniques were designed that also use content-based filtering. Meanwhile, efforts have been made to solve these problems by applying the structural characteristics of social networks. This approach calculates similarities indirectly, through the similar customers placed between two customers. That is, a customer network is created from purchasing data, and the similarity between two customers is calculated from the features of the network that indirectly connects them. Such similarity can be used as a measure to predict whether the target customer accepts recommendations. The centrality metrics of networks can be utilized for the calculation of these similarities. Different centrality metrics are important in that they may have different effects on recommendation performance. Furthermore, in this study, the effect of these centrality metrics on recommendation performance may vary depending on the recommender algorithm. In addition, recommendation techniques using network analysis can be expected to increase recommendation performance not only for new customers or products but also for all customers or products. By considering a customer's purchase of an item as a link generated between the customer and the item on the network, the prediction of user acceptance of a recommendation is solved as a prediction of whether a new link will be created between them.
Because classification models fit the binary problem of whether a link is created or not, decision tree, k-nearest neighbors (KNN), logistic regression, artificial neural network, and support vector machine (SVM) models are selected in this research. The data for performance evaluation were order data collected from an online shopping mall over four years and two months. The first three years and eight months of records were organized into the social network used in the experiment. The next four months' records were used to train and evaluate the recommender models. Experiments with the centrality metrics applied to each model show that the recommendation acceptance rates of the centrality metrics differ for each algorithm at a meaningful level. In this work, we analyzed only four commonly used centrality metrics: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Eigenvector centrality records the lowest performance in all models except the support vector machine. Closeness centrality and betweenness centrality show similar performance across all models. Degree centrality ranks in the middle across all models, while betweenness centrality always ranks higher than degree centrality. Finally, closeness centrality is characterized by distinct differences in performance according to the model. It ranks first in logistic regression, artificial neural network, and decision tree, with numerically high performance. However, it ranks very low in the support vector machine and KNN, with low performance levels. As the experimental results reveal, in a classification model, network centrality metrics over a subnetwork that connects two nodes can effectively predict the connectivity between those nodes in a social network. Furthermore, each metric performs differently depending on the type of classification model.
This result implies that choosing appropriate metrics for each algorithm can lead to higher recommendation performance. In general, betweenness centrality can guarantee a high level of performance in any model. Introducing closeness centrality could be considered to obtain higher performance for certain models.
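
As a rough illustration of two of the four centrality metrics named in this abstract, the sketch below computes degree and closeness centrality on an unweighted, connected graph using only the Python standard library. The toy graph and the function names are assumptions made for this example; the abstract does not describe the author's implementation or the subnetwork construction in enough detail to reproduce them.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances (BFS)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical toy network (a simple path a - b - c - d), not data from the study.
toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
degree = degree_centrality(toy_graph)
closeness = closeness_centrality(toy_graph)
```

In a link-prediction setting like the one described, values such as these would serve as input features to a classifier; how the study aggregates them over the subnetwork joining two customers is not stated here.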

A Hybrid Under-sampling Approach for Better Bankruptcy Prediction (부도예측 개선을 위한 하이브리드 언더샘플링 접근법)

  • Kim, Taehoon;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems, v.21 no.2, pp.173-190, 2015
  • The purpose of this study is to improve bankruptcy prediction models by using a novel hybrid under-sampling approach. Most prior studies have tried to enhance the accuracy of bankruptcy prediction models by improving the classification methods involved. In contrast, we focus on appropriate data preprocessing as a means of enhancing accuracy. In particular, we aim to develop an effective sampling approach for bankruptcy prediction, since most prediction models suffer from class imbalance problems. The approach proposed in this study is a hybrid under-sampling method that combines the k-Reverse Nearest Neighbor (k-RNN) and one-class support vector machine (OCSVM) approaches. k-RNN can effectively eliminate outliers, while OCSVM contributes to the selection of informative training samples from majority class data. To validate our proposed approach, we have applied it to data from H Bank's non-external auditing companies in Korea, and compared the performances of the classifiers with the proposed under-sampling and random sampling data. The empirical results show that the proposed under-sampling approach generally improves the accuracy of classifiers, such as logistic regression, discriminant analysis, decision tree, and support vector machines. They also show that the proposed under-sampling approach reduces the risk of false negative errors, which lead to higher misclassification costs.
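
The outlier-elimination role attributed to k-RNN above can be sketched as follows: for each sample, count how many other samples list it among their k nearest neighbors, and treat samples with a count of zero as outlier candidates. This is a minimal sketch of the general reverse-nearest-neighbor idea under that assumption; the function name, the zero-count rule, and the toy points are illustrative and are not taken from the paper, and the OCSVM selection step is not shown.

```python
import math


def reverse_neighbor_counts(points, k):
    """For each point, count how many other points have it among their k nearest."""
    counts = [0] * len(points)
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        for j in others[:k]:
            counts[j] += 1
    return counts


# Hypothetical majority-class samples: a tight cluster plus one distant point.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
counts = reverse_neighbor_counts(samples, k=2)

# Assumed rule for this sketch: a point that nobody selects is an outlier candidate.
outlier_indices = [i for i, count in enumerate(counts) if count == 0]
```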

Calpain-10 SNP43 and SNP19 Polymorphisms and Colorectal Cancer: a Matched Case-control Study

  • Hu, Xiao-Qin;Yuan, Ping;Luan, Rong-Sheng;Li, Xiao-Ling;Liu, Wen-Hui;Feng, Fei;Yan, Jin;Yang, Yan-Fang
    • Asian Pacific Journal of Cancer Prevention, v.14 no.11, pp.6673-6680, 2013
  • Objective: Insulin resistance (IR) is an established risk factor for colorectal cancer (CRC). Given that CRC and IR physiologically overlap and the calpain-10 gene (CAPN10) is a candidate for IR, we explored the association between CAPN10 and CRC risk. Methods: Blood samples of 400 case-control pairs were genotyped, and the lifestyle and dietary habits of these pairs were recorded. Unconditional logistic regression (LR) was used to assess the effects of CAPN10 SNP43 and SNP19, and environmental factors. Both generalized multifactor dimensionality reduction (GMDR) and the classification and regression tree (CART) were used to test gene-environment interactions for CRC risk. Results: The GA+AA genotype of SNP43 and the Del/Ins+Ins/Ins genotype of SNP19 were marginally related to CRC risk (GA+AA: OR = 1.35, 95% CI = 0.92-1.99; Del/Ins+Ins/Ins: OR = 1.31, 95% CI = 0.84-2.04). Notably, a high-order interaction was consistently identified by GMDR and CART analyses. In GMDR, the four-factor interaction model of SNP43, SNP19, red meat consumption, and smoked meat consumption was the best model, with a maximum cross-validation consistency of 10/10 and testing balance accuracy of 0.61 (P < 0.01). In LR, subjects with high red and smoked meat consumption and two risk genotypes had a 6.17-fold CRC risk (95% CI = 2.44-15.6) relative to that of subjects with low red and smoked meat consumption and null risk genotypes. In CART, individuals with high smoked and red meat consumption, SNP19 Del/Ins+Ins/Ins, and SNP43 GA+AA had higher CRC risk (OR = 4.56, 95% CI = 1.94-10.75) than those with low smoked and red meat consumption. Conclusions: Although the single loci of CAPN10 SNP43 and SNP19 are not enough to significantly increase CRC susceptibility, the combination of SNP43, SNP19, red meat consumption, and smoked meat consumption is associated with elevated risk.
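
To make the reported odds ratios and 95% confidence intervals easier to read, the sketch below computes an unadjusted odds ratio from a 2x2 table with the standard log-odds (Woolf) interval. It is only an illustration: the study itself used logistic regression on matched pairs, which this sketch does not reproduce, and the counts and function name are hypothetical.

```python
import math


def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls
    )
    # Standard error of log(OR) under the Woolf approximation.
    standard_error = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls
    )
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * standard_error)
    upper = math.exp(log_or + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
odds_ratio, lower, upper = odds_ratio_with_ci(20, 10, 10, 20)

# An interval that excludes 1 indicates significance at roughly the 5% level.
significant = lower > 1.0 or upper < 1.0
```

Read this way, the single-locus intervals quoted in the abstract (for example 0.92-1.99) include 1, which matches the authors' description of those associations as marginal.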

An Optimized Combination of π-fuzzy Logic and Support Vector Machine for Stock Market Prediction (주식 시장 예측을 위한 π-퍼지 논리와 SVM의 최적 결합)

  • Dao, Tuanhung;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems, v.20 no.4, pp.43-58, 2014
  • As the use of trading systems has increased rapidly, many researchers have become interested in developing effective stock market prediction models using artificial intelligence techniques. Stock market prediction involves multifaceted interactions between market-controlling factors and unknown random processes. A successful stock prediction model achieves the most accurate result from minimum input data with the least complex model. In this research, we develop a model that combines π-fuzzy logic and a support vector machine (SVM), using a genetic algorithm to optimize the parameters of the SVM and π-fuzzy functions, as well as feature subset selection, to improve the performance of stock market prediction. To evaluate the proposed model, we compare its performance on the same data with that of other models, including logistic regression, multiple discriminant analysis, classification and regression tree, artificial neural network, SVM, and fuzzy SVM. The results show that our model outperforms all the comparative models in prediction accuracy as well as return on investment.

An Integrated Model based on Genetic Algorithms for Implementing Cost-Effective Intelligent Intrusion Detection Systems (비용효율적 지능형 침입탐지시스템 구현을 위한 유전자 알고리즘 기반 통합 모형)

  • Lee, Hyeon-Uk;Kim, Ji-Hun;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems, v.18 no.1, pp.125-141, 2012
  • These days, malicious attacks and hacks on networked systems are increasing dramatically, and their patterns are changing rapidly. Consequently, it is becoming more important to handle these attacks appropriately, and there is considerable interest in and demand for effective network security systems such as intrusion detection systems. Intrusion detection systems are network security systems for detecting, identifying, and responding appropriately to unauthorized or abnormal activities. Conventional intrusion detection systems have generally been designed using experts' implicit knowledge of network intrusions or hackers' abnormal behaviors. However, although they perform very well under normal conditions, they cannot handle new or unknown patterns of network attacks. As a result, recent studies on intrusion detection systems use artificial intelligence techniques, which can respond proactively to unknown threats. For a long time, researchers have adopted and tested various artificial intelligence techniques, such as artificial neural networks, decision trees, and support vector machines, to detect intrusions on the network. However, most of them have applied these techniques individually, even though combining the techniques may lead to better detection. For this reason, we propose a new integrated model for intrusion detection. Our model is designed to combine the prediction results of four different binary classification models, which may be complementary to each other: logistic regression (LOGIT), decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM). Genetic algorithms (GA) are used as a tool for finding the optimal combining weights. Our proposed model is built in two steps. In the first step, the optimal integration model, whose prediction error (i.e., erroneous classification rate) is the lowest, is generated.
In the second step, the model explores the optimal classification threshold for determining intrusions, which minimizes the total misclassification cost. To calculate the total misclassification cost of an intrusion detection system, we need to understand its asymmetric error cost scheme. Generally, there are two common forms of errors in intrusion detection. The first error type is the False-Positive Error (FPE), in which the wrong judgment may result in unnecessary corrective action. The second error type is the False-Negative Error (FNE), which mainly misjudges malicious programs as normal. Compared to FPE, FNE is more fatal. Thus, the total misclassification cost is affected more by FNE than by FPE. To validate the practical applicability of our model, we applied it to a real-world dataset for network intrusion detection. The experimental dataset was collected from the IDS sensor of an official institution in Korea from January to June 2010. We collected 15,000 log records in total and selected 10,000 samples from them by random sampling. We also compared the results from our model with the results from the single techniques to confirm the superiority of the proposed model. LOGIT and DT were run using PASW Statistics v18.0, and ANN was run using Neuroshell R4.0. For SVM, LIBSVM v2.90, a freeware package for training SVM classifiers, was used. Empirical results showed that our proposed GA-based model outperformed all the other comparative models in detecting network intrusions in terms of accuracy. They also showed that the proposed model outperformed all the other comparative models in terms of total misclassification cost. Consequently, it is expected that our study may contribute to building cost-effective intelligent intrusion detection systems.
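
The two-step design described above can be sketched as follows: combine the four models' intrusion probabilities with a set of weights, then scan candidate thresholds for the one with the lowest total misclassification cost when false negatives cost more than false positives. The weights, probabilities, labels, and cost values below are hypothetical, and a plain grid scan stands in for the search; in the paper the weights come from a genetic algorithm, which is not implemented here.

```python
def combine(probabilities, weights):
    """Weighted average of the per-model intrusion probabilities for one sample."""
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)


def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Sum the asymmetric costs of false positives and false negatives."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += fp_cost  # normal traffic flagged as an intrusion
        elif predicted == 0 and label == 1:
            cost += fn_cost  # intrusion missed
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost, steps=100):
    """Grid-scan thresholds in [0, 1] and keep the cheapest (first on ties)."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Hypothetical weights for (LOGIT, DT, ANN, SVM); a GA would supply these.
weights = [0.1, 0.2, 0.3, 0.4]
model_outputs = [
    [0.9, 0.8, 0.7, 0.9],  # clear intrusion
    [0.2, 0.1, 0.3, 0.2],  # clear normal
    [0.4, 0.5, 0.3, 0.4],  # intrusion that the models score low
    [0.6, 0.5, 0.6, 0.5],  # normal traffic that the models score high
]
labels = [1, 0, 1, 0]
scores = [combine(row, weights) for row in model_outputs]

# Assumed cost scheme for this sketch: a miss is ten times as costly as a false alarm.
threshold = best_threshold(scores, labels, fp_cost=1.0, fn_cost=10.0)
cost_at_best = total_cost(scores, labels, threshold, 1.0, 10.0)
cost_at_strict = total_cost(scores, labels, 0.6, 1.0, 10.0)
```

With these toy numbers the cheapest threshold is a low one that accepts a false alarm in order to avoid the far more expensive missed intrusion, which is the behavior an asymmetric cost scheme is meant to produce.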

A Study on the Changes of Land Use and Stand Volume around Mt. Kuem-O using Aerial Photographs (항공사진(航空寫眞)을 이용(利用)한 금오산(金烏山) 지역(地域)의 토지이용(土地利用) 및 임분재적(林分材積)의 변화(變化)에 관(關)한 연구(硏究))

  • Oh, Dong Ha;Kim, Kap Duk
    • Journal of Korean Society of Forest Science, v.79 no.4, pp.388-397, 1990
  • This study was conducted to investigate the changes in land use and stand volume around Mt. Kuem-O using B/W aerial photographs from 1979 and B/W infrared aerial photographs from 1988. The results obtained in this study were as follows: 1. In the classification of forest type on aerial photographs, coniferous stands showed a dark tone, while hardwood stands showed a light tone and irregularly rounded crowns. 2. In the classification of coniferous stands, Pinus densiflora showed narrow conical crowns with rounded tips and a rough texture, while Pinus rigida showed irregularly rounded, broadly conical crowns. 3. Regarding changes in forest land area, mixed forest changed into P. densiflora (687 ha), P. rigida (130 ha), and hardwood stands (219 ha). 4. The regression equations between crown diameter and DBH were significant at the 1% level by F-test in all stands, so the equation $D=a+b\,CD$ was used to estimate DBH. 5. The tree height curve equations were significant at the 1% level by F-test in all stands. To estimate tree height, the equation $\log H=\log a+b\log D$ was adopted for P. densiflora and L. leptolepis, and $H=a-bD+cD^2$ was adopted for P. rigida, hardwood stands, and mixed forest. 6. The highest volume per hectare was observed in L. leptolepis, and mixed forest showed the greatest growth percentage, while the lowest volume per hectare and growth percentage were observed in hardwood stands.
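
The logarithmic height curve in item 5 is linear in its logarithms, so its coefficients can be estimated by ordinary least squares on log-transformed diameter and height. The sketch below shows that fit using only the standard library. The function name and the synthetic measurements are assumptions for illustration; no coefficients from the study are given in the abstract.

```python
import math


def fit_log_height_curve(diameters, heights):
    """Fit log H = log a + b log D by least squares and return (a, b)."""
    xs = [math.log10(d) for d in diameters]
    ys = [math.log10(h) for h in heights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = 10 ** (mean_y - b * mean_x)
    return a, b


# Synthetic data generated from H = 2 * D**0.6, not measurements from the study.
diameters = [10.0, 20.0, 30.0, 40.0]
heights = [2.0 * d ** 0.6 for d in diameters]
a, b = fit_log_height_curve(diameters, heights)

# Predicted height for a tree whose diameter D is 25 (same units as the input).
predicted_height = a * 25.0 ** b
```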

Clinicoradiologic Characteristics of Intradural Extramedullary Conventional Spinal Ependymoma (경막내 척수외 뇌실막세포종의 임상 영상의학적 특징)

  • Seung Hyun Lee;Yoon Jin Cha;Yong Eun Cho;Mina Park;Bio Joo;Sang Hyun Suh;Sung Jun Ahn
    • Journal of the Korean Society of Radiology, v.84 no.5, pp.1066-1079, 2023
  • Purpose: Distinguishing intradural extramedullary (IDEM) spinal ependymoma from myxopapillary ependymoma is challenging due to the location of IDEM spinal ependymoma. This study aimed to investigate the utility of clinical and MR imaging features for differentiating between IDEM spinal and myxopapillary ependymomas. Materials and Methods: We compared tumor size, longitudinal/axial location, enhancement degree/pattern, tumor margin, signal intensity (SI) of the tumor on T2-weighted images and T1-weighted images (T1WI), increased cerebrospinal fluid (CSF) SI caudal to the tumor on T1WI, and CSF dissemination between 12 pathologically confirmed IDEM spinal ependymomas and 10 myxopapillary ependymomas. Furthermore, classification and regression tree (CART) analysis was performed to identify the clinical and MR features for differentiating between IDEM spinal and myxopapillary ependymomas. Results: Patients with IDEM spinal ependymomas were older than those with myxopapillary ependymomas (48 years vs. 29.5 years, p < 0.05). A high SI of the tumor on T1WI was more frequently observed in IDEM spinal ependymomas than in myxopapillary ependymomas (p = 0.02). Conversely, myxopapillary ependymomas showed CSF dissemination. Increased CSF SI caudal to the tumor on T1WI was observed more frequently in myxopapillary ependymomas than in IDEM spinal ependymomas (p < 0.05). Dissemination to the CSF space and increased CSF SI caudal to the tumor on T1WI were the most important variables in CART analysis. Conclusion: Clinical and radiological variables may help differentiate between IDEM spinal and myxopapillary ependymomas.

Studies on the Morphological, Physical and Chemical Properties of the Korean Forest soil in Relation to the Growth of Korean White Pine and Japanese Larch (한국산림토양의 형태학적 및 이화학적성질과 낙엽송, 잣나무의 성장(成長)에 관한 연구(硏究))

  • Chung, In-Koo
    • Korean Journal of Soil Science and Fertilizer, v.12 no.4, pp.189-213, 1980
  • 1. This study aims to supply basic information on tree species siting and forest fertilization by clarifying the soil properties demanded by each tree species, through studies of the morphological, physical, and chemical properties of forest soil in relation to tree growth in Korea. The necessary data, collected over the last 10 years, were quantified according to quantification theory and analyzed by multivariate analysis. 2. The test species, larch and Korean white pine, can be planted over extensive areas from the middle to the north of the temperate zone and are the two most recommended reforestation tree species in Korea. However, their respective site demands are not known, and during reforestation they have been confused or considered to demand the same site. When Korean white pine is planted in larch sites, it has shown relatively good growth. But when larch is planted in Korean white pine sites, its growth can hardly be called good. To understand this difference, soil factors were studied with the help of an electronic computer to see how the soil's morphological, physical, and chemical factors affect tree growth. 3. All the stands examined are man-made mature forests. From 294 larch plots and 259 white pine plots, dominant trees were cut as samples, and the site index was determined through stem analysis. For each site index, soil profiles were made in the related forest land for analysis. Soil samples were taken from each profile horizon, and forest-land productivity classification tables were worked out for each tree species through physical and chemical analysis of the soil samples, in order to study the relationships between tree growth and the physical, chemical, and combined physical/chemical properties of soil. 4. 
In the study of relationships between the physical properties of soil and tree growth, larch growth was found to be influenced by the following factors, in order: deposit form, soil depth, soil moisture, altitude, relief, soil type, depth of A-horizon, soil consistency, content of organic matter, soil texture, bedrock, gravel content, aspect, and slope. For Korean white pine, the order of influencing factors is soil type, soil consistency, bedrock, aspect, depth of A-horizon, soil moisture, altitude, relief, deposit form, soil depth, soil texture, gravel content, and slope. 5. In the study of relationships between the chemical properties of soil and tree growth, larch growth was found to be influenced by the following factors, in order: base saturation, organic matter, CaO, C/N ratio, effective $P_2O_5$, pH, exchangeable $K_2O$, T-N, MgO, CEC, total base, and Na. For Korean white pine, the order of influencing factors is effective $P_2O_5$, total base, T-N, Na, C/N ratio, pH, CaO, base saturation, organic matter, exchangeable $K_2O$, CEC, and MgO. 6. In the study of relationships between the combined physical and chemical properties of soil and tree growth, larch growth was found to be influenced by the following factors, in order: soil depth, deposit form, soil moisture, pH, relief, soil type, altitude, T-N, soil consistency, effective $P_2O_5$, soil texture, depth of A-horizon, total base, exchangeable $K_2O$, and base saturation. For Korean white pine, the order of influencing factors is soil type, soil consistency, aspect, effective $P_2O_5$, depth of A-horizon, exchangeable $K_2O$, soil moisture, total base, altitude, soil depth, base saturation, relief, T-N, C/N ratio, and deposit form. 7. In the multiple regression on the physical properties of forest soil, the correlation coefficient is 0.9272 for larch and 0.8996 for Korean white pine. With chemical properties, it is 0.7474 for larch and 0.7365 for Korean white pine. 
Thus, the soil's physical properties were found to be more closely related to tree growth than its chemical properties. However, this seems to be due to inadequate expression of the soil's chemical factors, and it is shown that the chemical properties are no less important than the physical properties. In the multiple regression on the combined physical and chemical properties, consisting of important morphological and physical factors as well as chemical factors of forest soils, the multiple correlation coefficient is 0.9434 for larch and 0.9103 for Korean white pine, the highest correlations obtained. 8. As shown by the partial correlation coefficients, larch needs deeper soil than Korean white pine, and in terms of deposit form, larch demands colluvial and creeping soils. For larch, soil moisture should range from adequately moist to too moist, and pH should be from 5.5 to 6.1. Larch has higher demands for T-N, soil texture, and soil nutrients than Korean white pine. Thus, soil depth, deposit form, relief, soil moisture, pH, N, altitude, and soil texture are good indicators for species siting with larch and Korean white pine, while soil type and soil consistency are of only limited use as indicators of species siting because of their wide variation as plantation environments. For larch siting, soil depth, deposit form, relief, soil moisture, pH, soil type, N, and soil texture are indicators of good growth; for Korean white pine, they are soil type, soil consistency, effective $P_2O_5$, and exchangeable $K_2O$, which Korean white pine generally demands more than larch. 9. Until now, the physical properties of soil have been known to affect tree growth to the greatest extent. However, this study shows through computer analysis that the chemical properties of soil are no less important for tree growth than the physical properties, and the site demands of larch and Korean white pine, which had so far been uncertain, could be clarified.

Analysis of Utilization Characteristics, Health Behaviors and Health Management Level of Participants in Private Health Examination in a General Hospital (일개 종합병원의 민간 건강검진 수검자의 검진이용 특성, 건강행태 및 건강관리 수준 분석)

  • Kim, Yoo-Mi;Park, Jong-Ho;Kim, Won-Joong
    • Journal of the Korea Academia-Industrial cooperation Society, v.14 no.1, pp.301-311, 2013
  • This study aims to analyze the utilization characteristics, health behaviors, and health management level of private health examination recipients in a general hospital. To achieve this, we analyzed 150,501 cases of private health examination data covering the 11 years from 2001 to 2011 for the 20,696 participants examined in 2011 at the health examination center of a general hospital in Dae-Jeon. K-means clustering with z-score standardization was used to classify the private health examination groups. Logistic regression, decision tree, and neural network analyses were used to build classification models of periodic/non-periodic private health examination. Among new private health examination participants, 1,000 people with a high probability of becoming non-periodic participants were selected as a target group for customer management. According to the results, private health examination participants were categorized into new, periodic, and non-periodic groups. New participants included more people aged 30-39 than other age groups and more patients suspected of having renal disease. Periodic participants included more males and more patients suspected of having hyperlipidemia. Non-periodic participants included more smokers and sedentary people and more patients suspected of having anemia and diabetes mellitus. In the decision tree, the variables related to non-periodic participants were sex, age, residence, exercise, anemia, hyperlipidemia, diabetes mellitus, obesity, and liver disease. In particular, 71.4% of non-periodic participants were female, non-anemic, non-exercising, and suspected of obesity. Operating a customized customer management program for private health examination will contribute to the efficiency of the health examination center.