• Title/Summary/Keyword: decision tree and system analysis

Search Result 218, Processing Time 0.03 seconds

Prediction of commitment and persistence in heterosexual involvements according to the styles of loving using a datamining technique (데이터마이닝을 활용한 사랑의 형태에 따른 연인관계 몰입수준 및 관계 지속여부 예측)

  • Park, Yoon-Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.4
    • /
    • pp.69-85
    • /
    • 2016
  • Successful relationship with loving partners is one of the most important factors in life. In psychology, there have been some previous researches studying the factors influencing romantic relationships. However, most of these researches were performed based on statistical analysis; thus they have limitations in analyzing complex non-linear relationships or rules based reasoning. This research analyzes commitment and persistence in heterosexual involvement according to styles of loving using a datamining technique as well as statistical methods. In this research, we consider six different styles of loving - 'eros', 'ludus', 'stroge', 'pragma', 'mania' and 'agape' which influence romantic relationships between lovers, besides the factors suggested by the previous researches. These six types of love are defined by Lee (1977) as follows: 'eros' is romantic, passionate love; 'ludus' is a game-playing or uncommitted love; 'storge' is a slow developing, friendship-based love; 'pragma' is a pragmatic, practical, mutually beneficial relationship; 'mania' is an obsessive or possessive love and, lastly, 'agape' is a gentle, caring, giving type of love, brotherly love, not concerned with the self. In order to do this research, data from 105 heterosexual couples were collected. Using the data, a linear regression method was first performed to find out the important factors associated with a commitment to partners. The result shows that 'satisfaction', 'eros' and 'agape' are significant factors associated with the commitment level for both male and female. Interestingly, in male cases, 'agape' has a greater effect on commitment than 'eros'. On the other hand, in female cases, 'eros' is a more significant factor than 'agape' to commitment. In addition to that, 'investment' of the male is also crucial factor for male commitment. Next, decision tree analysis was performed to find out the characteristics of high commitment couples and low commitment couples. In order to build decision tree models in this experiment, 'decision tree' operator in the datamining tool, Rapid Miner was used. The experimental result shows that males having a high satisfaction level in relationship show a high commitment level. However, even though a male may not have a high satisfaction level, if he has made a lot of financial or mental investment in relationship, and his partner shows him a certain amount of 'agape', then he also shows a high commitment level to the female. In the case of female, a women having a high 'eros' and 'satisfaction' level shows a high commitment level. Otherwise, even though a female may not have a high satisfaction level, if her partner shows a certain amount of 'mania' then the female also shows a high commitment level. Finally, this research built a prediction model to establish whether the relationship will persist or break up using a decision tree. The result shows that the most important factor influencing to the break up is a 'narcissistic tendency' of the male. In addition to that, 'satisfaction', 'investment' and 'mania' of both male and female also affect a break up. Interestingly, while the 'mania' level of a male works positively to maintain the relationship, that of a female has a negative influence. The contribution of this research is adopting a new technique of analysis using a datamining method for psychology. In addition, the results of this research can provide useful advice to couples for building a harmonious relationship with each other. This research has several limitations. First, the experimental data was sampled based on oversampling technique to balance the size of each classes. Thus, it has a limitation of evaluating performances of the predictive models objectively. Second, the result data, whether the relationship persists of not, was collected relatively in short periods - 6 months after the initial data collection. Lastly, most of the respondents of the survey is in their 20's. In order to get more general results, we would like to extend this research to general populations.

Seoul Local Brand Alley Commercial Area Recommendation System Design Using Machine Learning (머신러닝 기반 서울시 로컬브랜드 골목상권 추천시스템 설계)

  • Jiyeon, Kim;Hyoseon, Jang;Minseo, Park
    • The Journal of the Convergence on Culture Technology
    • /
    • v.9 no.1
    • /
    • pp.101-109
    • /
    • 2023
  • According to data released by the Covid 19 Self-Employed Emergency Response Committee, 95.6% of small business sales due to Covid 19 have decreased over the past two years, and the damage has further increased due to social distancing for quarantine. However, as all social distancing guidelines have rebeen lifted, and the commercial district has been revitalized, the Seoul Metropolitan Government is pushing for a project to foster local brand commercial districts so that small business owners or prospective founders who have closed their businesses due to the prolonged COVID-19. Therefore, this study propose the model that recommends alley commercial districts suitable for founders among the five alley commercial districts selected for the project to foster local brand commercial districts in Seoul. The Seoul Metropolitan Government's local brand alley commercial recommendation system recommends major population age groups and major industries in the commercial district by combining the population perspective model using Xgboost and the commercial district characteristic model using Decision Tree.

A Study on the Insider Behavior Analysis Using Machine Learning for Detecting Information Leakage (정보 유출 탐지를 위한 머신 러닝 기반 내부자 행위 분석 연구)

  • Kauh, Janghyuk;Lee, Dongho
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.13 no.2
    • /
    • pp.1-11
    • /
    • 2017
  • In this paper, we design and implement PADIL(Prediction And Detection of Information Leakage) system that predicts and detect information leakage behavior of insider by analyzing network traffic and applying a variety of machine learning methods. we defined the five-level information leakage model(Reconnaissance, Scanning, Access and Escalation, Exfiltration, Obfuscation) by referring to the cyber kill-chain model. In order to perform the machine learning for detecting information leakage, PADIL system extracts various features by analyzing the network traffic and extracts the behavioral features by comparing it with the personal profile information and extracts information leakage level features. We tested various machine learning methods and as a result, the DecisionTree algorithm showed excellent performance in information leakage detection and we showed that performance can be further improved by fine feature selection.

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.4
    • /
    • pp.123-132
    • /
    • 2013
  • As the smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyros, ambient light sensor, proximity sensor, and so on, there have been many research works on making use of these sensors to create valuable applications. Human activity recognition is one such application that is motivated by various welfare applications such as the support for the elderly, measurement of calorie consumption, analysis of lifestyles, analysis of exercise patterns, and so on. One of the challenges faced when using the smartphone sensors for activity recognition is that the number of sensors used should be minimized to save the battery power. When the number of sensors used are restricted, it is difficult to realize a highly accurate activity recognizer or a classifier because it is hard to distinguish between subtly different activities relying on only limited information. The difficulty gets especially severe when the number of different activity classes to be distinguished is very large. In this paper, we show that a fairly accurate classifier can be built that can distinguish ten different activities by using only a single sensor data, i.e., the smartphone accelerometer data. The approach that we take to dealing with this ten-class problem is to use the ensemble of nested dichotomy (END) method that transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root of the binary tree, the set of all the classes are split into two subsets of classes by using a binary classifier. At a child node of the tree, a subset of classes is again split into two smaller subsets by using another binary classifier. Continuing in this way, we can obtain a binary tree where each leaf node contains a single class. This binary tree can be viewed as a nested dichotomy that can make multi-class predictions. Depending on how a set of classes are split into two subsets at each node, the final tree that we obtain can be different. Since there can be some classes that are correlated, a particular tree may perform better than the others. However, we can hardly identify the best tree without deep domain knowledge. The END method copes with this problem by building multiple dichotomy trees randomly during learning, and then combining the predictions made by each tree during classification. The END method is generally known to perform well even when the base learner is unable to model complex decision boundaries As the base classifier at each node of the dichotomy, we have used another ensemble classifier called the random forest. A random forest is built by repeatedly generating a decision tree each time with a different random subset of features using a bootstrap sample. By combining bagging with random feature subset selection, a random forest enjoys the advantage of having more diverse ensemble members than a simple bagging. As an overall result, our ensemble of nested dichotomy can actually be seen as a committee of committees of decision trees that can deal with a multi-class problem with high accuracy. The ten classes of activities that we distinguish in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classifying these activities include not only the magnitude of acceleration vector at each time point but also the maximum, the minimum, and the standard deviation of vector magnitude within a time window of the last 2 seconds, etc. For experiments to compare the performance of END with those of other methods, the accelerometer data has been collected at every 0.1 second for 2 minutes for each activity from 5 volunteers. Among these 5,900 ($=5{\times}(60{\times}2-2)/0.1$) data collected for each activity (the data for the first 2 seconds are trashed because they do not have time window data), 4,700 have been used for training and the rest for testing. Although 'Walking Uphill' is often confused with some other similar activities, END has been found to classify all of the ten activities with a fairly high accuracy of 98.4%. On the other hand, the accuracies achieved by a decision tree, a k-nearest neighbor, and a one-versus-rest support vector machine have been observed as 97.6%, 96.5%, and 97.6%, respectively.

A Study on RCM Approach to Catenary System of Electric Railway (전기철도 가공전차선로의 신뢰성 기반 유지보수(RCM)에 관한 연구)

  • Youn, Eung-Kyu;Choi, Kyu-Hyoung
    • The Transactions of The Korean Institute of Electrical Engineers
    • /
    • v.65 no.8
    • /
    • pp.1457-1465
    • /
    • 2016
  • A RCM approach to maintenance of the catenary system of electric railway is proposed. The proposed RCM approach provides a maintenance-oriented FMECA procedure to derive critical failure modes by analyzing failure effects and a RCM decision logic tree to suggest optimal maintenance works for the derived failure modes. By applying the proposed RCM procedures to the catenary system of high speed railway, it is suggested that strand breaks of dropper and voltage equalizing wire, and trolly wire wear-out are the critical failure modes for whom maintenance works based on condition monitoring should be applied instead of conventional time-based preventive maintenance. It is also proposed by reliability analysis that replacement time of dropper can be reduced from 18 years to 10 years. These results show that the proposed RCM approach can optimize the maintenance procedures of catenary system.

Development of a Detection Model for the Companies Designated as Administrative Issue in KOSDAQ Market (KOSDAQ 시장의 관리종목 지정 탐지 모형 개발)

  • Shin, Dong-In;Kwahk, Kee-Young
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.157-176
    • /
    • 2018
  • The purpose of this research is to develop a detection model for companies designated as administrative issue in KOSDAQ market using financial data. Administration issue designates the companies with high potential for delisting, which gives them time to overcome the reasons for the delisting under certain restrictions of the Korean stock market. It acts as an alarm to inform investors and market participants of which companies are likely to be delisted and warns them to make safe investments. Despite this importance, there are relatively few studies on administration issues prediction model in comparison with the lots of studies on bankruptcy prediction model. Therefore, this study develops and verifies the detection model of the companies designated as administrative issue using financial data of KOSDAQ companies. In this study, logistic regression and decision tree are proposed as the data mining models for detecting administrative issues. According to the results of the analysis, the logistic regression model predicted the companies designated as administrative issue using three variables - ROE(Earnings before tax), Cash flows/Shareholder's equity, and Asset turnover ratio, and its overall accuracy was 86% for the validation dataset. The decision tree (Classification and Regression Trees, CART) model applied the classification rules using Cash flows/Total assets and ROA(Net income), and the overall accuracy reached 87%. Implications of the financial indictors selected in our logistic regression and decision tree models are as follows. First, ROE(Earnings before tax) in the logistic detection model shows the profit and loss of the business segment that will continue without including the revenue and expenses of the discontinued business. Therefore, the weakening of the variable means that the competitiveness of the core business is weakened. If a large part of the profits is generated from one-off profit, it is very likely that the deterioration of business management is further intensified. As the ROE of a KOSDAQ company decreases significantly, it is highly likely that the company can be delisted. Second, cash flows to shareholder's equity represents that the firm's ability to generate cash flow under the condition that the financial condition of the subsidiary company is excluded. In other words, the weakening of the management capacity of the parent company, excluding the subsidiary's competence, can be a main reason for the increase of the possibility of administrative issue designation. Third, low asset turnover ratio means that current assets and non-current assets are ineffectively used by corporation, or that asset investment by corporation is excessive. If the asset turnover ratio of a KOSDAQ-listed company decreases, it is necessary to examine in detail corporate activities from various perspectives such as weakening sales or increasing or decreasing inventories of company. Cash flow / total assets, a variable selected by the decision tree detection model, is a key indicator of the company's cash condition and its ability to generate cash from operating activities. Cash flow indicates whether a firm can perform its main activities(maintaining its operating ability, repaying debts, paying dividends and making new investments) without relying on external financial resources. Therefore, if the index of the variable is negative(-), it indicates the possibility that a company has serious problems in business activities. If the cash flow from operating activities of a specific company is smaller than the net profit, it means that the net profit has not been cashed, indicating that there is a serious problem in managing the trade receivables and inventory assets of the company. Therefore, it can be understood that as the cash flows / total assets decrease, the probability of administrative issue designation and the probability of delisting are increased. In summary, the logistic regression-based detection model in this study was found to be affected by the company's financial activities including ROE(Earnings before tax). However, decision tree-based detection model predicts the designation based on the cash flows of the company.

Classification of False Alarms based on the Decision Tree for Improving the Performance of Intrusion Detection Systems (침입탐지시스템의 성능향상을 위한 결정트리 기반 오경보 분류)

  • Shin, Moon-Sun;Ryu, Keun-Ho
    • Journal of KIISE:Databases
    • /
    • v.34 no.6
    • /
    • pp.473-482
    • /
    • 2007
  • Network-based IDS(Intrusion Detection System) gathers network packet data and analyzes them into attack or normal. They raise alarm when possible intrusion happens. But they often output a large amount of low-level of incomplete alert information. Consequently, a large amount of incomplete alert information that can be unmanageable and also be mixed with false alerts can prevent intrusion response systems and security administrator from adequately understanding and analyzing the state of network security, and initiating appropriate response in a timely fashion. So it is important for the security administrator to reduce the redundancy of alerts, integrate and correlate security alerts, construct attack scenarios and present high-level aggregated information. False alarm rate is the ratio between the number of normal connections that are incorrectly misclassified as attacks and the total number of normal connections. In this paper we propose a false alarm classification model to reduce the false alarm rate using classification analysis of data mining techniques. The proposed model can classify the alarms from the intrusion detection systems into false alert or true attack. Our approach is useful to reduce false alerts and to improve the detection rate of network-based intrusion detection systems.

Discussion for Improvement of Decision System of Total Risk in Off-site Risk Assessment (화학사고 장외영향평가 제도의 종합위험도 결정 체계 개선을 위한 고찰)

  • Choi, Woosoo;Ryu, Taekwon;Kwak, Sollim;Lim, Hyeongjun;Jung, Jinhee;Lee, Jieun;Kim, Jungkon;Baek, Jongbae;Yoon, Junheon;Ryu, Jisung
    • Journal of Environmental Health Sciences
    • /
    • v.44 no.3
    • /
    • pp.217-226
    • /
    • 2018
  • Objectives: Despite the positive effects of Off-site risk assessment (ORA) system such as prevention of chemical accidents, some problems have been constantly raised. The purpose of this study is to analyze the problems that have occurred through the implementation of the ORA system for the past three years and to suggest reasonable directions for improvement in the future. Methods: In order to identify the problems with the methodology and procedure of ORA system, we analyzed statutes, administrative rules and documents related to the ORA system. A survey of ORA reviewers in National Institute of Chemical Safety was conducted to investigate the weight of determinants considered when judging the level of total risk in ORA. Results: In this study, we found out the uncertainty of the estimation of the number of people in the impact range in the procedure of the risk assessment of individual handling facilities, the lack of quantitative risk analysis methods for environmental receptors, and the ambiguity of the criteria for the total risk. In addition to suggesting solutions to the problems mentioned above, we also, suggested a decision tree for total risk in ORA. Conclusion: We anticipate that the solutions including the systematic decision tree for total risk suggested will contribute to the smooth operation of the ORA system.

A Design Solution for a Railway Switch Monitoring System (분기기 진단 시스템 설계에 관한 연구)

  • Choo, Eun-Sang;Kim, Min-Seong;Yoo, Heung-Yeol;Mo, Choong-Seon;Son, Eui-Sik;Park, Seongguen;Lee, Jong-Woo
    • Journal of the Korean Society for Railway
    • /
    • v.18 no.5
    • /
    • pp.439-446
    • /
    • 2015
  • The turnout system, which determines the direction of the train, is not only a key system but also a vulnerable system. Failure of this system may lead to a delay of the train or even casualties. In this light, it is necessary to precisely the conditions of the turnout system. Currently, ROADMASTER of Germany is used as a diagnostic system in Korea. However, a new diagnostic system should be developed for optimized operation of the turnout system with maintenance that is suitable for the Korean railway environment. In this paper, a Fault Tree Analysis for the representative faults of the turnout system is conducted and physical quantities, which can be the cause of the fault, are classified according to the component and function. Also, the measuring factors for the monitoring are derived and a decision making theory is suggested. On the basis of the results, we propose a new turnout diagnostic system that can provide more driverse and precise information than the conventional system.

Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Shin, Hyun-Kyung
    • Journal of Korea Multimedia Society
    • /
    • v.13 no.12
    • /
    • pp.1786-1797
    • /
    • 2010
  • Automatic understanding of contents in document image is a very hard problem due to involvement with mathematically challenging problems originated mainly from the over-determined system induced by document segmentation process. In both academic and industrial areas, there have been incessant and various efforts to improve core parts of content retrieval technologies by the means of separating out segmentation related issues using semi-structured document, e.g., invoice,. In this paper we proposed classification models for text lines on invoice document in which text lines were clustered into the five categories in accordance with their contents: purchase order header, invoice header, summary header, surcharge header, purchase items. Our investigation was concentrated on the performance of machine learning based models in aspect of linear-discriminant-analysis (LDA) and non-LDA (logic based). In the group of LDA, na$\"{\i}$ve baysian, k-nearest neighbor, and SVM were used, in the group of non LDA, decision tree, random forest, and boost were used. We described the details of feature vector construction and the selection processes of the model and the parameter including training and validation. We also presented the experimental results of comparison on training/classification error levels for the models employed.