• Title/Summary/Keyword: Input Variable Selection

Search Result 67, Processing Time 0.068 seconds

The Prediction of DEA based Efficiency Rating for Venture Business Using Multi-class SVM (다분류 SVM을 이용한 DEA기반 벤처기업 효율성등급 예측모형)

  • Park, Ji-Young;Hong, Tae-Ho
    • Asia pacific journal of information systems
    • /
    • v.19 no.2
    • /
    • pp.139-155
    • /
    • 2009
  • For the last few decades, many studies have tried to explore and unveil venture companies' success factors and unique features in order to identify the sources of such companies' competitive advantages over their rivals. Such venture companies have shown tendency to give high returns for investors generally making the best use of information technology. For this reason, many venture companies are keen on attracting avid investors' attention. Investors generally make their investment decisions by carefully examining the evaluation criteria of the alternatives. To them, credit rating information provided by international rating agencies, such as Standard and Poor's, Moody's and Fitch is crucial source as to such pivotal concerns as companies stability, growth, and risk status. But these types of information are generated only for the companies issuing corporate bonds, not venture companies. Therefore, this study proposes a method for evaluating venture businesses by presenting our recent empirical results using financial data of Korean venture companies listed on KOSDAQ in Korea exchange. In addition, this paper used multi-class SVM for the prediction of DEA-based efficiency rating for venture businesses, which was derived from our proposed method. Our approach sheds light on ways to locate efficient companies generating high level of profits. Above all, in determining effective ways to evaluate a venture firm's efficiency, it is important to understand the major contributing factors of such efficiency. Therefore, this paper is constructed on the basis of following two ideas to classify which companies are more efficient venture companies: i) making DEA based multi-class rating for sample companies and ii) developing multi-class SVM-based efficiency prediction model for classifying all companies. First, the Data Envelopment Analysis(DEA) is a non-parametric multiple input-output efficiency technique that measures the relative efficiency of decision making units(DMUs) using a linear programming based model. It is non-parametric because it requires no assumption on the shape or parameters of the underlying production function. DEA has been already widely applied for evaluating the relative efficiency of DMUs. Recently, a number of DEA based studies have evaluated the efficiency of various types of companies, such as internet companies and venture companies. It has been also applied to corporate credit ratings. In this study we utilized DEA for sorting venture companies by efficiency based ratings. The Support Vector Machine(SVM), on the other hand, is a popular technique for solving data classification problems. In this paper, we employed SVM to classify the efficiency ratings in IT venture companies according to the results of DEA. The SVM method was first developed by Vapnik (1995). As one of many machine learning techniques, SVM is based on a statistical theory. Thus far, the method has shown good performances especially in generalizing capacity in classification tasks, resulting in numerous applications in many areas of business, SVM is basically the algorithm that finds the maximum margin hyperplane, which is the maximum separation between classes. According to this method, support vectors are the closest to the maximum margin hyperplane. If it is impossible to classify, we can use the kernel function. In the case of nonlinear class boundaries, we can transform the inputs into a high-dimensional feature space, This is the original input space and is mapped into a high-dimensional dot-product space. Many studies applied SVM to the prediction of bankruptcy, the forecast a financial time series, and the problem of estimating credit rating, In this study we employed SVM for developing data mining-based efficiency prediction model. We used the Gaussian radial function as a kernel function of SVM. In multi-class SVM, we adopted one-against-one approach between binary classification method and two all-together methods, proposed by Weston and Watkins(1999) and Crammer and Singer(2000), respectively. In this research, we used corporate information of 154 companies listed on KOSDAQ market in Korea exchange. We obtained companies' financial information of 2005 from the KIS(Korea Information Service, Inc.). Using this data, we made multi-class rating with DEA efficiency and built multi-class prediction model based data mining. Among three manners of multi-classification, the hit ratio of the Weston and Watkins method is the best in the test data set. In multi classification problems as efficiency ratings of venture business, it is very useful for investors to know the class with errors, one class difference, when it is difficult to find out the accurate class in the actual market. So we presented accuracy results within 1-class errors, and the Weston and Watkins method showed 85.7% accuracy in our test samples. We conclude that the DEA based multi-class approach in venture business generates more information than the binary classification problem, notwithstanding its efficiency level. We believe this model can help investors in decision making as it provides a reliably tool to evaluate venture companies in the financial domain. For the future research, we perceive the need to enhance such areas as the variable selection process, the parameter selection of kernel function, the generalization, and the sample size of multi-class.

Incremental Ensemble Learning for The Combination of Multiple Models of Locally Weighted Regression Using Genetic Algorithm (유전 알고리즘을 이용한 국소가중회귀의 다중모델 결합을 위한 점진적 앙상블 학습)

  • Kim, Sang Hun;Chung, Byung Hee;Lee, Gun Ho
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.7 no.9
    • /
    • pp.351-360
    • /
    • 2018
  • The LWR (Locally Weighted Regression) model, which is traditionally a lazy learning model, is designed to obtain the solution of the prediction according to the input variable, the query point, and it is a kind of the regression equation in the short interval obtained as a result of the learning that gives a higher weight value closer to the query point. We study on an incremental ensemble learning approach for LWR, a form of lazy learning and memory-based learning. The proposed incremental ensemble learning method of LWR is to sequentially generate and integrate LWR models over time using a genetic algorithm to obtain a solution of a specific query point. The weaknesses of existing LWR models are that multiple LWR models can be generated based on the indicator function and data sample selection, and the quality of the predictions can also vary depending on this model. However, no research has been conducted to solve the problem of selection or combination of multiple LWR models. In this study, after generating the initial LWR model according to the indicator function and the sample data set, we iterate evolution learning process to obtain the proper indicator function and assess the LWR models applied to the other sample data sets to overcome the data set bias. We adopt Eager learning method to generate and store LWR model gradually when data is generated for all sections. In order to obtain a prediction solution at a specific point in time, an LWR model is generated based on newly generated data within a predetermined interval and then combined with existing LWR models in a section using a genetic algorithm. The proposed method shows better results than the method of selecting multiple LWR models using the simple average method. The results of this study are compared with the predicted results using multiple regression analysis by applying the real data such as the amount of traffic per hour in a specific area and hourly sales of a resting place of the highway, etc.

Construction and Application of Network Design System for Optimal Water Quality Monitoring in Reservoir (저수지 최적수질측정망 구축시스템 개발 및 적용)

  • Lee, Yo-Sang;Kwon, Se-Hyug;Lee, Sang-Uk;Ban, Yang-Jin
    • Journal of Korea Water Resources Association
    • /
    • v.44 no.4
    • /
    • pp.295-304
    • /
    • 2011
  • For effective water quality management, it is necessary to secure reliable water quality information. There are many variables that need to be included in a comprehensive practical monitoring network : representative sampling locations, suitable sampling frequencies, water quality variable selection, and budgetary and logistical constraints are examples, especially sampling location is considered to be the most important issues. Until now, monitoring network design for water quality management was set according to the qualitative judgments, which is a problem of representativeness. In this paper, we propose network design system for optimal water quality monitoring using the scientific statistical techniques. Network design system is made based on the SAS program of version 9.2 and configured with simple input system and user friendly outputs considering the convenience of users. It applies to Excel data format for ease to use and all data of sampling location is distinguished to sheet base. In this system, time plots, dendrogram, and scatter plots are shown as follows: Time plots of water quality variables are graphed for identifying variables to classify sampling locations significantly. Similarities of sampling locations are calculated using euclidean distances of principal component variables and dimension coordinate of multidimensional scaling method are calculated and dendrogram by clustering analysis is represented and used for users to choose an appropriate number of clusters. Scatter plots of principle component variables are shown for clustering information with sampling locations and representative location.

Genetic Programming based Manufacutring Big Data Analytics (유전 프로그래밍을 활용한 제조 빅데이터 분석 방법 연구)

  • Oh, Sanghoun;Ahn, Chang Wook
    • Smart Media Journal
    • /
    • v.9 no.3
    • /
    • pp.31-40
    • /
    • 2020
  • Currently, black-box-based machine learning algorithms are used to analyze big data in manufacturing. This algorithm has the advantage of having high analytical consistency, but has the disadvantage that it is difficult to interpret the analysis results. However, in the manufacturing industry, it is important to verify the basis of the results and the validity of deriving the analysis algorithms through analysis based on the manufacturing process principle. To overcome the limitation of explanatory power as a result of this machine learning algorithm, we propose a manufacturing big data analysis method using genetic programming. This algorithm is one of well-known evolutionary algorithms, which repeats evolutionary operators such as selection, crossover, mutation that mimic biological evolution to find the optimal solution. Then, the solution is expressed as a relationship between variables using mathematical symbols, and the solution with the highest explanatory power is finally selected. Through this, input and output variable relations are derived to formulate the results, so it is possible to interpret the intuitive manufacturing mechanism, and it is also possible to derive manufacturing principles that cannot be interpreted based on the relationship between variables represented by formulas. The proposed technique showed equal or superior performance as a result of comparing and analyzing performance with a typical machine learning algorithm. In the future, the possibility of using various manufacturing fields was verified through the technique.

Development of Bond Strength Model for FRP Plates Using Back-Propagation Algorithm (역전파 학습 알고리즘을 이용한 콘크리트와 부착된 FRP 판의 부착강도 모델 개발)

  • Park, Do-Kyong
    • Journal of the Korea institute for structural maintenance and inspection
    • /
    • v.10 no.2
    • /
    • pp.133-144
    • /
    • 2006
  • In order to catch out such Bond Strength, the preceding researchers had ever examined the Bond Strength of FRP Plate through their experimentations by setting up of various fluent. However, since the experiment for research on such Bond Strength takes much of expenditure for equipment structure and time-consuming, also difficult to carry out, it is conducting limitedly. This Study purposes to develop the most suitable Artificial Neural Network Model by application of various Neural Network Model and Algorithm to the adhering experiment data of the preceding researchers. Output Layer of Artificial Neural Network Model, and Input Layer of Bond Strength were performed the learning by selection as the variable of the thickness, width, adhered length, the modulus of elasticity, tensile strength, and the compressive strength of concrete, tensile strength, width, respectively. The developed Artificial Neural Network Model has applied Back-Propagation, and its error was learnt to be converged within the range of 0.001. Besides, the process for generalization has dissolved the problem of Over-Fitting in the way of more generalized method by introduction of Bayesian Technique. The verification on the developed Model was executed by comparison with the resulted value of Bond Strength made by the other preceding researchers which was never been utilized to the learning as yet.

WQI Class Prediction of Sihwa Lake Using Machine Learning-Based Models (기계학습 기반 모델을 활용한 시화호의 수질평가지수 등급 예측)

  • KIM, SOO BIN;LEE, JAE SEONG;KIM, KYUNG TAE
    • The Sea:JOURNAL OF THE KOREAN SOCIETY OF OCEANOGRAPHY
    • /
    • v.27 no.2
    • /
    • pp.71-86
    • /
    • 2022
  • The water quality index (WQI) has been widely used to evaluate marine water quality. The WQI in Korea is categorized into five classes by marine environmental standards. But, the WQI calculation on huge datasets is a very complex and time-consuming process. In this regard, the current study proposed machine learning (ML) based models to predict WQI class by using water quality datasets. Sihwa Lake, one of specially-managed coastal zone, was selected as a modeling site. In this study, adaptive boosting (AdaBoost) and tree-based pipeline optimization (TPOT) algorithms were used to train models and each model performance was evaluated by metrics (accuracy, precision, F1, and Log loss) on classification. Before training, the feature importance and sensitivity analysis were conducted to find out the best input combination for each algorithm. The results proved that the bottom dissolved oxygen (DOBot) was the most important variable affecting model performance. Conversely, surface dissolved inorganic nitrogen (DINSur) and dissolved inorganic phosphorus (DIPSur) had weaker effects on the prediction of WQI class. In addition, the performance varied over features including stations, seasons, and WQI classes by comparing spatio-temporal and class sensitivities of each best model. In conclusion, the modeling results showed that the TPOT algorithm has better performance rather than the AdaBoost algorithm without considering feature selection. Moreover, the WQI class for unknown water quality datasets could be surely predicted using the TPOT model trained with satisfactory training datasets.

Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis (부도예측을 위한 KNN 앙상블 모형의 동시 최적화)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.139-157
    • /
    • 2016
  • Bankruptcy involves considerable costs, so it can have significant effects on a country's economy. Thus, bankruptcy prediction is an important issue. Over the past several decades, many researchers have addressed topics associated with bankruptcy prediction. Early research on bankruptcy prediction employed conventional statistical methods such as univariate analysis, discriminant analysis, multiple regression, and logistic regression. Later on, many studies began utilizing artificial intelligence techniques such as inductive learning, neural networks, and case-based reasoning. Currently, ensemble models are being utilized to enhance the accuracy of bankruptcy prediction. Ensemble classification involves combining multiple classifiers to obtain more accurate predictions than those obtained using individual models. Ensemble learning techniques are known to be very useful for improving the generalization ability of the classifier. Base classifiers in the ensemble must be as accurate and diverse as possible in order to enhance the generalization ability of an ensemble model. Commonly used methods for constructing ensemble classifiers include bagging, boosting, and random subspace. The random subspace method selects a random feature subset for each classifier from the original feature space to diversify the base classifiers of an ensemble. Each ensemble member is trained by a randomly chosen feature subspace from the original feature set, and predictions from each ensemble member are combined by an aggregation method. The k-nearest neighbors (KNN) classifier is robust with respect to variations in the dataset but is very sensitive to changes in the feature space. For this reason, KNN is a good classifier for the random subspace method. The KNN random subspace ensemble model has been shown to be very effective for improving an individual KNN model. The k parameter of KNN base classifiers and selected feature subsets for base classifiers play an important role in determining the performance of the KNN ensemble model. However, few studies have focused on optimizing the k parameter and feature subsets of base classifiers in the ensemble. This study proposed a new ensemble method that improves upon the performance KNN ensemble model by optimizing both k parameters and feature subsets of base classifiers. A genetic algorithm was used to optimize the KNN ensemble model and improve the prediction accuracy of the ensemble model. The proposed model was applied to a bankruptcy prediction problem by using a real dataset from Korean companies. The research data included 1800 externally non-audited firms that filed for bankruptcy (900 cases) or non-bankruptcy (900 cases). Initially, the dataset consisted of 134 financial ratios. Prior to the experiments, 75 financial ratios were selected based on an independent sample t-test of each financial ratio as an input variable and bankruptcy or non-bankruptcy as an output variable. Of these, 24 financial ratios were selected by using a logistic regression backward feature selection method. The complete dataset was separated into two parts: training and validation. The training dataset was further divided into two portions: one for the training model and the other to avoid overfitting. The prediction accuracy against this dataset was used to determine the fitness value in order to avoid overfitting. The validation dataset was used to evaluate the effectiveness of the final model. A 10-fold cross-validation was implemented to compare the performances of the proposed model and other models. To evaluate the effectiveness of the proposed model, the classification accuracy of the proposed model was compared with that of other models. The Q-statistic values and average classification accuracies of base classifiers were investigated. The experimental results showed that the proposed model outperformed other models, such as the single model and random subspace ensemble model.