• Title/Summary/Keyword: administration information dataset


A Study on the Records Management for the National Assembly Members (국회의원 기록관리 방안 연구)

  • Kim, Jang-hwan
    • The Korean Journal of Archival Studies / no.55 / pp.39-71 / 2018
  • The purpose of this study is to examine the reality of the records management of the National Assembly members and suggest a desirable alternative. Until the Public Records Management Act was enacted in 1999, the level of records management in the National Assembly did not go beyond document management, in either the administration or the legislature. Rather, the National Assembly has maintained a records management tradition of systematically managing minutes and bills since the Constitutional Assembly. After the Act took effect in 2000, the National Assembly Records Management Regulation was enacted and enforced, and the Archives was established as a subsidiary organ of the Secretariat of the National Assembly, even though its establishment is not obligatory. In addition, an archivist was assigned as a records and archives researcher for the first time in Korea, whose role was to respond quickly to the records schedules of the National Assembly, making its service faster than that of the administration. However, the records management authority of the National Assembly Archives within the Secretariat was later greatly reduced, so the revision of the regulation to reflect the 2007 amendment of the Act was not completed until 2011. In the case of the National Assembly, the direct influence of the executive branch was insignificant; as a result, the records management innovation under the Roh Moo-Hyun administration had little positive influence on the Assembly. Even within the National Assembly, records management by its individual members has received little attention in both practice and theory. As the National Assembly members are excluded from the Act, there is no legal basis to enforce a records management method upon them. In this study, we analyze the records management problem of the National Assembly members, which mainly concerns the National Assembly records management plan established in the National Archives. Moreover, this study proposes three records management methods for the National Assembly members: the legislation and revision of regulations, records management consulting for the National Assembly members, and the transfer of datasets from administrative information systems and websites.

A Study on the impact of customer to customer interaction on customer value creation behavior (고객과 고객 간의 상호작용이 고객가치창출행동에 미치는 영향에 대한 연구)

  • Seo, Mun-Sik; Cho, Sang-Hyun
    • Management & Information Systems Review / v.37 no.2 / pp.169-185 / 2018
  • Customers are not merely responders but rather active value creators. As a result, most research related to customer value creation behavior has focused on customer participation behavior and on the interaction between service provider and customer. This study sets up a research model to examine the relationships among customer-to-customer interaction, brand attachment, and customer value creation behavior. For this study, the relationships among social support, C-to-C social interaction, similarity, brand attachment, and customer value creation behavior were modelled and used to validate our hypotheses. The path model was verified with structural equation modeling using survey data. The results of this study are summarized as follows. First, the customer-to-customer interaction factors, namely social support, C-to-C social interaction, and similarity, have effects on brand attachment; these effects were statistically significant, although some were dismissed in the hypothesis verification. Second, the structural model shows that brand attachment has a positive effect on customer value creation behavior. The findings suggest that managers need to identify and pay attention to positive customer-to-customer interaction in the service encounter so that it influences customer brand attachment and customer value creation behavior, which are competitive advantages of a service brand.
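The abstract does not name the SEM software used; as a minimal sketch of how such a path model could be specified and fitted, the snippet below uses the semopy package with hypothetical survey item names (ss1, ci1, sim1, ba1, cv1, and so on) and a hypothetical data file, not the authors' actual instrument or data.

```python
# Illustrative SEM sketch only: semopy and the item/file names are assumptions,
# not the authors' setup.
import pandas as pd
import semopy

data = pd.read_csv("c2c_survey.csv")  # hypothetical survey data, one Likert item per column

model_desc = """
# measurement model (hypothetical item names)
SocialSupport =~ ss1 + ss2 + ss3
C2CInteraction =~ ci1 + ci2 + ci3
Similarity =~ sim1 + sim2 + sim3
BrandAttachment =~ ba1 + ba2 + ba3
ValueCreation =~ cv1 + cv2 + cv3

# structural paths mirroring the hypotheses in the abstract
BrandAttachment ~ SocialSupport + C2CInteraction + Similarity
ValueCreation ~ BrandAttachment
"""

model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())           # path coefficients and p-values
print(semopy.calc_stats(model))  # fit indices such as CFI and RMSEA
```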

Predicting Stock Liquidity by Using Ensemble Data Mining Methods

  • Bae, Eun Chan; Lee, Kun Chang
    • Journal of the Korea Society of Computer and Information / v.21 no.6 / pp.9-19 / 2016
  • In the finance literature, stock liquidity, which indicates how easily stocks can be cashed out in the market, has received rich attention from both academics and practitioners. The reasons are plentiful. First, it is known that stock liquidity significantly affects asset pricing. Second, macroeconomic announcements influence liquidity in the stock market. Therefore, stock liquidity affects both investors' and managers' decisions. Though a great deal of work on stock liquidity exists in the finance literature, it is quite clear that no studies have attempted to investigate stock liquidity as a decision-making problem. Most stock liquidity studies in finance have dealt with limited questions, such as how much liquidity influences stock prices or which variables are significantly associated with describing it. However, this paper posits that stock liquidity may be treated as a serious decision-making problem and handled by using data mining techniques to estimate its future extent with statistical validity. In this sense, we collected a financial data set from a number of manufacturing companies listed on the KRX (Korea Exchange) during the period of 2010 to 2013. We selected data from 2010 onward to avoid the after-shocks of the financial crisis that occurred in 2008. We used the Fn-GuidPro system to gather a total of 5,700 financial records. The stock liquidity measure was computed by the procedure proposed by Amihud (2002), which is known to be among the best metrics for capturing the relationship with daily returns. We applied five data mining techniques (or classifiers): Bayesian networks, support vector machine (SVM), decision tree, neural network, and an ensemble method. The Bayesian networks include GBN (General Bayesian Network), NBN (Naive BN), and TAN (Tree Augmented NBN). The decision trees are CART and C4.5. A regression result was used as a benchmark. The ensemble method comes in two forms, integrating either two or three classifiers, and combines the classifiers by voting. Among the single classifiers, CART showed the best performance with 48.2%, compared with 37.18% by regression. Among the ensemble methods, the result from integrating TAN, CART, and SVM was best with 49.25%. Additional analysis of individual industries showed that relatively stable industries such as electronic appliances, wholesale and retailing, wood, and leather-bags-shoes achieved better performance, above 50%.
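As a rough sketch of the two computational building blocks mentioned above, the snippet below shows the Amihud (2002) illiquidity ratio (mean of absolute daily return divided by trading value) and a scikit-learn voting ensemble of three single classifiers. It uses synthetic data in place of the Fn-GuidPro dataset, and Gaussian naive Bayes stands in for the paper's Bayesian network (TAN), which scikit-learn does not provide.

```python
# Sketch under assumptions: synthetic prices/features replace the paper's data,
# and GaussianNB approximates the TAN Bayesian network.
import numpy as np
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART-style tree

def amihud_illiquidity(daily: pd.DataFrame) -> float:
    """Amihud (2002) illiquidity: mean of |daily return| / trading value in KRW."""
    ret = daily["close"].pct_change().abs()
    won_volume = daily["close"] * daily["volume"]
    return (ret / won_volume).mean()

# Hypothetical firm-level features X and liquidity-class labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 3, size=500)

# Voting ensemble of three single classifiers, mirroring the paper's best
# combination (TAN + CART + SVM) in spirit.
ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("cart", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("svm", SVC(kernel="rbf")),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```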

Development of a Detection Model for the Companies Designated as Administrative Issue in KOSDAQ Market (KOSDAQ 시장의 관리종목 지정 탐지 모형 개발)

  • Shin, Dong-In; Kwahk, Kee-Young
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.157-176 / 2018
  • The purpose of this research is to develop a detection model for companies designated as administrative issues in the KOSDAQ market using financial data. Administrative issue designation identifies companies with a high potential for delisting and gives them time to resolve the grounds for delisting under certain restrictions of the Korean stock market. It acts as an alarm that informs investors and market participants which companies are likely to be delisted and warns them to make safe investments. Despite this importance, there are relatively few studies on administrative issue prediction models, in contrast to the large body of work on bankruptcy prediction models. Therefore, this study develops and verifies a detection model for companies designated as administrative issues using financial data of KOSDAQ companies. In this study, logistic regression and decision tree are proposed as the data mining models for detecting administrative issues. According to the results of the analysis, the logistic regression model predicted the companies designated as administrative issues using three variables, ROE (earnings before tax), cash flows/shareholders' equity, and asset turnover ratio, and its overall accuracy was 86% for the validation dataset. The decision tree (Classification and Regression Trees, CART) model applied classification rules using cash flows/total assets and ROA (net income), and its overall accuracy reached 87%. The implications of the financial indicators selected in our logistic regression and decision tree models are as follows. First, ROE (earnings before tax) in the logistic detection model shows the profit and loss of the business segments that will continue, excluding the revenue and expenses of discontinued businesses. Therefore, a weakening of this variable means that the competitiveness of the core business is weakened. If a large part of the profit is generated from one-off gains, it is very likely that the deterioration of business management will intensify further. As the ROE of a KOSDAQ company decreases significantly, it becomes highly likely that the company will be delisted. Second, cash flows to shareholders' equity represents the firm's ability to generate cash flow excluding the financial condition of its subsidiaries. In other words, a weakening of the management capacity of the parent company, excluding the subsidiaries' competence, can be a main reason for an increased possibility of administrative issue designation. Third, a low asset turnover ratio means that current and non-current assets are used ineffectively by the corporation, or that the corporation's asset investment is excessive. If the asset turnover ratio of a KOSDAQ-listed company decreases, it is necessary to examine corporate activities in detail from various perspectives, such as weakening sales or changes in inventories. Cash flows/total assets, the variable selected by the decision tree detection model, is a key indicator of the company's cash condition and its ability to generate cash from operating activities. Cash flow indicates whether a firm can perform its main activities (maintaining its operating ability, repaying debts, paying dividends, and making new investments) without relying on external financial resources. Therefore, if this variable is negative (-), it indicates that the company may have serious problems in its business activities. If the cash flow from operating activities of a company is smaller than its net profit, the net profit has not been converted into cash, which indicates a serious problem in managing the company's trade receivables and inventory assets. Therefore, as cash flows/total assets decrease, the probability of administrative issue designation and the probability of delisting increase. In summary, the logistic regression-based detection model in this study was found to be affected by the company's financial activities including ROE (earnings before tax), whereas the decision tree-based detection model predicts the designation based on the company's cash flows.
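As an illustration of the modeling step described above (not the authors' code), the following sketch fits a logistic regression and a CART-style decision tree on synthetic data whose column names merely echo the financial ratios named in the abstract; the label-generating rule is invented for demonstration.

```python
# Minimal sketch with synthetic data; column names and the label rule are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "roe_before_tax": rng.normal(0.05, 0.15, n),
    "cashflow_to_equity": rng.normal(0.10, 0.20, n),
    "asset_turnover": rng.normal(1.0, 0.4, n),
    "cashflow_to_assets": rng.normal(0.08, 0.15, n),
    "roa_net_income": rng.normal(0.03, 0.10, n),
})
# Hypothetical label: 1 = designated as administrative issue, 0 = not designated.
logit = -3 * df["roe_before_tax"] - 2 * df["cashflow_to_assets"] - 0.5
df["designated"] = (1 / (1 + np.exp(-logit)) > rng.random(n)).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    df.drop(columns="designated"), df["designated"], test_size=0.3, random_state=0)

# Logistic regression detection model.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic regression validation accuracy:", logreg.score(X_valid, y_valid))

# CART-style decision tree detection model.
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("decision tree validation accuracy:", cart.score(X_valid, y_valid))
```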

Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn; Chung, Yeojin; Lee, Jaejoon; Yang, Jiheon
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.71-88 / 2017
  • Language models were originally developed for speech recognition and language processing. Using a set of example sentences, a language model predicts the next word or character based on sequential input data. N-gram models have been widely used, but they cannot model the correlation between input units efficiently since they are probabilistic models based on the frequency of each unit in the training set. Recently, as deep learning algorithms have developed, recurrent neural network (RNN) models and long short-term memory (LSTM) models have been widely used as neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependency between the objects that are entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). In order to train a neural language model, texts need to be decomposed into words or morphemes. However, since a training set of sentences generally includes a huge number of words or morphemes, the dictionary becomes very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, with highly morphological languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to cause errors in the decomposition process (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit of Korean text. We construct the language model using three or four LSTM layers. Each model was trained using the stochastic gradient algorithm and more advanced optimization algorithms such as Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. A simulation study was carried out with Old Testament texts using the deep learning package Keras with the Theano backend. After pre-processing the texts, the dataset included 74 unique characters including vowels, consonants, and punctuation marks. We then constructed input vectors of 20 consecutive characters with the following 21st character as the output. In total, 1,023,411 input-output pairs were included in the dataset, and we divided them into training, validation, and test sets in a 70:15:15 ratio. All simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss function evaluated on the validation set, the perplexity evaluated on the test set, and the training time of each model. As a result, all the optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, which were clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm required the longest training time for both the 3- and 4-LSTM models. On average, the 4-LSTM-layer model took 69% longer to train than the 3-LSTM-layer model, while its validation loss and perplexity were not significantly improved, and even became worse under some conditions. On the other hand, when comparing the automatically generated sentences, the 4-LSTM-layer model tended to generate sentences closer to natural language than the 3-LSTM model. Although there were slight differences in the completeness of the generated sentences between the models, the sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfect grammatically. The results of this study are expected to be widely used for Korean language processing in the fields of language processing and speech recognition, which are the basis of artificial intelligence systems.
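The sketch below shows the general shape of such a character-level LSTM language model in current tf.keras rather than the paper's Keras/Theano setup; the 74-symbol vocabulary, the 20-character input window, and the three stacked LSTM layers follow the abstract, while the layer width, random training data, and perplexity computation are illustrative assumptions.

```python
# Sketch of a phoneme(character)-level LSTM language model; random one-hot data
# stands in for the Old Testament corpus used in the paper.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 74   # unique characters after preprocessing, per the abstract
SEQ_LEN = 20      # 20 consecutive characters as input, predicting the 21st

rng = np.random.default_rng(0)
x = np.eye(VOCAB_SIZE)[rng.integers(0, VOCAB_SIZE, size=(1000, SEQ_LEN))]
y = rng.integers(0, VOCAB_SIZE, size=1000)

# Three stacked LSTM layers, as in the paper's 3-LSTM model (width is an assumption).
model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN, VOCAB_SIZE)),
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(256),
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, batch_size=64, epochs=1, validation_split=0.15)

# Perplexity on held-out data = exp(average cross-entropy loss).
loss = model.evaluate(x[:100], y[:100], verbose=0)
print("perplexity:", float(np.exp(loss)))
```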

A Prospect on the Changes in Short-term Cold Hardiness in "Campbell Early" Grapevine under the Future Warmer Winter in South Korea (남한의 겨울기온 상승 예측에 따른 포도 "캠벨얼리" 품종의 단기 내동성 변화 전망)

  • Chung, U-Ran; Yun, Jin-I.
    • Korean Journal of Agricultural and Forest Meteorology / v.10 no.3 / pp.94-101 / 2008
  • Warming trends during winter in East Asia are expected to accelerate in the future according to the climate projections of the Intergovernmental Panel on Climate Change (IPCC). Warmer winters may affect the short-term cold hardiness of deciduous fruit trees, yet phenological observations are scant compared with the long-term climate records in the region. Dormancy depth, which can be estimated from daily temperature, is expected to serve as a reasonable proxy for the physiological tolerance of flowering buds to low temperature in winter. In order to delineate the geographical pattern of short-term cold hardiness in grapevines, a selected dormancy depth model was parameterized for "Campbell Early", the major cultivar in South Korea. Gridded data sets of daily maximum and minimum temperature with a 270 m cell spacing ("High Definition Digital Temperature Map", HDDTM) were prepared for the current climatological normal year (1971-2000) based on observations at the 56 Korea Meteorological Administration (KMA) stations and a geospatial interpolation scheme correcting for land surface effects (e.g., land use, topography, and site elevation). To generate the corresponding datasets for future climatological normal years, we combined a 25 km-resolution, 2011-2100 temperature projection dataset covering South Korea (based on the IPCC SRES A2 scenario) with the 1971-2000 HDDTM. The dormancy depth model was run with the gridded datasets to estimate the geographical pattern of change in the cold-hardiness period (the number of days between endodormancy and forced dormancy release) across South Korea for the normal years 1971-2000, 2011-2040, 2041-2070, and 2071-2100. The results showed that the cold-hardiness zone with a cold-tolerant period of 60 days or longer would diminish from 58% of the total land area of South Korea in 1971-2000 to 40% in 2011-2040, 14% in 2041-2070, and less than 3% in 2071-2100. This method can be applied to other deciduous fruit trees to delineate the geographical shift of cold-hardiness zones under projected climate change, thereby providing valuable information for adaptation strategies in the fruit industry.
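The abstract does not reproduce the dormancy depth model itself, so the sketch below only illustrates the general shape of the computation: per grid cell, accumulate a (made-up) chilling quantity until endodormancy release, then a forcing quantity until forced dormancy release, and count the days in between. All thresholds, unit curves, and the synthetic gridded temperatures are invented for illustration.

```python
# Purely illustrative sketch; the chilling/forcing formulas and thresholds are
# placeholders, not the dormancy depth model used in the study.
import numpy as np

CHILL_REQ = 800.0  # hypothetical chilling requirement
FORCE_REQ = 250.0  # hypothetical forcing requirement
T_BASE = 5.0       # hypothetical base temperature for forcing

def cold_hardiness_days(tmean: np.ndarray) -> np.ndarray:
    """Days between endodormancy release and forced dormancy release per grid cell."""
    days, rows, cols = tmean.shape
    result = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            chill = force = 0.0
            endo_day = None
            for d in range(days):
                t = tmean[d, i, j]
                if endo_day is None:
                    chill += max(0.0, 7.0 - abs(t - 6.0))  # made-up chill unit curve
                    if chill >= CHILL_REQ:
                        endo_day = d
                else:
                    force += max(0.0, t - T_BASE)          # simple degree-day forcing
                    if force >= FORCE_REQ:
                        result[i, j] = d - endo_day
                        break
    return result

# Example on a tiny synthetic winter-season grid (150 days, 3 x 3 cells).
rng = np.random.default_rng(1)
tmean = rng.normal(loc=2.0, scale=5.0, size=(150, 3, 3))
print(cold_hardiness_days(tmean))
```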

A Study on the Development Trend of Artificial Intelligence Using Text Mining Technique: Focused on Open Source Software Projects on Github (텍스트 마이닝 기법을 활용한 인공지능 기술개발 동향 분석 연구: 깃허브 상의 오픈 소스 소프트웨어 프로젝트를 대상으로)

  • Chong, JiSeon; Kim, Dongsung; Lee, Hong Joo; Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.1-19 / 2019
  • Artificial intelligence (AI) is one of the main driving forces leading the Fourth Industrial Revolution. The technologies associated with AI have already shown abilities equal to or better than those of people in many fields, including image and speech recognition. In particular, many efforts have been made to identify current technology trends and analyze their development directions, because AI technologies can be utilized in a wide range of fields including medicine, finance, manufacturing, services, and education. Major platforms that can develop complex AI algorithms for learning, reasoning, and recognition have been opened to the public as open source projects. As a result, technologies and services that utilize them have increased rapidly, which has been confirmed as one of the major reasons for the fast development of AI technologies. Additionally, the spread of the technology is greatly indebted to open source software developed by major global companies that supports natural language recognition, speech recognition, and image recognition. Therefore, this study aimed to identify the practical trend of AI technology development by analyzing OSS projects associated with AI, which have been developed through the online collaboration of many parties. This study searched and collected a list of major AI-related projects generated on Github from 2000 to July 2018, and confirmed the development trends of major technologies in detail by applying text mining techniques to the topic information, which indicates the characteristics of the collected projects and their technical fields. The results of the analysis showed that the number of software development projects per year was less than 100 until 2013. However, it increased to 229 projects in 2014 and 597 projects in 2015, and the number of open source projects related to AI increased rapidly in 2016 (2,559 OSS projects). The number of projects initiated in 2017 was 14,213, which is almost four times the total number of projects generated from 2009 to 2016 (3,555 projects), and the number of projects initiated from January to July 2018 was 8,737. The development trend of AI-related technologies was evaluated by dividing the study period into three phases, with the appearance frequency of topics indicating the technology trends of AI-related OSS projects. The results showed that natural language processing technology remained at the top in all years, implying that such OSS has been developed continuously. Until 2015, the programming languages Python, C++, and Java were listed among the top ten most frequently appearing topics. However, after 2016, programming languages other than Python disappeared from the top ten topics. In their place, platforms supporting the development of AI algorithms, such as TensorFlow and Keras, showed high appearance frequencies. Additionally, reinforcement learning algorithms and convolutional neural networks, which have been used in various fields, appeared frequently as topics. The results of the topic network analysis showed that the most important topics by degree centrality were similar to those by appearance frequency. The main difference was that visualization and medical imaging topics were found at the top of the list, although they were not at the top from 2009 to 2012; this indicates that OSS was developed in the medical field in order to utilize AI technology. Moreover, although computer vision was in the top 10 of the appearance frequency list from 2013 to 2015, it was not in the top 10 of degree centrality. The topics at the top of the degree centrality list were similar to those at the top of the appearance frequency list, with the ranks of convolutional neural networks and reinforcement learning changing slightly. The trend of technology development was examined using the appearance frequency of topics and degree centrality. The results showed that machine learning had the highest frequency and the highest degree centrality in all years. Moreover, it is noteworthy that, although the deep learning topic showed low frequency and low degree centrality between 2009 and 2012, its rank increased abruptly between 2013 and 2015, and in recent years both topics have had high appearance frequency and degree centrality. TensorFlow first appeared during the 2013-2015 phase, and its appearance frequency and degree centrality soared between 2016 and 2018, placing it at the top of the lists after deep learning and Python. Computer vision and reinforcement learning did not show an abrupt increase or decrease, and they had relatively low appearance frequency and degree centrality compared with the above-mentioned topics. Based on these analysis results, it is possible to identify the fields in which AI technologies are actively developed. The results of this study can be used as a baseline dataset for more empirical analysis of future technology trends and their convergence.
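As a small sketch of the two measures used above, the snippet below counts topic appearance frequency and computes degree centrality on a topic co-occurrence network with networkx; the project-topic lists are invented examples, not the collected Github dataset.

```python
# Illustrative only: made-up repository topic sets replace the study's data.
from collections import Counter
from itertools import combinations

import networkx as nx

# Each inner list is the topic set of one hypothetical AI-related repository.
projects = [
    ["machine-learning", "deep-learning", "tensorflow", "python"],
    ["machine-learning", "natural-language-processing", "python"],
    ["deep-learning", "computer-vision", "keras"],
    ["reinforcement-learning", "deep-learning", "tensorflow"],
]

# Appearance frequency of each topic across projects.
frequency = Counter(topic for topics in projects for topic in topics)
print(frequency.most_common(5))

# Topic co-occurrence network: an edge links topics appearing in the same project.
graph = nx.Graph()
for topics in projects:
    graph.add_edges_from(combinations(sorted(set(topics)), 2))

# Degree centrality identifies the topics most connected to other topics.
centrality = nx.degree_centrality(graph)
for topic, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{topic}: {score:.2f}")
```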

Value of Information Technology Outsourcing: An Empirical Analysis of Korean Industries (IT 아웃소싱의 가치에 관한 연구: 한국 산업에 대한 실증분석)

  • Han, Kun-Soo; Lee, Kang-Bae
    • Asia Pacific Journal of Information Systems / v.20 no.3 / pp.115-137 / 2010
  • Information technology (IT) outsourcing, the use of a third-party vendor to provide IT services, started in the late 1980s and early 1990s in Korea and has increased rapidly since 2000. Recently, firms have increased their efforts to capture greater value from IT outsourcing. To date, there have been a large number of studies on IT outsourcing. Most prior studies have focused on outsourcing practices and decisions, and little attention has been paid to objectively measuring the value of IT outsourcing. In addition, studies that examined the performance of IT outsourcing have mainly relied on anecdotal evidence or practitioners' perceptions. Our study examines the contribution of IT outsourcing to economic growth in Korean industries over the 1990 to 2007 period, using a production function framework and a panel data set for 54 industries constructed from input-output tables, fixed-capital formation tables, and employment tables. Based on the framework and estimation procedures that Han, Kauffman and Nault (2010) used to examine the economic impact of IT outsourcing in U.S. industries, we evaluate the impact of IT outsourcing on output and productivity in Korean industries. Because IT outsourcing started to grow at a significantly more rapid pace in 2000, we compare its impact in the pre- and post-2000 periods. Our industry-level panel data cover a large proportion of the Korean economy, 54 out of 58 industries, which allows us greater opportunity to assess the impacts of IT outsourcing on objective performance measures such as output and productivity. Using IT outsourcing and IT capital as our primary independent variables, we employ an extended Cobb-Douglas production function in which both variables are treated as factor inputs. We also derive and estimate a labor productivity equation to assess the impact of our IT variables on labor productivity. We use data from seven years (1990, 1993, 2000, 2003, 2005, 2006, and 2007) for which both input-output tables and fixed-capital formation tables are available. Combining the input-output tables and fixed-capital formation tables resulted in 54 industries. IT outsourcing is measured as the value of computer-related services purchased by each industry in a given year. All the variables have been converted to 2000 Korean Won using GDP deflators. To calculate labor hours, we use the average work hours for each sector provided by the OECD. To effectively control for the heteroskedasticity and autocorrelation present in our dataset, we use feasible generalized least squares (FGLS) procedures. Because the AR1 process may be industry-specific (i.e., panel-specific), we consider both common AR1 and panel-specific AR1 (PSAR1) processes in our estimations. We also include year dummies to control for year-specific effects common across industries, and sector dummies (as defined in the GDP deflator) to control for time-invariant sector-specific effects. Based on the full sample of 378 observations, we find that a 1% increase in IT outsourcing is associated with a 0.012~0.014% increase in gross output and a 1% increase in IT capital is associated with a 0.024~0.027% increase in gross output. To compare the contribution of IT outsourcing relative to that of IT capital, we examined the gross marginal product (GMP). The average GMP of IT outsourcing was 6.423, which is substantially greater than that of IT capital at 2.093. This indicates that, on average, if an industry invests KRW 1 million in IT outsourcing, it can increase its output by KRW 6.4 million. In terms of the contribution to labor productivity, we find that a 1% increase in IT outsourcing is associated with a 0.009~0.01% increase in labor productivity, while a 1% increase in IT capital is associated with a 0.024~0.025% increase. Overall, our results indicate that IT outsourcing has made positive and economically meaningful contributions to output and productivity in Korean industries over the 1990 to 2007 period. The average GMP of IT outsourcing we report for Korean industries is 1.44 times greater than that reported for U.S. industries in Han et al. (2010). Further, we find that the contribution of IT outsourcing has been significantly greater in the 2000~2007 period, during which the growth of IT outsourcing accelerated. Our study provides implications for policymakers and managers. First, our results suggest that Korean industries can capture further benefits by increasing investments in IT outsourcing. Second, our analyses and results provide a basis for managers to assess the impact of investments in IT outsourcing and IT capital in an objective and quantitative manner. Building on our study, future research should examine the impact of IT outsourcing at a more detailed industry level and at the firm level.
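The sketch below shows the basic form of the estimation: a log-log (Cobb-Douglas) output equation with IT outsourcing and IT capital as additional factor inputs and year dummies. It is a simplified OLS illustration on synthetic panel data, not the paper's FGLS estimation with common or panel-specific AR(1) errors, and all variable names and elasticities are placeholders.

```python
# Simplified sketch: log-log OLS approximation of the extended Cobb-Douglas model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 378  # industry-year observations, as in the paper's full sample
panel = pd.DataFrame({
    "year": rng.choice([1990, 1993, 2000, 2003, 2005, 2006, 2007], size=n),
    "capital": rng.lognormal(10, 1, n),
    "labor": rng.lognormal(8, 1, n),
    "it_capital": rng.lognormal(7, 1, n),
    "it_outsourcing": rng.lognormal(6, 1, n),
})
# Synthetic gross output generated with small positive IT elasticities.
panel["output"] = (panel["capital"] ** 0.4 * panel["labor"] ** 0.5 *
                   panel["it_capital"] ** 0.025 * panel["it_outsourcing"] ** 0.013 *
                   rng.lognormal(0, 0.1, n))

# ln(Y) = a + b1 ln(K) + b2 ln(L) + b3 ln(IT capital) + b4 ln(IT outsourcing) + year effects
model = smf.ols(
    "np.log(output) ~ np.log(capital) + np.log(labor) + np.log(it_capital)"
    " + np.log(it_outsourcing) + C(year)",
    data=panel,
).fit()
print(model.params.filter(like="np.log"))  # estimated output elasticities
```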

A Two-Stage Learning Method of CNN and K-means RGB Cluster for Sentiment Classification of Images (이미지 감성분류를 위한 CNN과 K-means RGB Cluster 이-단계 학습 방안)

  • Kim, Jeongtae; Park, Eunbi; Han, Kiwoong; Lee, Junghyun; Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.139-156 / 2021
  • The biggest reason for using a deep learning model in image classification is that it can consider the relationships between regions by extracting each region's features from the overall information of the image. However, a CNN model may not be suitable for emotional image data that lack such regional features. To address the difficulty of classifying emotion images, many researchers propose CNN-based architectures suited to emotion images every year. Studies on the relationship between color and human emotion have also been conducted, showing that different colors induce different emotions. Among studies using deep learning, some apply color information to image sentiment classification: using the image's color information in addition to the image itself improves the accuracy of classifying image emotions compared with training the classification model on the image alone. This study proposes two ways to increase accuracy by adjusting the output value after the model classifies an image's emotion. Both methods modify the output value based on statistics derived from the colors of the picture. In the first method, the most prevalent two-color combinations are found for all training data; during testing, the most prevalent two-color combination of each test image is found and the output values are corrected according to the color combination distribution. The second method weights the output value obtained after the model classifies an image's emotion using expressions based on the log and exponential functions. Emotion6, which is classified into six emotions, and Artphoto, which is classified into eight categories, were used as the image data. Densenet169, Mnasnet, Resnet101, Resnet152, and Vgg19 architectures were used for the CNN model, and performance was compared before and after applying the two-stage learning to the CNN model. Inspired by color psychology, which deals with the relationship between colors and emotions, we studied how to improve accuracy by modifying the output values based on color when creating a model that classifies an image's sentiment. Sixteen colors were used: red, orange, yellow, green, blue, indigo, purple, turquoise, pink, magenta, brown, gray, silver, gold, white, and black. Using Scikit-learn's clustering, the seven colors most prevalent in each image are identified. The RGB coordinates of these colors are then compared with the RGB coordinates of the 16 colors above; that is, each is converted to the closest reference color. If combinations of three or more colors were used, too many color combinations would occur, the distribution would become scattered, and the combinations would have less influence on the output value. Therefore, to solve this problem, two-color combinations were found and weighted into the model. Before training, the most prevalent color combinations were found for all training images, and the distribution of color combinations for each class was stored in a Python dictionary for use during testing. During testing, the most prevalent two-color combination of each test image is found; how that combination was distributed in the training data is then checked, and the result is corrected accordingly. We devised several equations to weight the output value from the model based on the extracted colors as described above. The data set was randomly divided 80:20, and the model was verified using 20% of the data as a test set. After splitting the remaining 80% of the data into five folds to perform 5-fold cross-validation, the model was trained five times using different validation datasets. Finally, the performance was checked using the test dataset that had been separated beforehand. Adam was used as the optimizer, and the learning rate was set to 0.01. Training was performed for up to 20 epochs, and if the validation loss did not decrease for five epochs, training was stopped; early stopping was set to load the model with the best validation loss. The classification accuracy was better when the extracted color information was used together with the CNN than when only the CNN architecture was used.
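As an illustration of the color-extraction step described above (not the authors' code), the sketch below clusters an image's pixels with scikit-learn's KMeans and snaps the dominant cluster centers to the nearest of a few reference RGB colors; the reference palette shown is only a hypothetical subset of the 16 colors named in the abstract.

```python
# Illustrative sketch: extract an image's dominant two-color combination.
import numpy as np
from sklearn.cluster import KMeans

REFERENCE_RGB = {  # hypothetical subset of the 16 reference colors
    "red": (255, 0, 0), "orange": (255, 165, 0), "yellow": (255, 255, 0),
    "green": (0, 128, 0), "blue": (0, 0, 255), "purple": (128, 0, 128),
    "white": (255, 255, 255), "black": (0, 0, 0), "gray": (128, 128, 128),
}

def dominant_color_pair(image: np.ndarray, n_clusters: int = 7) -> tuple:
    """Return the two most prevalent reference colors of an (H, W, 3) RGB image."""
    pixels = image.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    # Order cluster centers by how many pixels each covers.
    counts = np.bincount(km.labels_, minlength=n_clusters)
    centers = km.cluster_centers_[np.argsort(counts)[::-1]]

    names = []
    for center in centers:
        # Snap the cluster center to the closest reference color (Euclidean in RGB).
        dists = {name: np.linalg.norm(center - np.array(rgb))
                 for name, rgb in REFERENCE_RGB.items()}
        name = min(dists, key=dists.get)
        if name not in names:
            names.append(name)
        if len(names) == 2:
            break
    return tuple(names)

# Example on a random image; the returned pair would index the per-class
# color-combination distribution used to re-weight the CNN's output.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(dominant_color_pair(fake_image))
```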

Bankruptcy prediction using an improved bagging ensemble (개선된 배깅 앙상블을 활용한 기업부도예측)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems / v.20 no.4 / pp.121-139 / 2014
  • Predicting corporate failure has been an important topic in accounting and finance. The costs associated with bankruptcy are high, so the accuracy of bankruptcy prediction is of great importance to financial institutions. Many researchers have dealt with bankruptcy prediction over the past three decades. The current research attempts to use ensemble models to improve the performance of bankruptcy prediction. Ensemble classification combines individually trained classifiers in order to obtain more accurate predictions than individual models, and ensemble techniques have been shown to be very useful for improving the generalization ability of a classifier. Bagging is the most commonly used method for constructing ensemble classifiers. In bagging, different training data subsets are randomly drawn with replacement from the original training dataset, and base classifiers are trained on the different bootstrap samples. Instance selection selects critical instances while removing irrelevant and harmful instances from the original set. Instance selection and bagging are quite well known in data mining; however, few studies have dealt with their integration. This study proposes an improved bagging ensemble based on instance selection using genetic algorithms (GA) for improving the performance of SVM. GA is an efficient optimization procedure based on the theory of natural selection and evolution. GA uses the idea of survival of the fittest by progressively accepting better solutions to the problem, searching by maintaining a population of solutions from which better solutions are created rather than making incremental changes to a single solution. The initial solution population is generated randomly and evolves into the next generation through genetic operators such as selection, crossover, and mutation. Solutions encoded as strings are evaluated by the fitness function. The proposed model consists of two phases: GA-based instance selection and instance-based bagging. In the first phase, GA is used to select the optimal instance subset that is used as the input data of the bagging model. In this study, the chromosome is encoded as a binary string representing the instance subset. In this phase, the population size was set to 100 and the maximum number of generations to 150, with the crossover rate and mutation rate set to 0.7 and 0.1, respectively. We used the prediction accuracy of the model as the fitness function of the GA: the SVM model is trained on the training data set using the selected instance subset, and its prediction accuracy over the test data set is used as the fitness value in order to avoid overfitting. In the second phase, we used the optimal instance subset selected in the first phase as the input data of the bagging model, with SVM as the base classifier of the bagging ensemble and majority voting as the combining method. This study applies the proposed model to the bankruptcy prediction problem using a real data set from Korean companies. The research data contain 1,832 externally non-audited firms, consisting of firms that filed for bankruptcy (916 cases) and non-bankrupt firms (916 cases). Financial ratios categorized as stability, profitability, growth, activity, and cash flow were investigated through a literature review and basic statistical methods, and we selected 8 financial ratios as the final input variables. We separated the whole data set into training, test, and validation subsets. We compared the proposed model with several comparative models, including the simple individual SVM model, the simple bagging model, and the instance selection based SVM model. McNemar tests were used to examine whether the proposed model significantly outperforms the other models. The experimental results show that the proposed model outperforms the other models.
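The sketch below illustrates the two-phase idea on synthetic data: a bare-bones binary GA selects a training instance subset using held-out SVM accuracy as fitness, and a bagging ensemble of SVMs is then trained on the selected instances. The GA operators, the small population and generation counts, and the data are simplifying assumptions, not the authors' implementation.

```python
# Simplified two-phase sketch: GA-based instance selection, then SVM bagging.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))  # 8 synthetic "financial ratios"
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 1, 600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(mask: np.ndarray) -> float:
    """Accuracy of an SVM trained on the selected instances, scored on held-out data."""
    if mask.sum() < 10:
        return 0.0
    sel = mask.astype(bool)
    clf = SVC(kernel="rbf").fit(X_train[sel], y_train[sel])
    return clf.score(X_test, y_test)

# Phase 1: binary GA over instance subsets (population 30, 20 generations here;
# the paper used a population of 100 over 150 generations).
pop = rng.integers(0, 2, size=(30, len(X_train)))
for _ in range(20):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]   # selection: keep the best 10
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(0, len(parents), 2)]
        cut = rng.integers(1, len(a))               # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(len(child)) < 0.1         # mutation rate 0.1
        child[flip] = 1 - child[flip]
        children.append(child)
    pop = np.vstack([parents, children])

best_mask = pop[np.argmax([fitness(ind) for ind in pop])].astype(bool)

# Phase 2: bagging ensemble of SVMs trained on the GA-selected instances.
bagging = BaggingClassifier(SVC(kernel="rbf"), n_estimators=10, random_state=0)
bagging.fit(X_train[best_mask], y_train[best_mask])
print("test accuracy:", bagging.score(X_test, y_test))
```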