• Title/Summary/Keyword: logistic (로지스틱)

Perceptions of Married Women on Childbirth and Sex Preference and Related Factors in Gyeongju, Korea (도농복합지역 기혼여성들의 출산과 성 선호에 대한 인식 및 관련요인)

  • Youm, Seog-Heon;Kang, Pock-Soo;Kim, Chang-Yoon;Lee, Kyeong-Soo;Hwang, Tae-Yoon;Hwang, In-Sob
    • Journal of agricultural medicine and community health
    • /
    • v.35 no.3
    • /
    • pp.260-273
    • /
    • 2010
  • Objectives: The purpose of this study was to investigate the perceptions of married Korean women regarding marriage and childbirth, and their awareness of childbirth-related issues such as low birth rates, sex preferences and sex imbalances in Korea. Methods: A total of 453 married women aged 20 or older were randomly selected from four urban districts and five rural districts out of 25 districts in Gyeongju, a consolidated city located in Gyeongsangbuk-do Province, South Korea. The survey was conducted from December 2005 to February 2006. A total of 392 out of 453 questionnaires (86.5% response rate) were collected, and 44 incomplete questionnaires were excluded, leaving 348 completed questionnaires for data analysis. Respondents were divided into three age groups: 49 or younger, 50 to 69, and 70 or older. Results: Women's perceptions of marriage were associated with age (p<0.01). Perceptions about childbirth were also significantly related to age (p<0.01), type of residential area (p<0.01) and education level (p<0.05). Sex preferences were significantly related to age (p<0.05) and occupation (p<0.01). Of the respondents aged 49 or younger, 34.8% indicated that the ideal number of children is two, while 25.5% of respondents aged 50 to 69 and 15.3% of respondents aged 70 and 33.7% of respondents aged 70 or older considered four children to be the ideal number. Perceptions of sex imbalance were significantly related to socioeconomic status (p<0.01) and occupation (p<0.01). The largest number of respondents cited "economic burden" as the main reason for low birth rates. Multiple logistic regressions were performed for all three age groups using male sex preference as the dependent variable under the assumption that respondents can have only a single child. Socioeconomic status (p<0.01) and residential area (p<0.05) were significant variables for those aged 49 or younger.
Education level (p<0.05) and residential area (p<0.01) were statistically significant predictors of preferring a son, assuming only one child, for respondents aged 50 to 69. We did not detect any significant independent variables among respondents aged 70 or older. Conclusions: Our results highlight the necessity of developing policies and public education programs to explain the consequences of low birth rates and sex imbalances in Korea. As increasing numbers of women work outside the home, it is important for the government and employers to provide social and working environments where women do not consider marriage and childbirth to be obstacles to social and business activities.
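
The sample accounting in the Methods section can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the abstract; the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the abstract above.
distributed = 453   # questionnaires handed out
collected = 392     # questionnaires returned
incomplete = 44     # returned but excluded as incomplete

# Response rate in percent, rounded to one decimal place as in the abstract.
response_rate = round(100 * collected / distributed, 1)

# Questionnaires that remain for analysis after dropping incomplete ones.
analyzed = collected - incomplete

print(f"response rate: {response_rate}%")
print(f"questionnaires analyzed: {analyzed}")
```

Both figures agree with the 86.5% response rate and the 348 analyzed questionnaires reported above.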

A Time Series Graph based Convolutional Neural Network Model for Effective Input Variable Pattern Learning : Application to the Prediction of Stock Market (효과적인 입력변수 패턴 학습을 위한 시계열 그래프 기반 합성곱 신경망 모형: 주식시장 예측에의 응용)

  • Lee, Mo-Se;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.167-181
    • /
    • 2018
  • Over the past decade, deep learning has been in the spotlight among various machine learning algorithms. In particular, CNN (Convolutional Neural Network), which is known as an effective solution for recognizing and classifying images or voices, has been popularly applied to classification and prediction problems. In this study, we investigate how to apply CNN to business problem solving. Specifically, this study proposes to apply CNN to stock market prediction, one of the most challenging tasks in machine learning research. As mentioned, CNN has strength in interpreting images. Thus, the model proposed in this study adopts CNN as a binary classifier that predicts stock market direction (upward or downward) by using time series graphs as its inputs. That is, our proposal is to build a machine learning algorithm that mimics experts called 'technical analysts', who examine the graph of past price movements and predict future financial price movements. Our proposed model, named 'CNN-FG (Convolutional Neural Network using Fluctuation Graph)', consists of five steps. In the first step, it divides the dataset into intervals of 5 days. In step 2, it creates time series graphs for the divided dataset. The size of the image in which the graph is drawn is $40{\times}40$ pixels, and the graph of each independent variable is drawn in a different color. In step 3, the model converts the images into matrices. Each image is converted into a combination of three matrices in order to express the color values on the R (red), G (green), and B (blue) scales. In the next step, it splits the dataset of graph images into training and validation datasets. We used 80% of the total dataset as the training dataset, and the remaining 20% as the validation dataset. In the final step, CNN classifiers are trained using the images of the training dataset.
Regarding the parameters of CNN-FG, we adopted two convolution filters ($5{\times}5{\times}6$ and $5{\times}5{\times}9$) in the convolution layer. In the pooling layer, a $2{\times}2$ max pooling filter was used. The numbers of nodes in the two hidden layers were set to 900 and 32, respectively, and the number of nodes in the output layer was set to 2 (one for the prediction of an upward trend, and the other for a downward trend). The activation functions for the convolution layer and the hidden layers were set to ReLU (Rectified Linear Unit), and the one for the output layer was set to the Softmax function. To validate CNN-FG, we applied it to the prediction of KOSPI200 for 2,026 days in eight years (from 2009 to 2016). To match the proportions of the two groups in the dependent variable (i.e., the next day's stock market movement), we selected 1,950 samples by random sampling. Finally, we built the training dataset using 80% of the total dataset (1,560 samples), and the validation dataset using 20% (390 samples). The independent variables of the experimental dataset included twelve technical indicators popularly used in previous studies. They include Stochastic %K, Stochastic %D, Momentum, ROC (rate of change), LW %R (Larry Williams' %R), A/D oscillator (accumulation/distribution oscillator), OSCP (price oscillator), CCI (commodity channel index), and so on. To confirm the superiority of CNN-FG, we compared its prediction accuracy with those of other classification models. Experimental results showed that CNN-FG outperforms LOGIT (logistic regression), ANN (artificial neural network), and SVM (support vector machine) with statistical significance. These empirical results imply that converting time series business data into graphs and building CNN-based classification models using these graphs can be effective from the perspective of prediction accuracy.
Thus, this paper sheds light on how to apply deep learning techniques to the domain of business problem solving.
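
The first and fourth steps described above (cutting the series into 5-day intervals, then splitting 80/20) can be sketched roughly as follows. This is not the authors' code: the abstract does not say whether the 5-day intervals overlap, so the sliding window used here is an assumption, and `make_windows` and `split_train_validation` are hypothetical helper names. The graph-drawing and CNN-training steps are omitted.

```python
def make_windows(closes, window=5):
    """Pair each 5-day window with a direction label.

    The label is 1 if the close on the day right after the window is
    higher than the last close inside the window, otherwise 0.
    Assumption: windows slide forward one day at a time.
    """
    samples = []
    for start in range(len(closes) - window):
        chunk = closes[start:start + window]
        next_close = closes[start + window]
        label = 1 if next_close > chunk[-1] else 0
        samples.append((chunk, label))
    return samples


def split_train_validation(samples, train_percent=80):
    """Split samples into training and validation parts by percentage."""
    cut = len(samples) * train_percent // 100
    return samples[:cut], samples[cut:]


# Synthetic closing prices, for illustration only.
prices = [100.0, 101.5, 101.0, 102.2, 103.0, 102.4, 104.1, 105.0]
for chunk, label in make_windows(prices):
    print(chunk, "->", "up" if label else "down")
```

With 1,950 samples, `split_train_validation` yields 1,560 training and 390 validation samples, matching the sizes quoted in the abstract.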

Bioacoustics and Habitat Environment Analysis of Cicadas in Taebaeksan National Park (태백산국립공원에 서식하는 매미류의 생물음향 및 서식환경 분석)

  • Kim, Yoon-Jae;Jung, Tae-Jun;Ki, Kyong-Seok
    • Korean Journal of Environment and Ecology
    • /
    • v.33 no.6
    • /
    • pp.664-676
    • /
    • 2019
  • This study aimed to analyze the bioacoustics and habitat environment of the cicadas inhabiting Taebaeksan National Park, a subalpine region in Korea. The mating calls of the cicadas were recorded for approximately 3 months, between July and September of 2018. The recording devices were installed in Daedeoksan valley and Baekcheon valley, inside Taebaeksan National Park, and the sounds were recorded 24 hours a day. In order to obtain the habitat distribution data of the cicadas, the sounds were recorded from 111 spots located along the Taebaeksan National Park trail in August 2018. The daily weather data were obtained from the Taebaek city weather center. The results of the study demonstrated that 5 species of cicadas inhabit Taebaeksan National Park, namely, Leptosemia takanonis, Lyristes intermedius, Kosemia yezoensis, Hyalessa fuscata, and Meimuna opalifera. The time of appearance for L. takanonis was early July to mid-July, and that for L. intermedius, K. yezoensis, H. fuscata, and M. opalifera was mid-July to early September. Analysis of the circadian rhythm revealed that L. intermedius, K. yezoensis, and H. fuscata started producing mating calls between 6:00 and 7:00, and stopped at around 19:00 for all three species. The peak time for producing mating calls was 11:00 for L. intermedius, 12:00 for H. fuscata, and around 13:00 to 14:00 for K. yezoensis. The environmental factors influencing the mating calls of the cicadas inhabiting Taebaeksan National Park were analyzed by logistic regression. The results showed that the probability of producing mating calls increased by 1.192 and 1.279 times in L. intermedius and K. yezoensis, respectively, when the average temperature increased by one degree. When the duration of sunlight increased by one hour, the probability of producing mating calls increased by 4.366 and 2.624 times in L. intermedius and H. fuscata, respectively. Analysis of the interspecific effects revealed that when H.
fuscata produced a single mating call, the probability of producing mating calls increased by 14.620 and 2.784 times in L. intermedius and K. yezoensis, respectively. When K. yezoensis and L. intermedius produced mating calls, the probability of producing mating calls in H. fuscata increased by 11.301 and 2.474 times, respectively. L. intermedius and K. yezoensis did not have any effects on each other with respect to the production of mating calls. Analysis of the habitat environment of each species revealed that their habitats were located at altitudes of 1,046 m (780~1,315 m) for L. intermedius, 1,072 m (762~1,361 m) for K. yezoensis, and 976 m (686~1,245 m) for H. fuscata. Unlike H. fuscata, which was found at a low altitude, K. yezoensis and L. intermedius were not found at altitudes lower than 700 m. Analysis of the average aspect of the habitats of each of the cicada species revealed that L. intermedius was found at 166° (125~207°), K. yezoensis was found at 100° (72~128°), and H. fuscata was found at 173° (118~228°). Examination of the distribution of each of the cicada species revealed that they were predominantly distributed in the ridges and slopes located in the southeastern region of Munsubong in Taebaeksan. In summary, L. intermedius and K. yezoensis were found to inhabit higher altitudes in Taebaeksan National Park than H. fuscata, which was found throughout the Korean peninsula. Additionally, the main aspect of the cicada habitat was found to be the southeastern region (100~173°), which has good access to daylight.
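
Multipliers reported from a logistic regression are usually odds ratios rather than direct multipliers of probability. Assuming the figures above (for example 1.192 per degree for L. intermedius) are odds ratios, the sketch below shows how one translates into a change in calling probability. The baseline probability of 0.2 is a made-up value for illustration, and `apply_odds_ratio` is a hypothetical helper, not something from the paper.

```python
def apply_odds_ratio(baseline_probability, odds_ratio, units=1.0):
    """Return the probability after a predictor rises by `units`.

    The odds p / (1 - p) are multiplied by odds_ratio ** units and
    then converted back to a probability.
    """
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1.0 + new_odds)


baseline = 0.2            # hypothetical baseline calling probability
or_temperature = 1.192    # per-degree figure quoted in the abstract

one_degree = apply_odds_ratio(baseline, or_temperature)
two_degrees = apply_odds_ratio(baseline, or_temperature, units=2)

print(f"+1 degree: {one_degree:.3f}")
print(f"+2 degrees: {two_degrees:.3f}")
```

Under these assumptions the probability after a one-degree rise is about 0.230, a smaller relative change than the factor 1.192 applied to the probability itself would suggest; the two readings only coincide when the baseline probability is very small.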

A Study on the Revitalization of Tourism Industry through Big Data Analysis (한국관광 실태조사 빅 데이터 분석을 통한 관광산업 활성화 방안 연구)

  • Lee, Jungmi;Liu, Meina;Lim, Gyoo Gun
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.2
    • /
    • pp.149-169
    • /
    • 2018
  • Korea is currently accumulating a large amount of data in public institutions based on the public data open policy and "Government 3.0". In particular, a lot of data is accumulated in the tourism field. However, academic discussions utilizing tourism data are still limited. Moreover, the openness of data on restaurants, hotels, and online tourism information, and knowledge of how to use SNS big data in tourism, are still limited. Therefore, utilization of tourism big data analysis is still low. In this paper, we analyzed the factors influencing foreign tourists' satisfaction in Korea from numerical data using data mining techniques and R programming. We tried to find ways to revitalize the tourism industry by analyzing about 36,000 survey records from the "Survey on the actual situation of foreign tourists from 2013 to 2015" conducted by the Korea Culture & Tourism Research Institute. To do this, we analyzed the factors that have a high influence on the 'Satisfaction', 'Revisit intention', and 'Recommendation' variables of foreign tourists. Furthermore, we analyzed the practical influence of these variables. As the procedure of this study, we first integrated the survey data of foreign tourists conducted by the Korea Culture & Tourism Research Institute, which are stored in the tourist information system for 2013 to 2015, and eliminated variables in the integrated data that were inconsistent with the research purpose. Some variables were modified to improve the accuracy of the analysis. We then analyzed the factors affecting the dependent variables using the data-mining methods of IBM SPSS Modeler 16.0: decision trees (C5.0, CART, CHAID, QUEST), artificial neural networks, and logistic regression analysis. The seven variables that have the greatest effect on each dependent variable were derived.
The data analysis showed that the seven major variables influencing 'overall satisfaction' were sightseeing spot attraction, food satisfaction, accommodation satisfaction, traffic satisfaction, guide service satisfaction, number of visiting places, and country. The most influential of these were food satisfaction and sightseeing spot attraction. The seven variables that had the greatest influence on 'revisit intention' were the country, travel motivation, activity, food satisfaction, best activity, guide service satisfaction and sightseeing spot attraction. The most influential variables were food satisfaction and travel motivation for Korean style. Lastly, the seven variables that had the greatest influence on 'recommendation intention' were the country, sightseeing spot attraction, number of visiting places, food satisfaction, activity, tour guide service satisfaction and cost. Of these, the most influential were the country, sightseeing spot attraction, and food satisfaction. In addition, to examine the influence of each independent variable more closely, we used R programming. As a result, food satisfaction and sightseeing spot attraction were found to have a greater effect on overall satisfaction than the other influential variables. For revisit intention, travel motivated by the Korean Wave had a higher ${\beta}$ value than the other variables. A policy will be necessary that leads to substantial revisits by enhancing tourist attractions related to the Korean Wave. Lastly, recommendation showed the same result as satisfaction: sightseeing spot attraction and food satisfaction had higher ${\beta}$ values than the other variables.
From this analysis, we found that the 'food satisfaction' and 'sightseeing spot attraction' variables were the common factors influencing the three dependent variables mentioned above ('Overall satisfaction', 'Revisit intention' and 'Recommendation'), and that those factors significantly affected satisfaction with travel in Korea. The purpose of this study is to examine how to revitalize foreign tourism in Korea through big data analysis. The results are expected to serve as basic data for analyzing tourism data and establishing effective tourism policy, and as material for a revitalization plan that can contribute to tourism development in Korea in the future.

A Study on Farmer's Syndrome and Its Risk Factors of Vinyl House Worker and Farmer in a Rural Area (일부 농촌지역 비닐하우스 재배자들의 농부증 실태와 관련요인)

  • Lee, In-Bae;Lee, Yeon-Kyeong;Chang, Sung-Sil;Lee, Sok-Goo;Cho, Young-Che;Lee, Dong-Bae;Lee, Tae-Yong
    • Journal of agricultural medicine and community health
    • /
    • v.24 no.1
    • /
    • pp.13-33
    • /
    • 1999
  • The aim of this study was to investigate fatigue scores, physical complaints and farmer's syndrome among farmers, and to identify the risk factors of farmer's syndrome. The questionnaire survey was administered to 177 vinyl house workers and 213 farmers who lived in Chongyang gun of Chungnam province from February 24 to March 15, 1998. The main results were as follows. 1. The fatigue scores were not significantly different between vinyl house workers and farmers. The fatigue scores were higher in the female, lower education, shorter sleep hours (under 8 hours), nonsmoker, and nondrinker groups than in the other groups. There was no statistically significant relationship between the mean fatigue scores and age, eating habit or body mass index. Duration of farming years in vinyl house, farming area, and number of farming workers in the farmer's family showed a slight relationship with the fatigue score. 2. Health scores were not different between vinyl house workers and farmers. By health scores, the health state was poorer in the female, lower education, shorter sleep hours (under 8 hours), nonsmoker, and nondrinker groups than in the other groups. Health scores were not related to age, eating habit or body mass index. 3. The proportion of farmer's syndrome was 49.1% in vinyl house workers and 52.1% in farmers. It was higher in females than in males, and a higher proportion was found in the lower education group of both vinyl house workers and farmers. The proportion of farmer's syndrome was higher among smokers, alcohol drinkers and the overweight or underweight in vinyl house workers, but did not differ by these factors in farmers. 4. By multiple logistic regression, sex and sleep hours were risk factors for farmer's syndrome. The odds ratio for the female group was 2.53 (reference group: male) and that for the group sleeping over 8 hours was 0.74 (reference group: under 8 hours of sleep). 5.
The chief complaints by CMI were "I am difficult to work due to aching the back and the limbs", "I feel prickle pain in the limbs", "I sometimes have a twinge in the limbs", "I am not quite well as having a pain in the limbs", "I feel weaker grasping power than before" in both vinyl house workers and farmers. Vinyl house workers pointed out skin darkening, skin disease and hemorrhoids more frequently than farmers. 6. According to correspondence analysis, skin disease of vinyl house workers was related to vinyl house farmers; among vinyl house workers, digestive and general symptoms were associated with males, and endocrinological and muscular symptoms with females. It also revealed that farmer's syndrome was relatively highly related to females and farmers. From the above results, the fatigue scores, perceived health and farmer's syndrome did not differ much between the two groups, but aged female farmers deserve attention, since female farmers showed higher fatigue scores, more farmer's syndrome and poorer perceived health than male farmers, and farmer's syndrome increased with age. Also, mild but distinct symptoms that might be due to the working environment were observed, especially in vinyl house workers, and these should be considered and investigated continuously.

Health Concern, Health Practice and ADL of The Elderly Who Stay at Home in a Rural Community (농촌(農村) 재택노인(財宅老人)들의 건강관심도(健康關心度), 건강실천행위(健康實踐行爲)와 일상생활동작능력(日常生活動作能力))

  • Eom, Young-Hee;Kam, Sin;Han, Chang-Hyun;Cha, Byung-Jun;Kim, Sang-Soon
    • Journal of agricultural medicine and community health
    • /
    • v.24 no.2
    • /
    • pp.269-289
    • /
    • 1999
  • This study was conducted to examine the relationship among health concern, health practice and ADL of elderly staying at home in a rural community and their affecting factors. Data were collected through direct interviews made with 480 old people aged more than sixty-five from November 15, 1998 to December 20, 1998. Out of 189 male and 291 female, the high-level group that showed high health concern accounted for 44.4%, the medium-level group for 13.1%, and the low-level group for 42.5%, in the health practice, the high-level group accounted for 3.8%, the medium-level group for 18.8%, and the low-level group for 77.5%. In the self-rated health status, the high-level group accounted for 29.0%, the medium-level group for 31.0%, and the low-level group for 40.0%, and in the ADL, the high ADL group accounted for 91.5%, and the low-level ADL group for 8.5%. The result of the chi-square test showed that for male, there was a significant relation between the health concern and the health practice index score. In the relation between the health practice index score and the self-rated health status, there was significant positive relationship between health practice index and self-rated health status, and in the relation between the health practice Index score and the ADL, old people with higher health practices showed good ADL(but not significant). Old people with good ADL also showed good self-rated health status. In the multiple regression analysis where the health practice was used as a dependent variable, the health concern was added to the sociodemographic variables as an independent variables, a formula was formed for male old people only and ones with high concern in health showed good health practice. 
In the multiple logistic regression analysis where the sociodemographic variables plus the health practices were used as independent variables and the ADL as the dependent variable, the ADL appeared to be poorer for male old people if their living costs were borne by their sons and daughters, and for female old people as their ages increased, but it was good if old people had sources of health information such as hospitals or health centers. The self-rated health status was worse, for male old people, if they were short of living costs or had diseases, and for female old people, if they had spouses, living costs borne by their sons and daughters, or diseases; but it was better, for male old people, if they had periodical gatherings or carried out many health practices, and for female old people, if they had sources of health information such as hospitals or health centers or carried out many health practices. In view of the results stated above, the higher the old people's health concern, the more they carried out health practices, and the more they carried out health practices, the better their ADL and self-rated health status, which served as the measures of health level. Further, the better the ADL, the better the self-rated health status.

The Physical and Social Disability of Aged Persons Who Live Alone in Goksung Area (곡성지역(谷城地域) 독거노인(獨居老人)의 신체적(身體的) 사회적(社會的) 능력장애(能力障碍)에 관(關)한 조사(調査))

  • Kim, Shin-Woel;Kim, Young-Lak;Ryu, So-Yeon;Park, Jong;Kim, Ki-Soon;Kim, Yang-Ok
    • Journal of agricultural medicine and community health
    • /
    • v.24 no.2
    • /
    • pp.245-268
    • /
    • 1999
  • Older people need physical and social abilities to carry out their daily lives. This study aims to assess their degree of disability and related problems and to suggest solutions. It surveyed 87 people over 65 years old from September 1 to September 30, 1997. The findings are as follows. 1) The activities of daily living (ADL), used to assess the degree of physical disability, show that the average performance ability is 75.9% of all actions, while 24.1% of the old people need others' help. Physical ability declines with age, and the relationship is statistically significant (p<0.001). 2) Social disability varies widely, from 9.2% to 85.1%, across instrumental activities of daily living (IADL), intellectual ability and social role. 3) A simple analysis shows that the activities of daily living are statistically related to age (p<0.001), the use of the elder's hall (p<0.001), the degree of understanding of health (p<0.01) and so forth. 4) A simple analysis shows that the instrumental activities of daily living are statistically related to age (p<0.001), the degree of education (p<0.05), leisure life (p<0.001), the degree of understanding of health and so forth. 5) A multivariate logistic regression analysis shows that disability in daily living is related to age, visits to the elder's hall, and the period of solitary living; instrumental activities of daily living are related to age and visits to the elder's hall; and social role is related to visits to the elder's hall and the degree of education, while intellectual activity has no statistically related variables.

Association of Serum Copper and Zinc Levels with Liver Cirrhosis and Hepatocellular Carcinoma (간경변 및 간암과 혈청 구리와 아연농도와의 관련성)

  • Hyun, Myung-Soo;Suh, Suk-Kwon;Yoon, Nung-Ki;Lee, Jong-Young;Lee, Seoung-Hoon;Lee, Mu-Sik
    • Journal of Preventive Medicine and Public Health
    • /
    • v.25 no.2 s.38
    • /
    • pp.127-140
    • /
    • 1992
  • This study was done to identify the association of serum copper and zinc levels with cirrhosis and hepatocellular carcinoma (HCC), and to evaluate their diagnostic value in liver diseases. Sixty-three healthy persons, 60 patients with cirrhosis and 33 patients with hepatocellular carcinoma were randomly selected and investigated for their general characteristics from October 1990 to August 1991. For analysis of the biochemical markers in the liver function test and the serum copper and zinc levels, their fasting venous blood was sampled from 9:00 to 11:00 in the morning and centrifuged to separate the serum within one hour. All the samples were immediately analysed for biochemical markers and stored at $-20^{\circ}C$ in polypropylene tubes for further copper and zinc analysis. The mean serum copper level was $91.97{\pm}4.76{\mu}g/dl$ in control, $106.21{\pm}2.73{\mu}g/dl$ in cirrhosis and $127.05{\pm}0.77{\mu}g/dl$ in HCC. The value in HCC was statistically significantly higher than those in control and cirrhosis (p<0.05). Serum zinc levels were $110.82{\pm}7.24{\mu}g/dl$ in control, $68.10{\pm}5.43{\mu}g/dl$ in cirrhosis and $63.78{\pm}2.20{\mu}g/dl$ in HCC. The values in cirrhosis and HCC were statistically significantly lower than that in control (p<0.05). The Cu/Zn ratio was statistically significantly different among the three groups (p<0.05). Among the biochemical markers of liver function, total protein, albumin, ALP and total bilirubin were statistically significantly different among the three groups (p<0.05). Differences between cirrhosis and HCC for ALT and AST, and between control and HCC for direct bilirubin, were not statistically significant. The biochemical markers that correlated statistically significantly with serum copper and zinc levels and the Cu/Zn ratio (p<0.05) varied among the three groups.
In multiple logistic regression, the odds ratios of serum copper level and Cu/Zn ratio had no statistical significance for cirrhosis or HCC, but those of serum zinc were statistically significant at 0.951 and 0.952 (p<0.05). Serum copper and zinc levels and the Cu/Zn ratio were not statistically significantly different between cirrhosis and HCC. Albumin, ALP, zinc, total bilirubin and age were selected among all variables as the main variables for three-group discriminant analysis. The percentage of 'grouped' cases correctly classified by these five variables was 98.4 for control, 73.4 for cirrhosis, 75.7 for HCC and 84.0 for all subjects. This study suggests that zinc level may play a role as a diagnostic marker for hepatic disorders and be more useful than serum copper level and the Cu/Zn ratio in the diagnosis of liver diseases.
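
The overall classification rate quoted for the discriminant analysis can be cross-checked against the per-group rates. The sketch below assumes that all 63 controls, 60 cirrhosis patients and 33 HCC patients entered the discriminant analysis (the abstract does not state this explicitly) and computes the size-weighted average of the reported percentages.

```python
# Group sizes and per-group percentages quoted in the abstract.
group_sizes = {"control": 63, "cirrhosis": 60, "HCC": 33}
percent_correct = {"control": 98.4, "cirrhosis": 73.4, "HCC": 75.7}

# Weight each group's percentage by its number of subjects.
total_subjects = sum(group_sizes.values())
weighted_sum = sum(group_sizes[g] * percent_correct[g] for g in group_sizes)
overall_percent = weighted_sum / total_subjects

print(f"subjects: {total_subjects}")
print(f"overall correctly classified: {overall_percent:.1f}%")
```

Under that assumption the weighted average rounds to 84.0%, consistent with the figure reported for all subjects.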

A Study on the Prediction Model of Stock Price Index Trend based on GA-MSVM that Simultaneously Optimizes Feature and Instance Selection (입력변수 및 학습사례 선정을 동시에 최적화하는 GA-MSVM 기반 주가지수 추세 예측 모형에 관한 연구)

  • Lee, Jong-sik;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.4
    • /
    • pp.147-168
    • /
    • 2017
  • Accurate stock market forecasting has long been studied in academia, and there are now many forecasting models based on a variety of techniques. Recently, many attempts have been made to predict the stock index using various machine learning methods including Deep Learning. Although both fundamental analysis and technical analysis are used in traditional stock investment, technical analysis is more useful for short-term transaction prediction and for applying statistical and mathematical techniques. Most studies using technical indicators have modeled stock price prediction as a binary classification (rising or falling) of future market fluctuations (usually for the next trading day). However, this binary classification has many unfavorable aspects in predicting trends, identifying trading signals, or signaling portfolio rebalancing. In this study, we try to predict the stock index by expanding the existing binary scheme into a multiple classification of the stock index trend (upward trend, boxed, downward trend). To solve this multi-classification problem, rather than using techniques such as Multinomial Logistic Regression Analysis (MLOGIT), Multiple Discriminant Analysis (MDA) or Artificial Neural Networks (ANN), we use Multi-classification Support Vector Machines (MSVM), which have proved to be superior in prediction performance, and propose an optimization model that uses a Genetic Algorithm as a wrapper to improve the performance of this model. In particular, the proposed model, named GA-MSVM, is designed to maximize model performance by optimizing not only the kernel function parameters of MSVM, but also the selection of input variables (feature selection) and instance selection.
In order to verify the performance of the proposed model, we applied the proposed method to the real data. The results show that the proposed method is more effective than the conventional multivariate SVM, which has been known to show the best prediction performance up to now, as well as existing artificial intelligence / data mining techniques such as MDA, MLOGIT, CBR, and it is confirmed that the prediction performance is better than this. Especially, it has been confirmed that the 'instance selection' plays a very important role in predicting the stock index trend, and it is confirmed that the improvement effect of the model is more important than other factors. To verify the usefulness of GA-MSVM, we applied it to Korea's real KOSPI200 stock index trend forecast. Our research is primarily aimed at predicting trend segments to capture signal acquisition or short-term trend transition points. The experimental data set includes technical indicators such as the price and volatility index (2004 ~ 2017) and macroeconomic data (interest rate, exchange rate, S&P 500, etc.) of KOSPI200 stock index in Korea. Using a variety of statistical methods including one-way ANOVA and stepwise MDA, 15 indicators were selected as candidate independent variables. The dependent variable, trend classification, was classified into three states: 1 (upward trend), 0 (boxed), and -1 (downward trend). 70% of the total data for each class was used for training and the remaining 30% was used for verifying. To verify the performance of the proposed model, several comparative model experiments such as MDA, MLOGIT, CBR, ANN and MSVM were conducted. MSVM has adopted the One-Against-One (OAO) approach, which is known as the most accurate approach among the various MSVM approaches. Although there are some limitations, the final experimental results demonstrate that the proposed model, GA-MSVM, performs at a significantly higher level than all comparative models.
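
The One-Against-One (OAO) approach mentioned above trains one binary classifier per pair of classes and lets them vote. The sketch below illustrates only the voting step for the three trend states (1, 0, -1). It is not the authors' implementation: `pairwise_winner` is a hypothetical stand-in for trained binary SVMs, and the toy rule used here simply picks whichever class label is numerically closer to a scalar "trend score".

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(classes, pairwise_winner, x):
    """Predict a class by majority vote over all class pairs.

    pairwise_winner(a, b, x) must return either a or b.
    Ties are broken in favor of the class that received a vote first,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    votes = Counter(pairwise_winner(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def toy_pairwise_winner(a, b, x):
    # Hypothetical rule standing in for a trained binary SVM.
    return a if abs(x - a) <= abs(x - b) else b


trend_classes = [1, 0, -1]  # upward trend, boxed, downward trend
pair_count = len(list(combinations(trend_classes, 2)))
print("binary classifiers needed:", pair_count)

for score in (0.9, 0.1, -0.8):
    print(score, "->", one_against_one_predict(trend_classes, toy_pairwise_winner, score))
```

For k classes OAO needs k(k-1)/2 binary classifiers, so the three trend states require three.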

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review
    • /
    • v.16 no.3
    • /
    • pp.161-177
    • /
    • 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive because domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results of studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. 
In particular, the mutation operator helps keep GA from becoming trapped in local optima, so it can find a globally optimal or near-optimal solution. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques, including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, together with their credit ratings. Using various statistical methods, including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e., credit rating, was labeled as four classes: 1 (A1), 2 (A2), 3 (A3), and 4 (B and C). For each class, 80 percent of the data was used for training and the remaining 20 percent for validation. To overcome the small sample size, we applied five-fold cross validation to our dataset. To examine the competitiveness of the proposed model, we also tested several comparative models, including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that supports GA. The other comparative models were run using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models. 
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years since founding), X15 (accumulated earnings to total assets), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same across the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
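
The McNemar test mentioned above compares two classifiers on the same validation cases by looking only at the cases where exactly one of them is correct. The sketch below illustrates the idea. It assumes the continuity-corrected chi-square form of the test (the abstract does not say which variant was used), and the function name `mcnemar` and its arguments are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mcnemar(y_true: Sequence[int], pred_a: Sequence[int],
            pred_b: Sequence[int]) -> Tuple[int, int, float, float]:
    """Return (b, c, statistic, p_value) for two models' predictions.

    b counts cases model A classifies correctly and model B does not;
    c counts the reverse. Cases where both agree in correctness are ignored.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return b, c, 0.0, 1.0  # the models never disagree in correctness
    # Continuity-corrected statistic (assumed variant), clamped at zero.
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return b, c, stat, p_value
```

A p-value below 0.05 or 0.01 would correspond to the 5% and 1% significance levels reported in the abstract. With few discordant cases, an exact binomial version of the test is often preferred over this chi-square approximation.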