• Title/Summary/Keyword: basic vector

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment is a complicated process that takes into account many factors describing a company. Such assessment is known to be very expensive because domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural networks (ANN), and multiclass support vector machines (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results of studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. 
In particular, the mutation operator helps prevent GA from falling into local optima, so a globally optimal or near-optimal solution can be found. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, together with their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e. credit rating, was labeled with four classes: 1 (A1); 2 (A2); 3 (A3); 4 (B and C). For each class, 80 percent of the data was used for training and the remaining 20 percent for validation. To overcome the small sample size, we applied five-fold cross-validation to our dataset. In order to examine the competitiveness of the proposed model, we also tested several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that provides GA. The other comparative models were tested using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the competing models. 
In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same across the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
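
The McNemar comparison mentioned at the end of this abstract can be illustrated with a minimal sketch. The disagreement counts below are hypothetical and are not the paper's data, and the continuity-corrected chi-square form is only one common variant; the abstract does not say which variant the authors used.

```python
import math

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar test for two classifiers on the same cases.

    b: cases model A classified correctly and model B misclassified
    c: cases model A misclassified and model B classified correctly
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical disagreement counts between two models on a validation set.
stat, p_value = mcnemar(b=30, c=12)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

Only the discordant pairs (b and c) enter the statistic, which is why the test suits paired comparisons of models evaluated on identical cases.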

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, it has been employed to discover new market and/or technology opportunities and support rational decision making of business participants. The market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been a continuous demand in various fields for specific product level-market information. However, the information has been generally provided at industry level or broad categories based on classification standards, making it difficult to obtain specific and proper information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than that of previously offered. We applied Word2Vec algorithm, a neural network based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information is collected, refined, and restructured into suitable form for applying Word2Vec model. Next, the preprocessed data is embedded into vector space by Word2Vec and then the product groups are derived by extracting similar products names based on cosine similarity calculation. Finally, the sales data on the extracted products is summated to estimate the market size of the product groups. As an experimental data, text data of product names from Statistics Korea's microdata (345,103 cases) were mapped in multidimensional vector space by Word2Vec training. 
We performed parameters optimization for training and then applied vector dimension of 300 and window size of 15 as optimized parameters for further experiments. We employed index words of Korean Standard Industry Classification (KSIC) as a product name dataset to more efficiently cluster product groups. The product names which are similar to KSIC indexes were extracted based on cosine similarity. The market size of extracted products as one product category was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For the performance verification, the results were compared with actual market size of some items. The Pearson's correlation coefficient was 0.513. Our approach has several advantages differing from the previous studies. First, text mining and machine learning techniques were applied for the first time on market size estimation, overcoming the limitations of traditional sampling based- or multiple assumption required-methods. In addition, the level of market category can be easily and efficiently adjusted according to the purpose of information use by changing cosine similarity threshold. Furthermore, it has a high potential of practical applications since it can resolve unmet needs for detailed market size information in public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support program conducted by governmental institutions, as well as business strategies consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic-based word embedding module can be advanced by giving a proper order in the preprocessed dataset or by combining another algorithm such as Jaccard similarity with Word2Vec. 
Also, the product group clustering method could be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on follow-up studies, which we expect will further improve the performance of the basic model conceptually proposed in this study.
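
The bottom-up step described in this abstract (extract product names whose vectors are close to an index word, then sum their sales) can be sketched as follows. This is a minimal illustration under stated assumptions: the 3-dimensional vectors stand in for trained 300-dimensional Word2Vec embeddings, and the product names, sales figures, threshold, and the helper `market_size` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def market_size(index_vec, products, threshold):
    """Group products similar to an index word and sum their sales.

    products: list of (name, embedding_vector, sales) tuples.
    Returns (names in the product group, estimated market size).
    """
    members = [(name, sales) for name, vec, sales in products
               if cosine(index_vec, vec) >= threshold]
    return [name for name, _ in members], sum(sales for _, sales in members)

# Toy vectors; real embeddings would come from a trained Word2Vec model.
products = [
    ("steel pipe", [0.9, 0.1, 0.0], 120.0),
    ("steel tube", [0.8, 0.2, 0.1], 80.0),
    ("rice cake", [0.0, 0.1, 0.9], 50.0),
]
index_vec = [1.0, 0.0, 0.0]  # hypothetical vector for one classification index word
names, size = market_size(index_vec, products, threshold=0.7)
print(names, size)
```

Raising or lowering `threshold` widens or narrows the product group, which corresponds to the adjustable market category level the abstract describes.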

Classification of the Korean Local Pearl Barley(Coix larcryma L.) by the Morphological Characters (재래종(在來種) 율무(의이인(薏苡仁))의 형태적(形態的) 특성(特性)에 의한 분류(分類))

  • Kim, Bo Kyeong;Choe, Bong Ho
    • Korean Journal of Agricultural Science / v.13 no.1 / pp.17-32 / 1986
  • To obtain basic information needed for developing better pearl barley varieties, a total of 148 lines of pearl barley were collected from nationwide survey except for Kangwon and Chejoo provinces and classified by principal component analysis. The results are summarized as follows : 1. Variabilities of characters for all lines except for leaf width and 100 K. Wt.(Unpolished) were high enough to indicate variation of lines. 2. Correlation coefficients among 18 characters were high enough and they showed the shape of normal distribution, more or less, inclined toward positive values. 3. The lines could be classified into four groups by correlation coefficient for 18 characters : Group I was characterized as the lines composed of grain and plant type, Group II maturity, Group III the number of tillers, and Group IV the nature of germination, respectively. 4. About 60% of the total variation could be appreciated by the first four principal components and about 89% of the total variation by the first ten principal components. 5. Contribution of characters to principal components was variable and was high at upper principal components and low at lower principal components. 6. The value of eigen vector corresponding to those which had high significant correlation coefficient between characters was almost of the same value. 7. The lines were classified into four groups by principal component analysis. 8. The lines were also classified into four groups by taxonomic distance. Group I included 79 lines, Group II 40 lines, Group III 22 lines, and Group IV 7 lines, respectively. 9. Four groups classified by taxonomic distance could be characterized as follow : Group I : medium height plant, small kernels, medium maturity, and narrow and short leaf, Group II : short height plant, small kernels, early maturity, and narrow and short leaf. Group III : tall height plant, large kernels, late maturity, and broad and long leaf. 
Group IV : short height plant, large kernels, medium maturity, and narrow and short leaf.

Estimation of the Lowest and Highest Astronomical Tides along the west and south coast of Korea from 1999 to 2017 (서해안과 남해안에서 1999년부터 2017년까지 최저와 최고 천문조위 계산)

  • BYUN, DO-SEONG;CHOI, BYOUNG-JU;KIM, HYOWON
    • The Sea:JOURNAL OF THE KOREAN SOCIETY OF OCEANOGRAPHY / v.24 no.4 / pp.495-508 / 2019
  • Tidal datums are key and basic information used in fields of navigation, coastal structures' design, maritime boundary delimitation and inundation warning. In Korea, the Approximate Lowest Low Water (ALLW) and the Approximate Highest High Water (AHHW) have been used as levels of tidal datums for depth, coastline and vertical clearances in hydrography and coastal engineering fields. However, recently the major maritime countries including USA, Australia and UK have adopted the Lowest Astronomical Tide (LAT) and the Highest Astronomical Tide (HAT) as the tidal datums. In this study, 1-hr interval 19-year sea level records (1999-2017) observed at 9 tidal observation stations along the west and south coasts of Korea were used to calculate LAT and HAT for each station using 1-minute interval 19-year tidal prediction data yielded through three tidal harmonic methods: 19 year vector average of tidal harmonic constants (Vector Average Method, VA), tidal harmonic analysis on 19 years of continuous data (19-year Method, 19Y) and tidal harmonic analysis on one year of data (1-year Method, 1Y). The calculated LAT and HAT values were quantitatively compared with the ALLW and AHHW values, respectively. The main causes of the difference between them were explored. In this study, we used the UTide, which is capable of conducting 19-year record tidal harmonic analysis and 19 year tidal prediction. Application of the three harmonic methods showed that there were relatively small differences (mostly less than ±1 cm) of the values of LAT and HAT calculated from the VA and 19Y methods, revealing that each method can be mutually and effectively used. In contrast, the standard deviations between LATs and HATs calculated from the 1Y and 19Y methods were 3~7 cm. The LAT (HAT) differences between the 1Y and 19Y methods range from -16.4 to 10.7 cm (-8.2 to 14.3 cm), which are relatively large compared to the LAT and HAT differences between the VA and 19Y methods. 
The LAT (HAT) values are, on average, 33.6 (46.2) cm lower (higher) than those of ALLW (AHHW) along the west and south coast of Korea. It was found that the Sa and N2 tides significantly contribute to these differences. In the shallow water constituents dominated area, the M4 and MS4 tides also remarkably contribute to them. Differences between the LAT and the ALLW are larger than those between the HAT and the AHHW. The asymmetry occurs because the LAT and HAT are calculated from the amplitudes and phase-lags of 67 harmonic constituents whereas the ALLW and AHHW are based only on the amplitudes of the 4 major harmonic constituents.
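
The Vector Average idea and the LAT/HAT search described in this abstract can be sketched in simplified form. This is an illustration only, under several assumptions: the yearly constants are hypothetical, only two constituents (M2 and S2) are used instead of 67, nodal corrections are ignored, the prediction span is 30 days rather than 19 years, and the study itself used UTide rather than hand-written code.

```python
import cmath
import math

def vector_average(yearly):
    """Average yearly (amplitude, phase_lag_deg) pairs as complex vectors."""
    z = sum(cmath.rect(a, math.radians(g)) for a, g in yearly) / len(yearly)
    return abs(z), math.degrees(cmath.phase(z)) % 360.0

def predict(t_hours, constituents):
    """Tide height at time t from (amplitude, speed_deg_per_hour, phase_lag_deg)."""
    return sum(a * math.cos(math.radians(w * t_hours - g))
               for a, w, g in constituents)

def lat_hat(constituents, days, step_minutes=1):
    """Lowest and highest predicted level over the span, sampled every step."""
    steps = int(days * 24 * 60 / step_minutes)
    heights = [predict(i * step_minutes / 60.0, constituents)
               for i in range(steps + 1)]
    return min(heights), max(heights)

# Hypothetical yearly M2 constants (amplitude in m, phase lag in degrees).
m2_amp, m2_lag = vector_average([(2.0, 100.0), (2.0, 120.0)])
# M2 speed is 28.9841042 deg/h and S2 speed is 30 deg/h.
constituents = [(m2_amp, 28.9841042, m2_lag), (0.8, 30.0, 0.0)]
lat, hat = lat_hat(constituents, days=30)
print(f"M2: {m2_amp:.3f} m, {m2_lag:.1f} deg; LAT = {lat:.3f} m, HAT = {hat:.3f} m")
```

Averaging in the complex plane rather than averaging amplitudes and phases separately keeps the result well defined when phase lags straddle the 0/360 degree boundary.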

Estimation of Water Quality Index for Coastal Areas in Korea Using GOCI Satellite Data Based on Machine Learning Approaches (GOCI 위성영상과 기계학습을 이용한 한반도 연안 수질평가지수 추정)

  • Jang, Eunna;Im, Jungho;Ha, Sunghyun;Lee, Sanggyun;Park, Young-Gyu
    • Korean Journal of Remote Sensing / v.32 no.3 / pp.221-234 / 2016
  • In Korea, most industrial parks and major cities are located in coastal areas, which results in serious environmental problems on both coastal land and in the ocean. To manage such problems effectively, especially in the coastal ocean, water quality should be monitored. As many factors influence water quality, the Korean Government proposed an integrated Water Quality Index (WQI) based on in situ measurements of ocean parameters (bottom dissolved oxygen, chlorophyll-a concentration, Secchi disk depth, dissolved inorganic nitrogen, and dissolved inorganic phosphorus) for each ocean division identified by its ecological characteristics. Field-measured WQI, however, does not provide spatial continuity over vast areas. Satellite remote sensing can be an alternative for identifying WQI for surface water. In this study, two schemes were examined to estimate coastal WQI around the Korean Peninsula using in situ measurement data and Geostationary Ocean Color Imager (GOCI) satellite imagery from 2011 to 2013 based on machine learning approaches. Scheme 1 calculates WQI from water quality-related factors estimated using GOCI reflectance data, and scheme 2 estimates WQI directly using GOCI band reflectance data and basic products (chlorophyll-a, suspended sediment, colored dissolved organic matter). Three machine learning approaches, Random Forest (RF), Support Vector Regression (SVR), and a modified regression tree (Cubist), were used. Results show that the estimation of Secchi disk depth produced the highest accuracy among the ocean parameters, and RF performed best regardless of the water quality-related factor. However, the accuracy of WQI from scheme 1 was lower than that from scheme 2 because of the estimation errors inherent in the water quality-related factors and the uncertainty of bottom dissolved oxygen. 
Overall, scheme 2 appears more appropriate for estimating WQI for surface water in coastal areas, and chlorophyll-a concentration was identified as the factor contributing most to the estimation of WQI.

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.221-241 / 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. The importance of information classification is also increasing for the efficient management of digital information, which is being produced at an exponential rate. In this study, we sought to automatically classify and provide tailored information that can help companies make decisions on technology commercialization. To this end, we propose a method for classifying information based on the Korea Standard Industry Classification (KSIC), which indicates the business characteristics of enterprises. The classification of information or documents has largely relied on machine learning, but there is not enough training data categorized on the basis of KSIC. Therefore, this study applied a method of calculating similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory texts of each KSIC code and calculating their similarity with the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was then verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The verification showed that the highest agreement was obtained when the LT method, a kind of TF-IDF calculation formula, was applied. In this case, the first-ranked KSIC code matched in 53% of cases, and the cumulative match rate through the fifth rank was 76%. These results confirm that the technology, industry, and market information that SMEs need can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support experts' qualitative judgment when creating a linkage table between heterogeneous classification systems.
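
The vector space approach described in this abstract can be sketched as ranking code descriptions by cosine similarity to a document. This is a minimal illustration, not the paper's implementation: the weighting below is a generic log-scaled TF multiplied by IDF, which is an assumption and may differ from the LT formula the authors used, and the code descriptions are invented placeholders rather than real KSIC text.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using (1 + ln tf) * ln(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    return [{t: (1 + math.log(c)) * math.log(n / df[t])
             for t, c in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_codes(code_texts, document):
    """Rank classification codes by similarity of their description to a document."""
    codes = list(code_texts)
    # The document is included in the corpus so that IDF covers its terms too.
    vecs = tfidf_vectors([code_texts[c] for c in codes] + [document])
    doc_vec = vecs[-1]
    scores = [(c, cosine(v, doc_vec)) for c, v in zip(codes, vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical code descriptions (placeholders, not real KSIC explanatory text).
code_texts = {
    "A": "manufacture of steel pipes and tubes",
    "B": "software development and publishing",
}
ranking = rank_codes(code_texts, "method for welding steel pipes")
print(ranking)
```

Returning the full ranking rather than only the top code mirrors the abstract's evaluation, which reports both first-rank and cumulative fifth-rank agreement.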

Protoplast Formation, Regeneration and Reversion in Pleurotus ostreatus and P. sajor-caju (느타리버섯과 여름느타리버섯의 원형질체(原形質體) 나출(裸出)과 재생(再生))

  • Go, Seung-Joo;Shin, Gwan-Chull;Yoo, Young-Bok
    • The Korean Journal of Mycology / v.13 no.3 / pp.169-177 / 1985
  • These studies were carried out to obtain basic data for maximizing protoplast yields from the mycelia of P. ostreatus and P. sajor-caju. Some factors affecting the regeneration of the protoplasts of both species and the productivity of their reversion were also examined. The maximum yields of protoplasts were obtained from mycelia of both species cultured for four days on cellophane membrane placed on the surface of PSA or MCM media in a petri dish. The optimal concentration of the lytic enzyme Novozym 234 for protoplast release was 5 mg per ml of 0.5 M phosphate buffer solution with 0.6 M sucrose or 0.6 M MgSO4 at pH 6.0. The greatest number of protoplasts was released after 3 hours of incubation in the lytic enzyme solution for the mycelia of P. ostreatus and after 4 hours for P. sajor-caju. Among the osmotic stabilizer solutions tested, 0.6 M sucrose and 0.6 M KCl gave the best regeneration rates for the protoplasts of both species. When 0.75% agar solution was overlaid on the regeneration media immediately after inoculation of the protoplasts, the regeneration rates were greatly enhanced. Ampicillin added to the agar solution prevented bacterial contamination. The reverted isolates produced sporophores and basidiospores just like their parents, without any mutations, when they were cultivated in wide-mouth bottles with sawdust substrates.

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods to handle big data in text mining. For dimensionality reduction, we should consider the density of data, which has a significant influence on the performance of sentence classification. It requires lots of computations for data of higher dimensions. Eventually, it can cause lots of computational cost and overfitting in the model. Thus, the dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed from only lessening the noise of data like misspelling or informal text to including semantic and syntactic information. On top of it, the expression and selection of the text features have impacts on the performance of the classifier for sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find latent space that is representative of raw data from observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, learning low-dimensional vector space representations of words, that can capture semantic and syntactic information from data are also utilized. For improving performance, recent studies have suggested methods that the word dictionary is modified according to the positive and negative score of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm selects the words that are not important, we thought the words that are similar to the selected words also have no impacts on sentence classification. This study proposes two ways to achieve more accurate classification that conduct selective word elimination under specific regulations and construct word embedding based on Word2Vec embedding. 
To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. In the first method, we eliminate words that have comparatively low information gain values from the raw text and then form the word embedding. In the second method, we additionally eliminate words that are similar to the words with low information gain values and then form the word embedding. Finally, the filtered text and word embeddings are fed to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes was over 70% were classified as helpful reviews. Yelp, however, shows only the number of helpful votes, so we extracted 100,000 reviews that received more than five helpful votes by random sampling from among 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words. We showed that one of the proposed methods performs better than the embeddings with all the words. Removing unimportant words improves performance, but removing too many words lowers it. Future research should consider diverse preprocessing methods and an in-depth analysis of word co-occurrence to measure similarity values among words. Also, we applied the proposed methods only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo can be combined with the proposed methods, making it possible to identify effective combinations of word embedding methods and elimination methods.
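
The two elimination steps described in this abstract can be sketched as follows: compute information gain per word, mark low-gain words for removal, then also remove words whose embeddings are close to a marked word. This is a minimal illustration under stated assumptions; the documents, labels, 2-dimensional embeddings, thresholds, and the helper names are hypothetical, and the paper's actual selection rules may differ.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(word, docs, labels):
    """Reduction in label entropy from knowing whether a document contains word."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    return (entropy(labels) - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def words_to_remove(docs, labels, embeddings, ig_threshold, sim_threshold):
    """Low-gain words plus any word whose embedding is close to a low-gain word."""
    low = {w for w in embeddings if information_gain(w, docs, labels) < ig_threshold}
    similar = {w for w in embeddings if w not in low
               and any(cosine(embeddings[w], embeddings[r]) >= sim_threshold for r in low)}
    return low, low | similar

# Toy data: each document is a set of words; labels are 1 (helpful) or 0.
docs = [{"great", "the"}, {"bad", "the"}, {"great", "the", "an"}, {"bad", "the"}]
labels = [1, 0, 1, 0]
# Hypothetical 2-d embeddings standing in for trained Word2Vec vectors.
embeddings = {"the": [1.0, 0.0], "an": [0.9, 0.1], "great": [0.0, 1.0], "bad": [0.1, -1.0]}
low, removed = words_to_remove(docs, labels, embeddings, ig_threshold=0.1, sim_threshold=0.9)
print(sorted(low), sorted(removed))
```

Here "an" has a non-trivial information gain on this tiny sample but is removed anyway because its vector is close to "the", which is the behaviour of the second method.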

Development of GIS Application using Web-based CAD (Web기반 CAD를 이용한 지리정보시스템 구현)

  • Kim, Han-Su;Im, Jun-Hong;Kim, Jae-Deuk;Shin, So-Eun
    • Journal of the Korean Association of Geographic Information Studies / v.3 no.3 / pp.69-76 / 2000
  • This study describes the development of a GIS application using web-based CAD that offers users, designers, and managers more convenient and more varied functions. To develop the application, attribute data were collected through fieldwork and geographic data were taken from cadastral maps and aerial survey maps, and the user interface was then built with HTML, JavaScript, ASP, and the Whip ActiveX control. The application has the following characteristics. First, for the system designer, it was designed so that anyone with basic knowledge of the web and CAD can develop it. The system structure is simplified to two tiers; geographic information is stored in DWF (Drawing Web Format) files and attribute information in a DBMS, with extensibility in mind. Second, the system manager can provide GIS services on the web independently, without a high-priced GIS engine, which makes the system more economical. Third, internet users are provided with GIS information and functions such as information search, zoom in/out, pan, and print, and further functions can be added without difficulty. Because the application uses vector data, excluding character and image data, it saves storage volume and runs fast. The application can be used not only for commercial and travel information services but also for various web GIS services of public institutions and the private sector.

Gnawing and Escaping Behaviors of Monochamus alternatus (Coleoptera: Cerambycidae) in a Confined Environment: Suggesting a Bioassay Method of Netting for Adult Escape Prevention (인위적 구속환경에서 솔수염하늘소의 쏠기와 탈출행동: 성충탈출 방지용 그물망의 생물검정법 제안)

  • Ko, Gyeong hun;Kim, Dong-Soon
    • Korean journal of applied entomology / v.56 no.2 / pp.187-193 / 2017
  • The Japanese pine sawyer, Monochamus alternatus Hope, is a representative vector of the pine wood nematode, Bursaphelenchus xylophilus, which causes wilting symptoms in pine trees. A control method using a net has been introduced as an alternative to fumigation for treating pine trees killed by pine wilt disease. This study was carried out to investigate the factors that induce gnawing and escaping behaviors of M. alternatus. The behaviors were examined after M. alternatus adults were placed in a confined space at different temperatures. M. alternatus adults could escape through mesh net torn by gnawing when they were confined in a space of 30 mm or less in diameter. The success rate of escape was high at 20 to 30°C, and no adults escaped at 15°C. Enticing M. alternatus adults with food did not affect the success rate of escape. When the adults were not confined in a narrow space, no escape hole was formed because the gnawing was not concentrated on one spot. In a narrow space, M. alternatus moved its body using the tarsi of its middle and hind legs, made an escape hole by gnawing intensively with its mandibles at the obstacle in front of it, and got out while supporting its body with its front legs. The present results can serve as important basic information for evaluating the performance of mesh nets that confine M. alternatus adults, which have been suggested as an alternative to fumigation.