• Title/Summary/Keyword: svm

Data-centric XAI-driven Data Imputation of Molecular Structure and QSAR Model for Toxicity Prediction of 3D Printing Chemicals (3D 프린팅 소재 화학물질의 독성 예측을 위한 Data-centric XAI 기반 분자 구조 Data Imputation과 QSAR 모델 개발)

  • ChanHyeok Jeong;SangYoun Kim;SungKu Heo;Shahzeb Tariq;MinHyeok Shin;ChangKyoo Yoo
    • Korean Chemical Engineering Research / v.61 no.4 / pp.523-541 / 2023
  • As 3D printers become more accessible, exposure to chemicals associated with 3D printing is becoming more frequent. However, research on the toxicity and harmfulness of chemicals generated by 3D printing is insufficient, and the performance of toxicity prediction using in silico techniques is limited by missing molecular structure data. In this study, a quantitative structure-activity relationship (QSAR) model based on a data-centric AI approach was developed to predict the toxicity of new 3D printing materials by imputing missing values in molecular descriptors. First, the MissForest algorithm was used to impute missing values in the molecular descriptors of hazardous 3D printing materials. Then, machine learning (ML)-based QSAR models were developed with four different algorithms (decision tree, random forest, XGBoost, SVM) to predict the bioconcentration factor (Log BCF), octanol-air partition coefficient (Log Koa), and partition coefficient (Log P). Furthermore, the reliability of the data-centric QSAR model was validated with the Tree-SHAP (SHapley Additive exPlanations) method, an explainable artificial intelligence (XAI) technique. The proposed MissForest-based imputation yielded approximately 2.5 times more molecular structure data than the existing dataset. Based on the imputed molecular descriptor dataset, the developed data-centric QSAR model achieved a prediction performance of approximately 73%, 76%, and 92% for Log BCF, Log Koa, and Log P, respectively. Lastly, Tree-SHAP analysis demonstrated that the data-centric QSAR model achieved high prediction performance for toxicity information by identifying key molecular descriptors highly correlated with the toxicity indices. Therefore, the proposed QSAR model based on the data-centric XAI approach can be extended to predict the toxicity of potential pollutants in emerging printing chemicals, chemical processes, and semiconductor or display processes.

Study on data preprocessing methods for considering snow accumulation and snow melt in dam inflow prediction using machine learning & deep learning models (머신러닝&딥러닝 모델을 활용한 댐 일유입량 예측시 융적설을 고려하기 위한 데이터 전처리에 대한 방법 연구)

  • Jo, Youngsik;Jung, Kwansue
    • Journal of Korea Water Resources Association / v.57 no.1 / pp.35-44 / 2024
  • Research on dam inflow prediction has actively explored data-driven machine learning and deep learning (ML&DL) tools across diverse domains. Precise dam inflow prediction depends not only on improving inherent model performance but also on accounting for model characteristics and on preprocessing the data. In particular, existing rainfall data, derived from snowfall amounts through heating facilities, distort the correlation between snow accumulation and rainfall, especially in dam basins influenced by snow accumulation, such as Soyang Dam. This study focuses on the preprocessing of rainfall data needed to apply ML&DL models to dam inflow prediction in basins affected by snow accumulation. Such preprocessing is vital for capturing physical phenomena such as reduced outflow during winter due to low snowfall and increased outflow during spring despite minimal or no rain. Three machine learning models (SVM, RF, LGBM) and two deep learning models (LSTM, TCN) were built by combining rainfall and inflow series. With optimal hyperparameter tuning, the appropriate model was selected, resulting in a high level of predictive performance with NSE ranging from 0.842 to 0.894. Moreover, a simulated snow accumulation algorithm was developed to generate rainfall correction data that account for snow accumulation. Applying this correction to the machine learning and deep learning models yielded NSE values ranging from 0.841 to 0.896, a level of predictive performance similar to that obtained before the snow accumulation correction was applied. Notably, during the snow accumulation period, adjusting rainfall in the training phase led to predictions that simulated the observed inflow more accurately. This underscores the importance of thoughtful data preprocessing that takes into account physical factors such as snowfall and snowmelt when constructing data-driven models.
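
The NSE values quoted in this abstract refer to the Nash-Sutcliffe efficiency, which compares the squared error of the simulated inflow with the variance of the observed inflow. The following is a minimal sketch of that standard formula; the function name and the sample numbers are illustrative and are not taken from the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """Return NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the simulation against the observations.
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Spread of the observations around their own mean.
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1.0 - residual / variance


# Illustrative daily inflow values (made up for this example).
observed_inflow = [10.0, 12.0, 30.0, 25.0, 14.0]
simulated_inflow = [11.0, 12.5, 27.0, 24.0, 15.0]
print(nash_sutcliffe_efficiency(observed_inflow, simulated_inflow))
```

An NSE of 1 means a perfect match, and an NSE of 0 means the model is no better than always predicting the observed mean.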

Development of Predictive Models for Rights Issues Using Financial Analysis Indices and Decision Tree Technique (경영분석지표와 의사결정나무기법을 이용한 유상증자 예측모형 개발)

  • Kim, Myeong-Kyun;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.18 no.4 / pp.59-77 / 2012
  • This study focuses on predicting which firms will increase capital by issuing new stocks in the near future. Many stakeholders, including banks, credit rating agencies and investors, perform a variety of analyses of firms' growth, profitability, stability, activity, productivity, etc., and regularly report the firms' financial analysis indices. In this paper, we develop predictive models for rights issues using these financial analysis indices and data mining techniques. This study approaches the building of the predictive models from two different perspectives. The first is the analysis period. We divide the analysis period into before and after the IMF financial crisis, and examine whether there is a difference between the two periods. The second is the prediction time. In order to predict when firms increase capital by issuing new stocks, the prediction time is categorized as one year, two years and three years later. Therefore, a total of six prediction models are developed and analyzed. In this paper, we employ the decision tree technique to build the prediction models for rights issues. The decision tree is the most widely used prediction method, which builds decision trees to label or categorize cases into a set of known classes. In contrast to neural networks, logistic regression and SVM, decision tree techniques are well suited for high-dimensional applications and have strong explanation capabilities. There are well-known decision tree induction algorithms such as CHAID, CART, QUEST, C5.0, etc. Among them, we use the C5.0 algorithm, which is the most recently developed and yields better performance than the other algorithms. We obtained data for the rights issues and financial analysis from TS2000 of the Korea Listed Companies Association. A record of financial analysis data consists of 89 variables, which include 9 growth indices, 30 profitability indices, 23 stability indices, 6 activity indices and 8 productivity indices. For model building and testing, we used 10,925 financial analysis records of a total of 658 listed firms. PASW Modeler 13 was used to build C5.0 decision trees for the six prediction models. A total of 84 variables among the financial analysis data are selected as the input variables of each model, and the rights issue status (issued or not issued) is defined as the output variable. To develop prediction models using the C5.0 node (Node Options: Output type = Rule set, Use boosting = false, Cross-validate = false, Mode = Simple, Favor = Generality), we used 60% of the data for model building and 40% for model testing. The results of the experimental analysis show that the prediction accuracies for data after the IMF financial crisis (68.78% to 71.41%) are about 10 percentage points higher than those before the IMF financial crisis (59.04% to 60.43%). These results indicate that since the IMF financial crisis, the reliability of financial analysis indices has increased and firms' intentions regarding rights issues have become more obvious. The experimental results also show that the stability-related indices have a major impact on conducting rights issues in the case of short-term prediction. On the other hand, the long-term prediction of conducting rights issues is affected by financial analysis indices on profitability, stability, activity and productivity. All the prediction models include the industry code as one of the significant variables. This means that companies in different types of industries show different patterns of rights issues. We conclude that it is desirable for stakeholders to take into account stability-related indices for short-term prediction and a wider variety of financial analysis indices for long-term prediction. The current study has several limitations. First, we need to compare the differences in accuracy obtained with different data mining techniques such as neural networks, logistic regression and SVM. Second, we need to develop and evaluate new prediction models that include variables which research on capital structure theory has identified as relevant to rights issues.

A Time Series Graph based Convolutional Neural Network Model for Effective Input Variable Pattern Learning : Application to the Prediction of Stock Market (효과적인 입력변수 패턴 학습을 위한 시계열 그래프 기반 합성곱 신경망 모형: 주식시장 예측에의 응용)

  • Lee, Mo-Se;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.167-181 / 2018
  • Over the past decade, deep learning has been in the spotlight among various machine learning algorithms. In particular, the CNN (Convolutional Neural Network), which is known as an effective solution for recognizing and classifying images or voices, has been widely applied to classification and prediction problems. In this study, we investigate how to apply CNN to business problem solving. Specifically, this study proposes to apply CNN to stock market prediction, one of the most challenging tasks in machine learning research. As mentioned, CNN is strong at interpreting images. Thus, the model proposed in this study adopts CNN as a binary classifier that predicts stock market direction (upward or downward) by using time series graphs as its inputs. That is, our proposal is to build a machine learning algorithm that mimics experts called 'technical analysts', who examine the graph of past price movements and predict future financial price movements. Our proposed model, named 'CNN-FG (Convolutional Neural Network using Fluctuation Graph)', consists of five steps. In the first step, it divides the dataset into intervals of 5 days. In step 2, it creates time series graphs for the divided dataset. The size of the image in which the graph is drawn is $40 \times 40$ pixels, and the graph of each independent variable is drawn in a different color. In step 3, the model converts the images into matrices. Each image is converted into a combination of three matrices in order to express the color values on the R (red), G (green), and B (blue) scales. In the next step, it splits the dataset of graph images into training and validation datasets. We used 80% of the total dataset as the training dataset, and the remaining 20% as the validation dataset. In the final step, CNN classifiers are trained using the images of the training dataset. Regarding the parameters of CNN-FG, we adopted two convolution filters ($5 \times 5 \times 6$ and $5 \times 5 \times 9$) in the convolution layer. In the pooling layer, a $2 \times 2$ max pooling filter was used. The numbers of nodes in the two hidden layers were set to 900 and 32, respectively, and the number of nodes in the output layer was set to 2 (one for the prediction of an upward trend, and the other for a downward trend). The activation functions for the convolution layer and the hidden layers were set to ReLU (Rectified Linear Unit), and the one for the output layer was set to the Softmax function. To validate CNN-FG, we applied it to the prediction of KOSPI200 for 2,026 days in eight years (from 2009 to 2016). To match the proportions of the two groups in the dependent variable (i.e. tomorrow's stock market movement), we selected 1,950 samples by applying random sampling. Finally, we built the training dataset using 80% of the total dataset (1,560 samples), and the validation dataset using 20% (390 samples). The independent variables of the experimental dataset included twelve technical indicators that have been widely used in previous studies. They include Stochastic %K, Stochastic %D, Momentum, ROC (rate of change), LW %R (Larry William's %R), A/D oscillator (accumulation/distribution oscillator), OSCP (price oscillator), CCI (commodity channel index), and so on. To confirm the superiority of CNN-FG, we compared its prediction accuracy with those of other classification models. Experimental results showed that CNN-FG outperforms LOGIT (logistic regression), ANN (artificial neural network), and SVM (support vector machine) with statistical significance. These empirical results imply that converting time series business data into graphs and building CNN-based classification models using these graphs can be effective from the perspective of prediction accuracy. Thus, this paper sheds light on how to apply deep learning techniques to the domain of business problem solving.
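
The first three CNN-FG steps described above (5-day windows, a 40 × 40 graph image, and conversion into R, G, and B matrices) can be illustrated with a simplified sketch. It rests on several assumptions that the abstract does not confirm: windows slide by one day, each graph is a polyline sampled once per pixel column, and each of three hypothetical indicator series is drawn into its own color channel (the paper draws twelve indicators in different colors). All function names are illustrative.

```python
SIZE = 40    # image side in pixels, as stated in the abstract
WINDOW = 5   # trading days per graph, as stated in the abstract


def split_windows(series, window=WINDOW, step=1):
    """Cut a series into windows; step=1 (sliding) is an assumption."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, step)]


def rasterize(window_values, size=SIZE):
    """Draw one window as a polyline on a size x size grid of 0/1 pixels."""
    lo, hi = min(window_values), max(window_values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    grid = [[0] * size for _ in range(size)]
    n = len(window_values)
    for x in range(size):
        # Position of this pixel column along the time axis.
        pos = x * (n - 1) / (size - 1)
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        # Linear interpolation between the two neighbouring days.
        value = window_values[left] * (1 - frac) + window_values[right] * frac
        y = round((value - lo) / span * (size - 1))
        grid[size - 1 - y][x] = 1  # row 0 is the top of the image
    return grid


def to_rgb_matrices(window_by_channel):
    """Return one matrix per colour channel (hypothetical channel mapping)."""
    return {channel: rasterize(values)
            for channel, values in window_by_channel.items()}


# Three made-up indicator windows, one per colour channel.
image = to_rgb_matrices({
    "R": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G": [5.0, 4.0, 3.0, 2.0, 1.0],
    "B": [3.0, 3.0, 3.0, 3.0, 3.0],
})
print(len(image["R"]), len(image["R"][0]))
```

The resulting three matrices correspond to the input a CNN would receive in the later steps; the split into training and validation sets and the network itself are not sketched here.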

Basic Research on the Possibility of Developing a Landscape Perceptual Response Prediction Model Using Artificial Intelligence - Focusing on Machine Learning Techniques - (인공지능을 활용한 경관 지각반응 예측모델 개발 가능성 기초연구 - 머신러닝 기법을 중심으로 -)

  • Kim, Jin-Pyo;Suh, Joo-Hwan
    • Journal of the Korean Institute of Landscape Architecture / v.51 no.3 / pp.70-82 / 2023
  • The recent surge of IT and data acquisition is shifting the paradigm in all aspects of life, and these advances are also affecting academic fields. Research topics and methods are being improved through academic exchange and connections. In particular, data-based research methods are employed in various academic fields, including landscape architecture, where continuous research is needed. Therefore, this study aims to investigate the possibility of developing a landscape preference evaluation and prediction model using machine learning, a branch of Artificial Intelligence, reflecting the current situation. To achieve this goal, machine learning techniques were applied to the landscape architecture field to build a landscape preference evaluation and prediction model and to verify the simulation accuracy of the model. For this, landscape images of wind power facilities, which have recently attracted attention as a renewable energy source, were selected as the research objects. For analysis, images of the wind power facility landscapes were collected using web crawling techniques, and an analysis dataset was built. Orange version 3.33, a program from the University of Ljubljana, was used for machine learning analysis to derive a prediction model with excellent performance. A model that integrates the evaluation criteria and a model structure that treats the evaluation criteria separately were used to generate models with the kNN, SVM, Random Forest, Logistic Regression, and Neural Network algorithms, which are suitable for machine learning classification. The performance of the generated models was evaluated to derive the most suitable prediction model. The prediction model derived in this study separately evaluates three evaluation criteria (classification by type of landscape, classification by distance between landscape and target, and classification by preference) and then synthesizes the results into a prediction. As a result of the study, a prediction model was developed with a high accuracy of 0.986 for the evaluation criterion according to the type of landscape, 0.973 for the criterion according to the distance, and 0.952 for the criterion according to the preference, and verification through the evaluation of data prediction results showed that it exceeds the required performance value of the model. As an experimental attempt to investigate the possibility of developing a prediction model using machine learning in landscape-related research, this study confirmed that a high-performance prediction model can be created by building a dataset through the collection and refinement of image data and subsequently using it in landscape-related research fields. Based on the results, implications, and limitations of this study, it is believed that various types of landscape prediction models can be developed, including models for wind power facility, natural, and cultural landscapes. Machine learning techniques can become more useful and valuable in the field of landscape architecture by exploring and applying research methods appropriate to the topic, by reducing the time needed for data classification through models that classify images according to landscape type, or by analyzing the importance of landscape planning factors through machine learning analysis of landscape prediction factors.

A Study on Method for User Gender Prediction Using Multi-Modal Smart Device Log Data (스마트 기기의 멀티 모달 로그 데이터를 이용한 사용자 성별 예측 기법 연구)

  • Kim, Yoonjung;Choi, Yerim;Kim, Solee;Park, Kyuyon;Park, Jonghun
    • The Journal of Society for e-Business Studies / v.21 no.1 / pp.147-163 / 2016
  • Gender information of a smart device user is essential for providing personalized services, and multi-modal data obtained from the device are useful for predicting the gender of the user. However, the method for utilizing each type of multi-modal data for gender prediction differs according to the characteristics of the data. Therefore, this study proposes an ensemble method for predicting the gender of a smart device user with three classifiers that take text, application, and acceleration data as inputs, respectively. To alleviate the privacy issues that arise when text data generated on a smart device are sent outside, a classification method is used that scans the text data only on the device and classifies the gender of the user by matching the text data with predefined sets of words. An application-based classifier assigns gender labels to executed applications and predicts the gender of the user by comparing the label ratio. Acceleration data are used with a Support Vector Machine to classify user gender. The proposed method was evaluated using actual smart device log data collected from an Android application. The experimental results showed that the proposed method outperformed the compared methods.
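
The text-matching and application-ratio classifiers described above, together with their combination, can be sketched as follows. This is only an illustration under stated assumptions: the word sets and application labels are hypothetical placeholders, the third (acceleration/SVM) vote is passed in as a ready-made label, and the majority vote is an assumed combination rule because the abstract does not specify how the ensemble merges the three outputs.

```python
from collections import Counter

# Hypothetical lexicons and app labels; the paper's actual sets are not given.
WORD_SETS = {"F": {"word_a", "word_b"}, "M": {"word_c", "word_d"}}
APP_LABELS = {"app_one": "F", "app_two": "M", "app_three": "M"}


def _unique_winner(counts):
    """Return the label with the strictly highest positive count, else None."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie: this classifier abstains
    return ranked[0][0]


def classify_by_words(tokens, word_sets=WORD_SETS):
    """Match on-device text tokens against the predefined word sets."""
    counts = {label: sum(1 for t in tokens if t in words)
              for label, words in word_sets.items()}
    return _unique_winner(counts)


def classify_by_apps(executed_apps, app_labels=APP_LABELS):
    """Compare the ratio of labels among the executed applications."""
    counts = Counter(app_labels[a] for a in executed_apps if a in app_labels)
    return _unique_winner(dict(counts))


def ensemble(votes):
    """Majority vote over the classifiers that did not abstain (assumption)."""
    return _unique_winner(dict(Counter(v for v in votes if v is not None)))


text_vote = classify_by_words(["word_a", "hello", "word_b", "word_c"])
app_vote = classify_by_apps(["app_two", "app_three", "app_one", "other"])
svm_vote = "M"  # stand-in for the acceleration-based SVM output
print(ensemble([text_vote, app_vote, svm_vote]))
```

Because only the matched label leaves `classify_by_words`, the raw text never needs to be transmitted, which is the privacy property the abstract emphasizes.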

Unsupervised Classification of Landsat-8 OLI Satellite Imagery Based on Iterative Spectral Mixture Model (자동화된 훈련 자료를 활용한 Landsat-8 OLI 위성영상의 반복적 분광혼합모델 기반 무감독 분류)

  • Choi, Jae Wan;Noh, Sin Taek;Choi, Seok Keun
    • Journal of Korean Society for Geospatial Information Science / v.22 no.4 / pp.53-61 / 2014
  • Landsat OLI satellite imagery can be applied to various remote sensing applications, such as generation of land cover maps, urban area analysis, extraction of vegetation indices and change detection, because it includes various multispectral bands. In addition, a land cover map is important information for monitoring and analyzing land cover using GIS. In this paper, a land cover map is generated by using Landsat OLI imagery and an existing land cover map. First, a training dataset is obtained automatically using the correlation between the existing land cover map and the unsupervised classification result produced by K-means. Then, the spectral signatures corresponding to each class are determined based on the training data. Finally, an abundance map and a land cover map are generated by using an iterative spectral mixture model. The experiment was conducted on Landsat OLI imagery of the Cheongju area. Quantitative and visual comparison with the existing land cover map and with the result of supervised classification by SVM shows that our method can produce a land cover map without a manually prepared training dataset.

A System for Automatic Classification of Traditional Culture Texts (전통문화 콘텐츠 표준체계를 활용한 자동 텍스트 분류 시스템)

  • Hur, YunA;Lee, DongYub;Kim, Kuekyeng;Yu, Wonhee;Lim, HeuiSeok
    • Journal of the Korea Convergence Society / v.8 no.12 / pp.39-47 / 2017
  • The Internet has increased the number of digital web documents related to the history and traditions of Korean culture. However, users who search for creators or materials related to traditional culture often cannot find the information they want, and the search results are insufficient. Document classification is required to access this information effectively. In the past, documents were classified manually, but manual classification takes a great deal of time and money. Therefore, this paper develops an automatic text classification model for traditional cultural contents based on data from the Korean information culture field, which are organized according to a systematic classification of traditional cultural contents. This study applied a TF-IDF model, a Bag-of-Words model, and a combined TF-IDF/Bag-of-Words model to extract word frequencies from the 'Korea Traditional Culture' data. We then developed the automatic text classification model of traditional cultural contents using the Support Vector Machine classification algorithm.
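
The Bag-of-Words and TF-IDF features mentioned above can be sketched in a few lines. The version below uses one common TF-IDF variant (term frequency normalized by document length, multiplied by the natural log of the inverse document frequency); the abstract does not state which weighting the authors used, and the tiny tokenized corpus is made up for illustration. The SVM classifier that would consume these vectors is not shown.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Raw term counts for one tokenized document."""
    return Counter(tokens)


def tf_idf(corpus):
    """Return one {term: weight} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = bag_of_words(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up tokenized documents standing in for the traditional-culture texts.
corpus = [["mask", "dance", "mask"], ["dance", "music"]]
vectors = tf_idf(corpus)
print(vectors[0])
```

A term that occurs in every document (here "dance") receives a weight of zero, which is why TF-IDF tends to emphasize words that discriminate between categories.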

Prediction and analysis of acute fish toxicity of pesticides to the rainbow trout using 2D-QSAR (2D-QSAR방법을 이용한 농약류의 무지개 송어 급성 어독성 분석 및 예측)

  • Song, In-Sik;Cha, Ji-Young;Lee, Sung-Kwang
    • Analytical Science and Technology / v.24 no.6 / pp.544-555 / 2011
  • The acute toxicity in the rainbow trout (Oncorhynchus mykiss) was analyzed and predicted using quantitative structure-activity relationships (QSAR). The aquatic toxicity, 96 h $LC_{50}$ (median lethal concentration), of 275 organic pesticides was obtained from the EU-funded project DEMETRA. Prediction models were derived from 558 2D molecular descriptors calculated in PreADMET. The linear (multiple linear regression) and nonlinear (support vector machine and artificial neural network) learning methods were optimized by taking into account the statistical parameters between the experimental and predicted p$LC_{50}$. After preprocessing, population-based forward selection was used to select the best subsets of descriptors for the learning methods, including a 5-fold cross-validation procedure. The support vector machine model was selected as the best model ($R^2_{CV}$=0.677, RMSECV=0.887, MSECV=0.674) and also correctly classified 87% of the training set according to EU regulation criteria. The MLR model could describe the structural characteristics of toxic chemicals and their interaction with the lipid membrane of fish. All the developed models were validated by 5-fold cross-validation and a Y-scrambling test.
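
The cross-validated statistics reported above ($R^2_{CV}$ and RMSECV) come from predicting each sample with a model trained on the other folds. The following is a minimal sketch of a 5-fold procedure and the two metrics. The fold assignment (sample i goes to fold i mod k), the one-descriptor least-squares line used as a stand-in model, and the toy data are all assumptions for illustration; the paper's SVM, descriptor selection, and exact metric definitions are not reproduced.

```python
import math


def k_fold_indices(n_samples, k=5):
    """Assign sample i to fold i % k and yield (train, test) index lists."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def fit_line(xs, ys):
    """Least-squares line on one descriptor; stand-in for the real model."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return lambda x: mean_y + slope * (x - mean_x)


def cross_validated_predictions(xs, ys, fit=fit_line, k=5):
    """Predict every sample with a model that never saw it during fitting."""
    preds = [None] * len(ys)
    for train, test in k_fold_indices(len(ys), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = model(xs[i])
    return preds


def rmse(ys, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data: one descriptor and a noiseless linear response.
descriptor = [float(i) for i in range(10)]
response = [2.0 * x + 1.0 for x in descriptor]
cv_preds = cross_validated_predictions(descriptor, response)
print(rmse(response, cv_preds), r_squared(response, cv_preds))
```

A Y-scrambling test would repeat the same procedure after randomly shuffling `response`; a sound model should then show a much lower cross-validated R².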

Classification of Multi-temporal SAR Data by Using Data Transform Based Features and Multiple Classifiers (자료변환 기반 특징과 다중 분류자를 이용한 다중시기 SAR자료의 분류)

  • Yoo, Hee Young;Park, No-Wook;Hong, Sukyoung;Lee, Kyungdo;Kim, Yeseul
    • Korean Journal of Remote Sensing / v.31 no.3 / pp.205-214 / 2015
  • In this study, a novel land-cover classification framework for multi-temporal SAR data is presented that combines multiple features extracted through data transforms with multiple classifiers. First, data transforms using principal component analysis (PCA) and the 3D wavelet transform are applied to the multi-temporal SAR dataset to extract new features that differ from the original dataset. Then, three different classifiers, namely the maximum likelihood classifier (MLC), neural network (NN) and support vector machine (SVM), are applied to three different datasets comprising the data transform based features and the original backscattering coefficients, which generates diverse preliminary classification results. These results are combined via a majority voting rule to generate a final classification result. In an experiment with a multi-temporal ENVISAT ASAR dataset, the preliminary classification results showed very different classification accuracies depending on the feature and classifier used. The final classification result combining the nine preliminary classification results showed the best classification accuracy because each preliminary result provided complementary information on land covers. The improvement in classification accuracy in this study was mainly attributed to the diversity gained from combining not only different features based on data transforms, but also different classifiers. Therefore, the land-cover classification framework presented in this study could be effectively applied to the classification of multi-temporal SAR data and could also be extended to multi-sensor remote sensing data fusion.
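
The majority voting rule that fuses the nine preliminary results (three feature sets times three classifiers) can be sketched per pixel as below. The class names and label maps are made up, and the tie-breaking rule (the first label in input order that reaches the top count) is an assumption, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the earliest such label."""
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:
        if counts[label] == top:
            return label


def fuse(preliminary_maps):
    """Combine per-pixel labels from several classification maps."""
    return [majority_vote(pixel_labels)
            for pixel_labels in zip(*preliminary_maps)]


# Nine made-up preliminary maps, each labelling the same three pixels.
maps = ([["rice", "water", "forest"]] * 5
        + [["field", "water", "rice"]] * 4)
print(fuse(maps))
```

Each inner list stands for one flattened classification map, so `zip(*preliminary_maps)` gathers the nine candidate labels for every pixel before voting.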