• Title/Summary/Keyword: Dataset Training

A Time Series Graph based Convolutional Neural Network Model for Effective Input Variable Pattern Learning : Application to the Prediction of Stock Market (효과적인 입력변수 패턴 학습을 위한 시계열 그래프 기반 합성곱 신경망 모형: 주식시장 예측에의 응용)

  • Lee, Mo-Se;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.167-181 / 2018
  • Over the past decade, deep learning has been in the spotlight among machine learning algorithms. In particular, the CNN (Convolutional Neural Network), known as an effective solution for recognizing and classifying images or voices, has been widely applied to classification and prediction problems. In this study, we investigate how to apply CNNs to business problem solving. Specifically, this study proposes applying a CNN to stock market prediction, one of the most challenging tasks in machine learning research. As mentioned, CNNs are strong at interpreting images. Thus, the model proposed in this study adopts a CNN as a binary classifier that predicts stock market direction (upward or downward) using time series graphs as its inputs. That is, our proposal is to build a machine learning algorithm that mimics the experts called 'technical analysts', who examine graphs of past price movements and predict future financial price movements. Our proposed model, named 'CNN-FG (Convolutional Neural Network using Fluctuation Graph)', consists of five steps. In the first step, it divides the dataset into intervals of 5 days. In step 2, it creates time series graphs for the divided dataset. Each graph is drawn in an image of $40 \times 40$ pixels, and the graph of each independent variable is drawn in a different color. In step 3, the model converts the images into matrices. Each image is converted into a combination of three matrices that express the color values on the R (red), G (green), and B (blue) scales. In the next step, it splits the dataset of graph images into training and validation datasets. We used 80% of the total dataset as the training dataset and the remaining 20% as the validation dataset. In the final step, CNN classifiers are trained using the images of the training dataset.
Regarding the parameters of CNN-FG, we adopted two convolution filters ($5 \times 5 \times 6$ and $5 \times 5 \times 9$) in the convolution layer. In the pooling layer, a $2 \times 2$ max pooling filter was used. The numbers of nodes in the two hidden layers were set to 900 and 32, respectively, and the number of nodes in the output layer was set to 2 (one for the prediction of an upward trend and the other for a downward trend). The activation functions for the convolution layer and the hidden layers were set to ReLU (Rectified Linear Unit), and the one for the output layer was set to the Softmax function. To validate CNN-FG, we applied it to the prediction of KOSPI200 over 2,026 days in eight years (from 2009 to 2016). To match the proportions of the two groups in the dependent variable (i.e., tomorrow's stock market movement), we selected 1,950 samples by random sampling. Finally, we built the training dataset using 80% of the total dataset (1,560 samples) and the validation dataset using 20% (390 samples). The independent variables of the experimental dataset included twelve technical indicators popularly used in previous studies. They include Stochastic %K, Stochastic %D, Momentum, ROC (rate of change), LW %R (Larry Williams' %R), A/D oscillator (accumulation/distribution oscillator), OSCP (price oscillator), CCI (commodity channel index), and so on. To confirm the superiority of CNN-FG, we compared its prediction accuracy with that of other classification models. Experimental results showed that CNN-FG outperforms LOGIT (logistic regression), ANN (artificial neural network), and SVM (support vector machine) with statistical significance. These empirical results imply that converting time series business data into graphs and building CNN-based classification models using these graphs can be effective from the perspective of prediction accuracy.
Thus, this paper sheds light on how to apply deep learning techniques to the domain of business problem solving.
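
Steps 1 to 3 of CNN-FG (5-day intervals, colored graphs on a 40 x 40 image, conversion to R, G, and B matrices) can be illustrated with a rough sketch. This is not the authors' implementation: the overlapping stride of one day, the per-window min-max scaling, the color palette, and all function names are assumptions made for illustration, and a real pipeline would more likely render the graphs with a plotting library.

```python
# Hypothetical sketch of CNN-FG steps 1-3; names and defaults are assumptions.
WINDOW = 5   # days per interval (from the abstract)
SIZE = 40    # image is 40 x 40 pixels (from the abstract)
COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # assumed palette, one color per variable


def make_windows(rows, window=WINDOW, stride=1):
    """Step 1: cut daily rows into 5-day intervals (stride of 1 day is assumed)."""
    return [rows[i:i + window] for i in range(0, len(rows) - window + 1, stride)]


def draw_window(window_rows, size=SIZE, colors=COLORS):
    """Step 2: draw one colored polyline per variable on a white size x size image."""
    image = [[[255, 255, 255] for _ in range(size)] for _ in range(size)]
    n_days = len(window_rows)
    for v in range(len(window_rows[0])):
        series = [row[v] for row in window_rows]
        lo, hi = min(series), max(series)
        span = (hi - lo) or 1.0  # avoid division by zero for a flat series
        # Map each day to a pixel; larger values are drawn nearer the top (y = 0).
        pts = [(round(d * (size - 1) / (n_days - 1)),
                round((1 - (value - lo) / span) * (size - 1)))
               for d, value in enumerate(series)]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            for x in range(x0, x1 + 1):
                t = (x - x0) / (x1 - x0) if x1 != x0 else 0.0
                image[round(y0 + t * (y1 - y0))][x] = list(colors[v % len(colors)])
    return image


def to_rgb_matrices(image):
    """Step 3: split the image into separate R, G, and B matrices."""
    return [[[px[c] for px in row] for row in image] for c in range(3)]


# Toy data: 12 days, 3 made-up variables per day.
rows = [[float(d), float((d * 7) % 5), float(10 - d)] for d in range(12)]
windows = make_windows(rows)
r, g, b = to_rgb_matrices(draw_window(windows[0]))
```

Each of `r`, `g`, and `b` is a 40 x 40 matrix, so one window becomes a 3-channel input of the kind a CNN classifier expects.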

Generating Training Dataset of Machine Learning Model for Context-Awareness in a Health Status Notification Service (사용자 건강 상태알림 서비스의 상황인지를 위한 기계학습 모델의 학습 데이터 생성 방법)

  • Mun, Jong Hyeok;Choi, Jong Sun;Choi, Jae Young
    • KIPS Transactions on Software and Data Engineering / v.9 no.1 / pp.25-32 / 2020
  • In context-aware systems, rule-based AI technology has been used in the abstraction process for obtaining context information. However, the rules have become complicated as user requirements for services diversify and data usage increases. As a result, there are technical limitations in maintaining rule-based models and in processing unstructured data. To overcome these limitations, many studies have applied machine learning techniques to context-aware systems. To use such a machine learning-based model in a context-aware system, a management process that periodically injects training data is required. A previous study on a machine learning-based context-aware system considered a series of management processes, such as generating and providing training data for operating several machine learning models, but the method was limited to the system to which it was applied. In this paper, we propose a training data generation method for machine learning models in order to extend the machine learning-based context-aware system. The proposed method defines a training data generating model that can reflect the requirements of the machine learning models and generates the training data for each machine learning model. In the experiment, the training data generating model is defined based on the training data generating schema of a cardiac status analysis model for older adults in a health status notification service, and the training data is generated by applying the defined model in the real software environment. In addition, the machine learning model is trained on the generated training data, and its accuracy is compared to verify the validity of the generated training data.

Feasibility of fully automated classification of whole slide images based on deep learning

  • Cho, Kyung-Ok;Lee, Sung Hak;Jang, Hyun-Jong
    • The Korean Journal of Physiology and Pharmacology / v.24 no.1 / pp.89-99 / 2020
  • Although microscopic analysis of tissue slides has been the basis for disease diagnosis for decades, intra- and inter-observer variabilities remain issues to be resolved. The recent introduction of digital scanners has allowed deep learning to be used in the analysis of tissue images because many whole slide images (WSIs) are accessible to researchers. In the present study, we investigated the possibility of a deep learning-based, fully automated, computer-aided diagnosis system with WSIs from a stomach adenocarcinoma dataset. Three different convolutional neural network architectures were tested to determine the best architecture for the tissue classifier. Each network was trained to classify small tissue patches as normal or tumor. Based on the patch-level classification, tumor probability heatmaps can be overlaid on tissue images. We observed three different tissue patterns: clear normal, clear tumor, and ambiguous cases. We suggest that longer inspection time can be assigned to ambiguous cases than to clear normal cases, increasing the accuracy and efficiency of histopathologic diagnosis by pre-evaluating the status of the WSIs. When the classifier was tested with a completely different WSI dataset, the performance was not optimal because of the different tissue preparation quality. By including a small amount of data from the new dataset in training, the performance on the new dataset was much enhanced. These results indicate that a WSI dataset should include tissues prepared under many different preparation conditions in order to construct a generalized tissue classifier. Thus, multi-national, multi-center datasets should be built for the application of deep learning in real-world medical practice.
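
The patch-level workflow described above (tile the slide, classify each patch, assemble a tumor probability heatmap, then sort slides into clear and ambiguous cases) might look roughly like the following sketch. The patch classifier is a stub standing in for the trained CNN, and the triage rule with its `low` and `high` thresholds is purely hypothetical; the abstract does not state how the three patterns were delimited.

```python
def patch_grid(width, height, patch):
    """Top-left corners of non-overlapping patches that tile the slide."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]


def heatmap(width, height, patch, classify):
    """Tumor-probability grid with one cell per patch, rows from top to bottom.

    `classify(x, y)` is a stand-in for the trained CNN applied to the patch at (x, y).
    """
    cols = width // patch
    probs = [classify(x, y) for x, y in patch_grid(width, height, patch)]
    return [probs[r * cols:(r + 1) * cols] for r in range(len(probs) // cols)]


def triage(grid, low=0.2, high=0.8):
    """Hypothetical rule: any mid-range patch probability marks the slide ambiguous."""
    flat = [p for row in grid for p in row]
    if any(low < p < high for p in flat):
        return "ambiguous"
    return "clear tumor" if any(p >= high for p in flat) else "clear normal"


def toy_classifier(x, y):
    """Made-up classifier: the right half of the slide is confidently tumor."""
    return 0.95 if x >= 512 else 0.05


grid = heatmap(1024, 512, 256, toy_classifier)
status = triage(grid)
```

A slide flagged as ambiguous by a rule of this kind would be the one given the longer inspection time that the abstract suggests.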

Developing an Ensemble Classifier for Bankruptcy Prediction (부도 예측을 위한 앙상블 분류기 개발)

  • Min, Sung-Hwan
    • Journal of Korea Society of Industrial Information Systems / v.17 no.7 / pp.139-148 / 2012
  • An ensemble of classifiers employs a set of individually trained classifiers and combines their predictions. It has been found that in most cases ensembles produce more accurate predictions than the base classifiers. Combining outputs from multiple classifiers, known as ensemble learning, is one of the standard and most important techniques for improving classification accuracy in machine learning. An ensemble of classifiers is effective only if the individual classifiers make decisions that are as diverse as possible. Bagging is the most popular ensemble learning method for generating a diverse set of classifiers. Diversity in bagging is obtained by using different training sets: the training data subsets are randomly drawn with replacement from the entire training dataset. The random subspace method is an ensemble construction technique that uses different attribute subsets. In the random subspace method, the training dataset is also modified, as in bagging, but the modification is performed in the feature space. Bagging and random subspace are well-known and popular ensemble algorithms. However, few studies have dealt with the integration of bagging and random subspace using SVM classifiers, though there is great potential for useful applications in this area. The focus of this paper is to propose methods for improving SVM performance using a hybrid ensemble strategy for bankruptcy prediction. This paper applies the proposed ensemble model to the bankruptcy prediction problem using a real dataset from Korean companies.
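
To make the combination of bagging and random subspace concrete, here is a minimal sketch in which each ensemble member is trained on a bootstrap sample of the rows (bagging) restricted to a random subset of the features (random subspace), and predictions are combined by majority vote. The base learner is a nearest-centroid classifier used only as a lightweight stand-in for the SVMs in the paper; the function names, ensemble size, and toy data are all assumptions.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Stand-in base learner (nearest centroid); the paper uses SVMs instead."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_centroids(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))


def fit_hybrid_ensemble(X, y, n_members=15, n_features=2, seed=0):
    """Train members on bootstrap rows (bagging) and random feature subsets."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # rows, with replacement
        feats = rng.sample(range(len(X[0])), n_features)      # features, without replacement
        Xb = [[X[i][f] for f in feats] for i in idx]
        yb = [y[i] for i in idx]
        members.append((feats, fit_centroids(Xb, yb)))
    return members


def predict_hybrid(members, x):
    """Majority vote over members, each seeing only its own feature subset."""
    votes = Counter(predict_centroids(m, [x[f] for f in feats]) for feats, m in members)
    return votes.most_common(1)[0][0]


# Made-up, well-separated two-class data with three features.
X = ([[i * 0.1, i * 0.05, i * 0.02] for i in range(10)]
     + [[10 + i * 0.1, 10 - i * 0.05, 9 + i * 0.02] for i in range(10)])
y = [0] * 10 + [1] * 10
members = fit_hybrid_ensemble(X, y)
```

Using an odd number of members avoids tied votes in the two-class case.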

Multiview Data Clustering by using Adaptive Spectral Co-clustering (적응형 분광 군집 방법을 이용한 다중 특징 데이터 군집화)

  • Son, Jeong-Woo;Jeon, Junekey;Lee, Sang-Yun;Kim, Sun-Joong
    • Journal of KIISE / v.43 no.6 / pp.686-691 / 2016
  • In this paper, we introduce adaptive spectral co-clustering, a spectral clustering method for multiview data, especially data with more than three views. In adaptive spectral co-clustering, performance is improved by sharing information from diverse views. For efficient information sharing, a co-training approach is adopted. In the co-training step, a set of parameters is estimated to make all views in the data maximally independent, and information is then shared with respect to the estimated parameters. This co-training step increases the efficiency of information sharing compared with ordinary feature concatenation and with co-training methods that assume independence among views. Adaptive spectral co-clustering was evaluated on a synthetic dataset and a multilingual document dataset. The experimental results demonstrated the efficiency of adaptive spectral co-clustering, both in the performance at every iteration and in the similarity matrix generated with information sharing.

MapReduce-based Localized Linear Regression for Electricity Price Forecasting (전기 가격 예측을 위한 맵리듀스 기반의 로컬 단위 선형회귀 모델)

  • Han, Jinju;Lee, Ingyu;On, Byung-Won
    • The Transactions of the Korean Institute of Electrical Engineers P / v.67 no.4 / pp.183-190 / 2018
  • Accurately predicting electricity prices is an important task in the electricity trading market. To address the electricity price forecasting problem, various approaches have been proposed, and linear regression-based approaches are known to be the best among them. However, the use of such linear regression-based methods is limited by low accuracy and performance. In traditional linear regression methods, it is not practical to find a nonlinear regression model that explains the training data well. If the training data is complex (i.e., few individual records and many features), it is difficult to find a polynomial function with n terms that fits the training data. On the other hand, when a linear regression model is used to approximate a nonlinear regression model, the accuracy of the model drops considerably because it does not accurately reflect the characteristics of the training data. To cope with this problem, we propose a new electricity price forecasting method that divides the entire dataset into multiple split datasets and finds the best linear regression models, each of which is the optimal model for its split. Meanwhile, to improve the performance of the proposed method, we adapt the proposed localized linear regression method to MapReduce, a framework for parallel processing of data stored in a Hadoop distributed file system. Our experimental results show that the proposed model outperforms the existing linear regression model. Specifically, the accuracy of the proposed method is improved by 45%, and it runs 5 times faster than the existing linear regression-based model.
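
The idea of localized linear regression in a map and reduce style can be sketched in a few lines: the map phase tags every record with the key of the split it belongs to, and the reduce phase fits one linear model per key. The abstract does not say how the dataset is split, so partitioning by fixed-width intervals of a single input `x` is an assumption here, as are the function names and the toy data. The sketch runs in a single process; in Hadoop the two phases would be distributed across nodes.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def map_phase(records, width):
    """Map: emit (split_key, record); the key is the x-interval (assumed split rule)."""
    for x, y in records:
        yield int(x // width), (x, y)


def reduce_phase(pairs):
    """Reduce: group records by split key and fit one local linear model per split."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return {key: fit_line(pts) for key, pts in groups.items()}


def predict(models, x, width):
    """Predict with the local model of the split that x falls into."""
    a, b = models[int(x // width)]
    return a * x + b


# Toy nonlinear data (y = x^2), chosen only to show why local lines fit better.
records = [(float(x), float(x * x)) for x in range(30)]
width = 10.0
models = reduce_phase(map_phase(records, width))
ga, gb = fit_line(records)  # single global line for comparison
local_sse = sum((predict(models, x, width) - y) ** 2 for x, y in records)
global_sse = sum((ga * x + gb - y) ** 2 for x, y in records)
```

On this toy curve the three local lines leave a much smaller squared error than the single global line, which is the effect the localized method relies on.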

Improving Adversarial Robustness via Attention (Attention 기법에 기반한 적대적 공격의 강건성 향상 연구)

  • Jaeuk Kim;Myung Gyo Oh;Leo Hyun Park;Taekyoung Kwon
    • Journal of the Korea Institute of Information Security & Cryptology / v.33 no.4 / pp.621-631 / 2023
  • Adversarial training improves the robustness of deep neural networks against adversarial examples. However, previous adversarial training methods focus only on the adversarial loss function, ignoring that even a small perturbation of the input layer causes a significant change in the hidden layer features. Consequently, the accuracy of a defended model is reduced in various situations not seen during training, such as clean samples or other attack techniques. Therefore, an architectural perspective is necessary to improve feature representation power and solve this problem. In this paper, we apply an attention module that generates an attention map of an input image to a general model and perform PGD adversarial training on the augmented model. In our experiments on the CIFAR-10 dataset, the attention-augmented model showed higher accuracy than the general model regardless of the network structure. In particular, the robust accuracy of our approach was consistently higher against various attacks such as PGD, FGSM, and BIM, as well as against more powerful adversaries. By visualizing the attention map, we further confirmed that the attention module extracts features of the correct class even for adversarial examples.
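
For readers unfamiliar with PGD, the attack at the core of PGD adversarial training can be sketched on a toy model. The example below uses a two-feature logistic regression instead of a deep network so that the input gradient can be written by hand; the weights, step size, and perturbation budget are made up, and the random start and pixel-range clipping commonly used with image inputs are omitted. In adversarial training, the model would then be updated on the perturbed inputs produced by such an attack.

```python
import math


def sign(v):
    """Sign of v as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def logistic_loss(w, b, x, y):
    """Cross-entropy loss of a logistic regression model for label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def loss_grad_x(w, b, x, y):
    """Gradient of the loss with respect to the input x: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def pgd_attack(w, b, x, y, eps=0.3, alpha=0.1, steps=10):
    """L-infinity PGD: signed-gradient ascent steps, each projected into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: keep every coordinate within eps of the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Toy model and input (all values are illustrative assumptions).
w, b = [2.0, -1.0], 0.0
x_clean, label = [0.5, 0.2], 1
x_adv = pgd_attack(w, b, x_clean, label)
```

The perturbed input stays within `eps` of the clean input in every coordinate while the loss on the true label goes up.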

Invasion of Privacy of Federated Learning by Data Reconstruction Attack with Technique for Converting Pixel Value (픽셀값 변환 기법을 더한 데이터 복원공격에의한 연합학습의 프라이버시 침해)

  • Yoon-ju Oh;Dae-seon Choi
    • Journal of the Korea Institute of Information Security & Cryptology / v.33 no.1 / pp.63-74 / 2023
  • To guard against invasion of privacy, Federated Learning (FL), which learns using parameters, is emerging. However, a paper that leaks training data using gradients was recently published. Our paper implements an experiment that leaks training data using gradients in a federated learning environment and proposes a method that improves reconstruction performance by improving existing attacks that leak training data. Experiments with the proposed method on the Yale Face Database B and the MNIST dataset show that federated learning is not safe from invasion of privacy: up to 100 out of 100 training samples were reconstructed when the performance of federated learning is high, at an accuracy of 99-100%. In addition, by comparing pixel-level performance (MSE, PSNR, SSIM) with identification performance measured by a human test, we emphasize the importance of identification performance over pixel-level performance.
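
Two of the pixel-level metrics named above, MSE and PSNR, are simple enough to sketch directly; SSIM is omitted because it requires windowed statistics. The sketch assumes images given as flat lists of 8-bit pixel values of equal length, and the sample pixels are made up.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Made-up 4-pixel "images" for illustration.
original = [0, 128, 255, 64]
reconstructed = [0, 130, 250, 64]
error = mse(original, reconstructed)
quality_db = psnr(original, reconstructed)
```

A lower MSE and a higher PSNR indicate a reconstruction closer to the original at the pixel level, which, as the abstract argues, is not the same as a reconstruction that a human can identify.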

Development of Autonomous Vehicle Learning Data Generation System (자율주행 차량의 학습 데이터 자동 생성 시스템 개발)

  • Yoon, Seungje;Jung, Jiwon;Hong, June;Lim, Kyungil;Kim, Jaehwan;Kim, Hyungjoo
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.19 no.5 / pp.162-177 / 2020
  • Perception of the traffic environment based on various sensors in an autonomous driving system is directly related to driving safety. Recently, as perception models based on deep neural networks have come into use with the development of machine learning and deep neural network technology, both perception model training and high-quality training datasets are required. However, there are several practical difficulties in collecting data on all situations that may occur in self-driving. The performance of the perception model may deteriorate because of differences between overseas and domestic traffic environments, and the quality of data collected in bad weather, in which the sensors cannot operate normally, cannot be guaranteed. Therefore, it is necessary to build a virtual road environment in a simulator, rather than using the actual road, to collect the training data. In this paper, a training dataset collection process is proposed that diversifies the weather, illumination, sensor position, and the types and counts of vehicles in a simulator environment that simulates domestic road conditions. To achieve better performance, the authors changed the domain of the images to be closer to real-world imagery and diversified them. The performance evaluation was conducted on test data collected in the actual road environment, and the performance was similar to that of a model trained only on actual environmental data.

Document Image Binarization by GAN with Unpaired Data Training

  • Dang, Quang-Vinh;Lee, Guee-Sang
    • International Journal of Contents / v.16 no.2 / pp.8-18 / 2020
  • Data is critical in deep learning, but data scarcity often occurs in research, especially in the preparation of paired training data. In this paper, document image binarization with unpaired data is studied by introducing adversarial learning, removing the need for supervised or labeled datasets. However, a simple extension of previous unpaired training to binarization inevitably leads to poor performance compared to paired data training. Thus, a new deep learning approach is proposed that introduces greater diversity of higher-quality generated images. Specifically, a two-stage model is proposed that comprises a generative adversarial network (GAN) followed by a U-net network. In the first stage, the GAN uses the unpaired image data to create paired image data. In the second stage, the generated paired image data are passed through the U-net network for binarization. Thus, the trained U-net becomes the binarization model during testing. The proposed model has been evaluated on the publicly available DIBCO dataset, and it outperforms other techniques on unpaired training data. The paper shows, for the first time in the literature, the potential of using unpaired data for binarization, which can be further improved to replace paired data training for binarization in the future.