• Title/Summary/Keyword: train dataset

Search Result 155, Processing Time 0.023 seconds

Export Prediction Using Separated Learning Method and Recommendation of Potential Export Countries (분리학습 모델을 이용한 수출액 예측 및 수출 유망국가 추천)

  • Jang, Yeongjin;Won, Jongkwan;Lee, Chaerok
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.1
    • /
    • pp.69-88
    • /
    • 2022
  • One of the characteristics of South Korea's economic structure is that it is highly dependent on exports. Thus, many businesses are closely related to the global economy and diplomatic situation. In addition, small and medium-sized enterprises(SMEs) specialized in exporting are struggling due to the spread of COVID-19. Therefore, this study aimed to develop a model to forecast exports for next year to support SMEs' export strategy and decision making. Also, this study proposed a strategy to recommend promising export countries of each item based on the forecasting model. We analyzed important variables used in previous studies such as country-specific, item-specific, and macro-economic variables and collected those variables to train our prediction model. Next, through the exploratory data analysis(EDA) it was found that exports, which is a target variable, have a highly skewed distribution. To deal with this issue and improve predictive performance, we suggest a separated learning method. In a separated learning method, the whole dataset is divided into homogeneous subgroups and a prediction algorithm is applied to each group. Thus, characteristics of each group can be more precisely trained using different input variables and algorithms. In this study, we divided the dataset into five subgroups based on the exports to decrease skewness of the target variable. After the separation, we found that each group has different characteristics in countries and goods. For example, In Group 1, most of the exporting countries are developing countries and the majority of exporting goods are low value products such as glass and prints. On the other hand, major exporting countries of South Korea such as China, USA, and Vietnam are included in Group 4 and Group 5 and most exporting goods in these groups are high value products. Then we used LightGBM(LGBM) and Exponential Moving Average(EMA) for prediction. Considering the characteristics of each group, models were built using LGBM for Group 1 to 4 and EMA for Group 5. To evaluate the performance of the model, we compare different model structures and algorithms. As a result, it was found that the separated learning model had best performance compared to other models. After the model was built, we also provided variable importance of each group using SHAP-value to add explainability of our model. Based on the prediction model, we proposed a second-stage recommendation strategy for potential export countries. In the first phase, BCG matrix was used to find Star and Question Mark markets that are expected to grow rapidly. In the second phase, we calculated scores for each country and recommendations were made according to ranking. Using this recommendation framework, potential export countries were selected and information about those countries for each item was presented. There are several implications of this study. First of all, most of the preceding studies have conducted research on the specific situation or country. However, this study use various variables and develops a machine learning model for a wide range of countries and items. Second, as to our knowledge, it is the first attempt to adopt a separated learning method for exports prediction. By separating the dataset into 5 homogeneous subgroups, we could enhance the predictive performance of the model. Also, more detailed explanation of models by group is provided using SHAP values. Lastly, this study has several practical implications. There are some platforms which serve trade information including KOTRA, but most of them are based on past data. Therefore, it is not easy for companies to predict future trends. By utilizing the model and recommendation strategy in this research, trade related services in each platform can be improved so that companies including SMEs can fully utilize the service when making strategies and decisions for exports.

Multi-Variate Tabular Data Processing and Visualization Scheme for Machine Learning based Analysis: A Case Study using Titanic Dataset (기계 학습 기반 분석을 위한 다변량 정형 데이터 처리 및 시각화 방법: Titanic 데이터셋 적용 사례 연구)

  • Juhyoung Sung;Kiwon Kwon;Kyoungwon Park;Byoungchul Song
    • Journal of Internet Computing and Services
    • /
    • v.25 no.4
    • /
    • pp.121-130
    • /
    • 2024
  • As internet and communication technology (ICT) is improved exponentially, types and amount of available data also increase. Even though data analysis including statistics is significant to utilize this large amount of data, there are inevitable limits to process various and complex data in general way. Meanwhile, there are many attempts to apply machine learning (ML) in various fields to solve the problems according to the enhancement in computational performance and increase in demands for autonomous systems. Especially, data processing for the model input and designing the model to solve the objective function are critical to achieve the model performance. Data processing methods according to the type and property have been presented through many studies and the performance of ML highly varies depending on the methods. Nevertheless, there are difficulties in deciding which data processing method for data analysis since the types and characteristics of data have become more diverse. Specifically, multi-variate data processing is essential for solving non-linear problem based on ML. In this paper, we present a multi-variate tabular data processing scheme for ML-aided data analysis by using Titanic dataset from Kaggle including various kinds of data. We present the methods like input variable filtering applying statistical analysis and normalization according to the data property. In addition, we analyze the data structure using visualization. Lastly, we design an ML model and train the model by applying the proposed multi-variate data process. After that, we analyze the passenger's survival prediction performance of the trained model. We expect that the proposed multi-variate data processing and visualization can be extended to various environments for ML based analysis.

Prediction of Shear Wave Velocity on Sand Using Standard Penetration Test Results : Application of Artificial Neural Network Model (표준관입시험결과를 이용한 사질토 지반의 전단파속도 예측 : 인공신경망 모델의 적용)

  • Kim, Bum-Joo;Ho, Joon-Ki;Hwang, Young-Cheol
    • Journal of the Korean Geotechnical Society
    • /
    • v.30 no.5
    • /
    • pp.47-54
    • /
    • 2014
  • Although shear wave velocity ($V_s$) is an important design factor in seismic design, the measurement is not usually made in typical field investigation due to time and economic limitations. In the present study, an investigation was made to predict sand $V_s$ based on the standard penetration test (SPT) results by using artificial neural network (ANN) model. A total of 650 dataset composed of SPT-N value ($N_{60}$), water content, fine content, specific gravity for input data and $V_s$ for output data was used to build and train the ANN model. The sensitivity analysis was then performed for the trained ANN to examine the effect of the input variables on the $V_s$. Also, the ANN model was compared with seven existing empirical models on the performance. The sensitivity analysis results revealed that the effect of the SPT-N value on $V_s$ is significantly greater compared to other input variables. Also, when compared with the empirical models using Nash-Sutcliffe Model Efficiency Coefficient (NSE) and Root Mean Square Error (RMSE), the ANN model was found to exhibit the highest prediction capability.

Estimation of Manhattan Coordinate System using Convolutional Neural Network (합성곱 신경망 기반 맨하탄 좌표계 추정)

  • Lee, Jinwoo;Lee, Hyunjoon;Kim, Junho
    • Journal of the Korea Computer Graphics Society
    • /
    • v.23 no.3
    • /
    • pp.31-38
    • /
    • 2017
  • In this paper, we propose a system which estimates Manhattan coordinate systems for urban scene images using a convolutional neural network (CNN). Estimating the Manhattan coordinate system from an image under the Manhattan world assumption is the basis for solving computer graphics and vision problems such as image adjustment and 3D scene reconstruction. We construct a CNN that estimates Manhattan coordinate systems based on GoogLeNet [1]. To train the CNN, we collect about 155,000 images under the Manhattan world assumption by using the Google Street View APIs and calculate Manhattan coordinate systems using existing calibration methods to generate dataset. In contrast to PoseNet [2] that trains per-scene CNNs, our method learns from images under the Manhattan world assumption and thus estimates Manhattan coordinate systems for new images that have not been learned. Experimental results show that our method estimates Manhattan coordinate systems with the median error of $3.157^{\circ}$ for the Google Street View images of non-trained scenes, as test set. In addition, compared to an existing calibration method [3], the proposed method shows lower intermediate errors for the test set.

An Auto-Labeling based Smart Image Annotation System (자동-레이블링 기반 영상 학습데이터 제작 시스템)

  • Lee, Ryong;Jang, Rae-young;Park, Min-woo;Lee, Gunwoo;Choi, Myung-Seok
    • The Journal of the Korea Contents Association
    • /
    • v.21 no.6
    • /
    • pp.701-715
    • /
    • 2021
  • The drastic advance of recent deep learning technologies is heavily dependent on training datasets which are essential to train models by themselves with less human efforts. In comparison with the work to design deep learning models, preparing datasets is a long haul; at the moment, in the domain of vision intelligent, datasets are still being made by handwork requiring a lot of time and efforts, where workers need to directly make labels on each image usually with GUI-based labeling tools. In this paper, we overview the current status of vision datasets focusing on what datasets are being shared and how they are prepared with various labeling tools. Particularly, in order to relieve the repetitive and tiring labeling work, we present an interactive smart image annotating system with which the annotation work can be transformed from the direct human-only manual labeling to a correction-after-checking by means of a support of automatic labeling. In an experiment, we show that automatic labeling can greatly improve the productivity of datasets especially reducing time and efforts to specify regions of objects found in images. Finally, we discuss critical issues that we faced in the experiment to our annotation system and describe future work to raise the productivity of image datasets creation for accelerating AI technology.

Sentiment Analysis of Product Reviews to Identify Deceptive Rating Information in Social Media: A SentiDeceptive Approach

  • Marwat, M. Irfan;Khan, Javed Ali;Alshehri, Dr. Mohammad Dahman;Ali, Muhammad Asghar;Hizbullah;Ali, Haider;Assam, Muhammad
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.3
    • /
    • pp.830-860
    • /
    • 2022
  • [Introduction] Nowadays, many companies are shifting their businesses online due to the growing trend among customers to buy and shop online, as people prefer online purchasing products. [Problem] Users share a vast amount of information about products, making it difficult and challenging for the end-users to make certain decisions. [Motivation] Therefore, we need a mechanism to automatically analyze end-user opinions, thoughts, or feelings in the social media platform about the products that might be useful for the customers to make or change their decisions about buying or purchasing specific products. [Proposed Solution] For this purpose, we proposed an automated SentiDecpective approach, which classifies end-user reviews into negative, positive, and neutral sentiments and identifies deceptive crowd-users rating information in the social media platform to help the user in decision-making. [Methodology] For this purpose, we first collected 11781 end-users comments from the Amazon store and Flipkart web application covering distant products, such as watches, mobile, shoes, clothes, and perfumes. Next, we develop a coding guideline used as a base for the comments annotation process. We then applied the content analysis approach and existing VADER library to annotate the end-user comments in the data set with the identified codes, which results in a labelled data set used as an input to the machine learning classifiers. Finally, we applied the sentiment analysis approach to identify the end-users opinions and overcome the deceptive rating information in the social media platforms by first preprocessing the input data to remove the irrelevant (stop words, special characters, etc.) data from the dataset, employing two standard resampling approaches to balance the data set, i-e, oversampling, and under-sampling, extract different features (TF-IDF and BOW) from the textual data in the data set and then train & test the machine learning algorithms by applying a standard cross-validation approach (KFold and Shuffle Split). [Results/Outcomes] Furthermore, to support our research study, we developed an automated tool that automatically analyzes each customer feedback and displays the collective sentiments of customers about a specific product with the help of a graph, which helps customers to make certain decisions. In a nutshell, our proposed sentiments approach produces good results when identifying the customer sentiments from the online user feedbacks, i-e, obtained an average 94.01% precision, 93.69% recall, and 93.81% F-measure value for classifying positive sentiments.

Development of Fender Segmentation System for Port Structures using Vision Sensor and Deep Learning (비전센서 및 딥러닝을 이용한 항만구조물 방충설비 세분화 시스템 개발)

  • Min, Jiyoung;Yu, Byeongjun;Kim, Jonghyeok;Jeon, Haemin
    • Journal of the Korea institute for structural maintenance and inspection
    • /
    • v.26 no.2
    • /
    • pp.28-36
    • /
    • 2022
  • As port structures are exposed to various extreme external loads such as wind (typhoons), sea waves, or collision with ships; it is important to evaluate the structural safety periodically. To monitor the port structure, especially the rubber fender, a fender segmentation system using a vision sensor and deep learning method has been proposed in this study. For fender segmentation, a new deep learning network that improves the encoder-decoder framework with the receptive field block convolution module inspired by the eccentric function of the human visual system into the DenseNet format has been proposed. In order to train the network, various fender images such as BP, V, cell, cylindrical, and tire-types have been collected, and the images are augmented by applying four augmentation methods such as elastic distortion, horizontal flip, color jitter, and affine transforms. The proposed algorithm has been trained and verified with the collected various types of fender images, and the performance results showed that the system precisely segmented in real time with high IoU rate (84%) and F1 score (90%) in comparison with the conventional segmentation model, VGG16 with U-net. The trained network has been applied to the real images taken at one port in Republic of Korea, and found that the fenders are segmented with high accuracy even with a small dataset.

Change Detection Using Deep Learning Based Semantic Segmentation for Nuclear Activity Detection and Monitoring (핵 활동 탐지 및 감시를 위한 딥러닝 기반 의미론적 분할을 활용한 변화 탐지)

  • Song, Ahram;Lee, Changhui;Lee, Jinmin;Han, Youkyung
    • Korean Journal of Remote Sensing
    • /
    • v.38 no.6_1
    • /
    • pp.991-1005
    • /
    • 2022
  • Satellite imaging is an effective supplementary data source for detecting and verifying nuclear activity. It is also highly beneficial in regions with limited access and information, such as nuclear installations. Time series analysis, in particular, can identify the process of preparing for the conduction of a nuclear experiment, such as relocating equipment or changing facilities. Differences in the semantic segmentation findings of time series photos were employed in this work to detect changes in meaningful items connected to nuclear activity. Building, road, and small object datasets made of KOMPSAT 3/3A photos given by AIHub were used to train deep learning models such as U-Net, PSPNet, and Attention U-Net. To pick relevant models for targets, many model parameters were adjusted. The final change detection was carried out by including object information into the first change detection, which was obtained as the difference in semantic segmentation findings. The experiment findings demonstrated that the suggested approach could effectively identify altered pixels. Although the suggested approach is dependent on the accuracy of semantic segmentation findings, it is envisaged that as the dataset for the region of interest grows in the future, so will the relevant scope of the proposed method.

Similar Contents Recommendation Model Based On Contents Meta Data Using Language Model (언어모델을 활용한 콘텐츠 메타 데이터 기반 유사 콘텐츠 추천 모델)

  • Donghwan Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.1
    • /
    • pp.27-40
    • /
    • 2023
  • With the increase in the spread of smart devices and the impact of COVID-19, the consumption of media contents through smart devices has significantly increased. Along with this trend, the amount of media contents viewed through OTT platforms is increasing, that makes contents recommendations on these platforms more important. Previous contents-based recommendation researches have mostly utilized metadata that describes the characteristics of the contents, with a shortage of researches that utilize the contents' own descriptive metadata. In this paper, various text data including titles and synopses that describe the contents were used to recommend similar contents. KLUE-RoBERTa-large, a Korean language model with excellent performance, was used to train the model on the text data. A dataset of over 20,000 contents metadata including titles, synopses, composite genres, directors, actors, and hash tags information was used as training data. To enter the various text features into the language model, the features were concatenated using special tokens that indicate each feature. The test set was designed to promote the relative and objective nature of the model's similarity classification ability by using the three contents comparison method and applying multiple inspections to label the test set. Genres classification and hash tag classification prediction tasks were used to fine-tune the embeddings for the contents meta text data. As a result, the hash tag classification model showed an accuracy of over 90% based on the similarity test set, which was more than 9% better than the baseline language model. Through hash tag classification training, it was found that the language model's ability to classify similar contents was improved, which demonstrated the value of using a language model for the contents-based filtering.

Corporate Bankruptcy Prediction Model using Explainable AI-based Feature Selection (설명가능 AI 기반의 변수선정을 이용한 기업부실예측모형)

  • Gundoo Moon;Kyoung-jae Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.2
    • /
    • pp.241-265
    • /
    • 2023
  • A corporate insolvency prediction model serves as a vital tool for objectively monitoring the financial condition of companies. It enables timely warnings, facilitates responsive actions, and supports the formulation of effective management strategies to mitigate bankruptcy risks and enhance performance. Investors and financial institutions utilize default prediction models to minimize financial losses. As the interest in utilizing artificial intelligence (AI) technology for corporate insolvency prediction grows, extensive research has been conducted in this domain. However, there is an increasing demand for explainable AI models in corporate insolvency prediction, emphasizing interpretability and reliability. The SHAP (SHapley Additive exPlanations) technique has gained significant popularity and has demonstrated strong performance in various applications. Nonetheless, it has limitations such as computational cost, processing time, and scalability concerns based on the number of variables. This study introduces a novel approach to variable selection that reduces the number of variables by averaging SHAP values from bootstrapped data subsets instead of using the entire dataset. This technique aims to improve computational efficiency while maintaining excellent predictive performance. To obtain classification results, we aim to train random forest, XGBoost, and C5.0 models using carefully selected variables with high interpretability. The classification accuracy of the ensemble model, generated through soft voting as the goal of high-performance model design, is compared with the individual models. The study leverages data from 1,698 Korean light industrial companies and employs bootstrapping to create distinct data groups. Logistic Regression is employed to calculate SHAP values for each data group, and their averages are computed to derive the final SHAP values. The proposed model enhances interpretability and aims to achieve superior predictive performance.