• Title/Summary/Keyword: Research dataset

Amazon product recommendation system based on a modified convolutional neural network

  • Yarasu Madhavi Latha;B. Srinivasa Rao
    • ETRI Journal
    • /
    • v.46 no.4
    • /
    • pp.633-647
    • /
    • 2024
  • In e-commerce platforms, sentiment analysis of an enormous number of user reviews efficiently enhances user satisfaction. In this article, an automated product recommendation system is developed based on machine- and deep-learning models. In the initial step, the text data are acquired from the Amazon Product Reviews (APR) dataset, which includes 60,000 customer reviews: 14,806 neutral, 19,567 negative, and 25,627 positive. Next, the text data are denoised using techniques such as stop-word removal, stemming, segregation, lemmatization, and tokenization. Removing stop words along with duplicate and inconsistent text improves the classification performance and decreases the training time of the model. Vectorization is then accomplished using the term frequency-inverse document frequency (TF-IDF) technique, which converts the denoised text into numerical vectors for faster processing. The resulting feature vectors are fed to the modified convolutional neural network model for sentiment analysis on e-commerce platforms. The empirical results show that the proposed model obtained a mean accuracy of 97.40% on the APR dataset.
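The denoise-then-vectorize pipeline described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's method: the toy reviews are invented, and the paper's modified CNN classifier is replaced by logistic regression as a stand-in.

```python
# Sketch of the TF-IDF preprocessing pipeline described above.
# Assumptions: toy review texts (not the APR dataset) and logistic
# regression standing in for the paper's modified CNN.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = [
    "great product fast shipping",
    "terrible quality broke quickly",
    "works fine nothing special",
    "love it excellent value",
]
labels = ["positive", "negative", "neutral", "positive"]

# Stop-word removal is folded into the vectorizer; TF-IDF turns the
# denoised text into numerical feature vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

clf = LogisticRegression().fit(X, labels)
pred = clf.predict(vectorizer.transform(["excellent value great product"]))
print(pred[0])
```

In a real system, the vectorizer would be fitted on the full denoised corpus and the resulting sparse matrix passed to the downstream classifier.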

3-D Manipulation of Brain Atlas

  • Paik, Chul-Hwa;Kim, Won-Ky
    • Proceedings of the KOSOMBE Conference
    • /
    • v.1995 no.05
    • /
    • pp.233-234
    • /
    • 1995
  • Tri-planar interpolation of the orthogonal digital brain atlas is proposed to achieve a higher-resolution volumetric atlas. With this expanded dataset, brain mapping can be accomplished with fewer registration errors.
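The core operation here, upsampling a volume by interpolating between slices, can be sketched with plain NumPy. This is a generic linear-midpoint illustration under invented toy data, not the paper's tri-planar scheme.

```python
# Sketch of volume upsampling by linear interpolation along each axis,
# in the spirit of the atlas expansion described above. The 3x3x3
# "atlas" is a toy array, not real atlas data.
import numpy as np

def upsample_axis(vol, axis):
    """Insert midpoints along one axis by two-point linear interpolation."""
    a = np.moveaxis(vol, axis, 0)
    mid = (a[:-1] + a[1:]) / 2.0
    out = np.empty((2 * a.shape[0] - 1,) + a.shape[1:], dtype=vol.dtype)
    out[0::2] = a       # keep original slices
    out[1::2] = mid     # interleave interpolated slices
    return np.moveaxis(out, 0, axis)

atlas = np.arange(27, dtype=float).reshape(3, 3, 3)
hires = atlas
for ax in range(3):     # interpolate along all three orthogonal axes
    hires = upsample_axis(hires, ax)
print(hires.shape)  # (5, 5, 5)
```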

Simultaneous Optimization of a KNN Ensemble Model for Bankruptcy Prediction (부도예측을 위한 KNN 앙상블 모형의 동시 최적화)

  • Min, Sung-Hwan
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.139-157
    • /
    • 2016
  • Bankruptcy involves considerable costs, so it can have significant effects on a country's economy. Thus, bankruptcy prediction is an important issue. Over the past several decades, many researchers have addressed topics associated with bankruptcy prediction. Early research on bankruptcy prediction employed conventional statistical methods such as univariate analysis, discriminant analysis, multiple regression, and logistic regression. Later on, many studies began utilizing artificial intelligence techniques such as inductive learning, neural networks, and case-based reasoning. Currently, ensemble models are being utilized to enhance the accuracy of bankruptcy prediction. Ensemble classification involves combining multiple classifiers to obtain more accurate predictions than those obtained using individual models. Ensemble learning techniques are known to be very useful for improving the generalization ability of the classifier. Base classifiers in the ensemble must be as accurate and diverse as possible in order to enhance the generalization ability of an ensemble model. Commonly used methods for constructing ensemble classifiers include bagging, boosting, and random subspace. The random subspace method selects a random feature subset for each classifier from the original feature space to diversify the base classifiers of an ensemble. Each ensemble member is trained by a randomly chosen feature subspace from the original feature set, and predictions from each ensemble member are combined by an aggregation method. The k-nearest neighbors (KNN) classifier is robust with respect to variations in the dataset but is very sensitive to changes in the feature space. For this reason, KNN is a good classifier for the random subspace method. The KNN random subspace ensemble model has been shown to be very effective for improving an individual KNN model. 
The k parameter of the KNN base classifiers and the feature subsets selected for the base classifiers play an important role in determining the performance of the KNN ensemble model. However, few studies have focused on optimizing the k parameter and feature subsets of the base classifiers in the ensemble. This study proposed a new ensemble method that improves upon the performance of the KNN ensemble model by optimizing both the k parameters and the feature subsets of the base classifiers. A genetic algorithm was used to optimize the KNN ensemble model and improve the prediction accuracy of the ensemble model. The proposed model was applied to a bankruptcy prediction problem using a real dataset from Korean companies. The research data included 1800 externally non-audited firms that filed for bankruptcy (900 cases) or non-bankruptcy (900 cases). Initially, the dataset consisted of 134 financial ratios. Prior to the experiments, 75 financial ratios were selected based on an independent-sample t-test of each financial ratio as an input variable and bankruptcy or non-bankruptcy as the output variable. Of these, 24 financial ratios were selected using a logistic regression backward feature selection method. The complete dataset was separated into two parts: training and validation. The training dataset was further divided into two portions: one for training the model and the other for avoiding overfitting. The prediction accuracy against the latter portion was used as the fitness value in order to avoid overfitting. The validation dataset was used to evaluate the effectiveness of the final model. A 10-fold cross-validation was implemented to compare the performances of the proposed model and other models. To evaluate the effectiveness of the proposed model, its classification accuracy was compared with that of other models, and the Q-statistic values and average classification accuracies of the base classifiers were investigated.
The experimental results showed that the proposed model outperformed other models, such as the single model and random subspace ensemble model.
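The random-subspace KNN ensemble the paper builds on can be sketched compactly. This illustration uses synthetic data (not the Korean bankruptcy dataset) and fixed k and subspace sizes; the paper's genetic-algorithm optimization of k and the feature subsets is omitted.

```python
# Sketch of a random-subspace KNN ensemble: each member votes using a
# plain KNN trained on a random feature subset, and votes are combined
# by majority. Data is synthetic; k and subspace size are fixed here,
# whereas the paper optimizes them with a genetic algorithm.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy two-class data: 24 "financial ratio" features, as in the paper.
X = rng.normal(size=(200, 24))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def knn_predict(X_tr, y_tr, x, k=5):
    """Plain KNN: majority vote among the k nearest training points."""
    d = np.linalg.norm(X_tr - x, axis=1)
    nearest = y_tr[np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Each ensemble member sees a random feature subspace.
n_members, subspace_size = 10, 8
subspaces = [rng.choice(24, size=subspace_size, replace=False)
             for _ in range(n_members)]

def ensemble_predict(x):
    votes = [knn_predict(X[:, s], y, x[s]) for s in subspaces]
    return Counter(votes).most_common(1)[0][0]

x_new = rng.normal(size=24)
print(ensemble_predict(x_new))
```

KNN's sensitivity to the feature space, noted in the abstract, is exactly what makes the random subspaces produce diverse base classifiers here.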

A Study on Database Design Model for Production System Record Management Module in DataSet Record Management (데이터세트 기록관리를 위한 생산시스템 기록관리 모듈의 DB 설계 모형연구)

  • Kim, Dongsu;Yim, Jinhee;Kang, Sung-hee
    • The Korean Journal of Archival Studies
    • /
    • no.78
    • /
    • pp.153-195
    • /
    • 2023
  • RDBMS is a widely used database system worldwide, and the term dataset refers to the vast amounts of data produced in administrative information systems built on an RDBMS. Unlike business systems that mainly produce administrative documents, administrative information systems generate records centered on the unique tasks of organizations. These records differ from traditional approval documents and metadata, making it challenging to transfer them seamlessly to standard record management systems. With the 2022 revision of the 'Public Records Act Enforcement Decree,' datasets were included among the types of records for which only management authority is transferred. The core aspect of this revision is the need to manage the lifecycle of records within administrative information systems. However, there has been little exploration of how to manage datasets within administrative information systems. As a result, this research aims to design a database for a record management module that needs to be integrated into administrative information systems to manage the lifecycle of records. By modifying and supplementing ISO 16175-1:2020, we design a "human resource management system" and identify and evaluate personnel management datasets. Through this, we aim to provide a concrete example of record management within administrative information systems. It is worth noting that the prototype system designed in this research is limited in data volume compared with systems currently in use within organizations, and it has not yet been validated by record researchers and IT developers in the field. However, this endeavor has allowed us to understand the nature of datasets and how they should be managed within administrative information systems, and it has affirmed the need for a record management module's database within administrative information systems.
In the future, once a complete record management module is developed and standards are established by the National Archives, it is expected to become a necessary module for organizations to manage datasets effectively.

A Construction of Geographical Distance-based Air Quality Dataset Using Hospital Location Information (병원위치정보를 이용한 지리적 거리기반의 대기환경 데이터셋 구축)

  • Kim, Hyeongsoo;Ryu, Keun Ho
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
    • /
    • v.34 no.3
    • /
    • pp.231-242
    • /
    • 2016
  • Recently, air quality information has been actively gathered and investigated in order to find possible environmental risk factors that may affect the onset of cardiovascular disease. Nevertheless, existing studies are limited in their level of detail because they rely on macro-level air quality statistics aggregated by administrative district. This paper proposes the construction of a distance-based air quality dataset using domestic hospitals' geographical location information as a reliable data-gathering step toward a more detailed analysis of environmental risk factors. To construct the dataset, air quality information was obtained by taking the geographical location of each hospital in which a patient with cardiovascular disease had been admitted and matching the hospital with the meteorological and air pollution stations in its vicinity. An air quality acquisition system based on GMap.NET was devised for data gathering and visualization. The reliability of the experiment was confirmed by evaluating the matching rate and the error in air quality values between the acquired dataset and existing area-based air quality datasets over the matched distances. This dataset, which considers geographical information, can therefore be utilized in multidisciplinary research into environmental risk factors that can affect not only cardiovascular diseases but potentially other epidemic diseases as well.
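The matching step described above, pairing a hospital with its nearest monitoring station, reduces to a great-circle distance computation. The sketch below uses the haversine formula with invented coordinates; the station names and locations are illustrative, not the study's data.

```python
# Sketch of distance-based hospital-to-station matching. Coordinates
# and station names are hypothetical, not from the study.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

stations = {                       # hypothetical station coordinates
    "station_A": (37.57, 126.98),  # near Seoul
    "station_B": (35.18, 129.08),  # near Busan
}
hospital = (37.50, 127.03)         # hypothetical hospital location

nearest = min(stations,
              key=lambda s: haversine_km(*hospital, *stations[s]))
print(nearest)  # station_A
```

The matched station's meteorological and pollution readings would then be attached to that hospital's patient records to form the distance-based dataset.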

Development of relational river data model based on river network for multi-dimensional river information system (다차원 하천정보체계 구축을 위한 하천네트워크 기반 관계형 하천 데이터 모델 개발)

  • Choi, Seungsoo;Kim, Dongsu;You, Hojun
    • Journal of Korea Water Resources Association
    • /
    • v.51 no.4
    • /
    • pp.335-346
    • /
    • 2018
  • A vast amount of riverine spatial data has recently become available, including hydrodynamic and morphological survey data from advanced instruments such as ADCPs (Acoustic Doppler Current Profilers), transect measurements obtained while establishing various river basic plans, riverine environmental and ecological data, optical images from UAVs, and data on river facilities such as multi-purpose weirs and hydrophilic sectors. A standardized data model has consequently been required in order to efficiently store, manage, and share riverine spatial datasets. Given that riverine spatial datasets such as river facilities, transect measurements, and time-varying observations should be synthetically managed along a specified river network, the conventional data model tends to maintain them individually as separate layers corresponding to each theme, which can lose their spatial relationships and makes it inefficient to derive synthetic information. Moreover, the data model had to be significantly modified to ingest newly produced data, and it hampered efficient searches under specific conditions. To avoid these drawbacks of the layer-based data model, this research proposes a relational data model built around a river network, which serves as a backbone relating additional spatial datasets such as flowlines, river facilities, transect measurements, and surveyed datasets. The new data model is flexible enough to minimize changes to its structure when dealing with any multi-dimensional river data, and it assigns a reach code to each of the multiple river segments delineated from a river. To demonstrate the newly developed data model, it was applied to the Seom River, for which geographic information on national and local rivers is available.
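The network-based relational idea can be sketched with an in-memory SQLite database: reach codes act as the key that ties every theme to the same river segment, so one join replaces overlaying separate GIS layers. Table and column names below are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sqlite3 sketch of a reach-code-keyed relational river model.
# Schema and values are hypothetical, not the paper's design.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE reach (reach_code TEXT PRIMARY KEY, river_name TEXT);
CREATE TABLE facility (
    id INTEGER PRIMARY KEY,
    reach_code TEXT REFERENCES reach(reach_code),
    kind TEXT);
CREATE TABLE transect (
    id INTEGER PRIMARY KEY,
    reach_code TEXT REFERENCES reach(reach_code),
    station_m REAL, depth_m REAL);
""")
cur.execute("INSERT INTO reach VALUES ('R001', 'Seom')")
cur.execute("INSERT INTO facility VALUES (1, 'R001', 'weir')")
cur.execute("INSERT INTO transect VALUES (1, 'R001', 120.0, 2.4)")

# One join along the network key gathers every theme for a reach.
row = cur.execute("""
    SELECT r.river_name, f.kind, t.depth_m
    FROM reach r JOIN facility f USING (reach_code)
                 JOIN transect t USING (reach_code)
    WHERE r.reach_code = 'R001'
""").fetchone()
print(row)  # ('Seom', 'weir', 2.4)
```

Adding a new data theme here means adding one table keyed by `reach_code`, rather than restructuring existing layers, which is the flexibility the abstract emphasizes.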

A study of development for movie recommendation system algorithm using filtering (필터링기법을 이용한 영화 추천시스템 알고리즘 개발에 관한 연구)

  • Kim, Sun Ok;Lee, Soo Yong;Lee, Seok Jun;Lee, Hee Choon;Ji, Seon Su
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.4
    • /
    • pp.803-813
    • /
    • 2013
  • Purchasing items in e-commerce differs somewhat from purchasing items offline. Offline, items are recommended by salespersons; in e-commerce, salespersons cannot make recommendations, so different methods are needed. A recommender system is a method that recommends items in e-commerce. The preferences of customers who want to purchase new items can be predicted from the preferences of customers who purchased existing items, and the items with high estimated preferences can then be recommended. The collaborative filtering algorithm is used in e-commerce recommender systems: a list of recommended items is built from the estimated values and presented to customers. The datasets used in this research are the 100k and 1 million MovieLens datasets; similar results on the two datasets are derived for generalization. To suggest a new algorithm, the distribution features of the estimated values are analyzed under the existing algorithm and a transformed algorithm, and the respondents' distribution features are analyzed respectively. To improve the collaborative filtering algorithm in neighborhood recommender systems, a new algorithm is suggested on the basis of the existing and transformed algorithms.
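The neighborhood collaborative filtering this paper analyzes predicts an unknown rating as a similarity-weighted average of other users' ratings. A minimal sketch, using a tiny invented rating matrix rather than MovieLens:

```python
# Sketch of neighborhood-based collaborative filtering: predict a
# user's rating of an item from similar users' ratings. The 3x4
# matrix is a toy example, not MovieLens data.
import numpy as np

# rows = users, cols = items; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)          # compare co-rated items only
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(user, item):
    """Similarity-weighted average over users who rated the item."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        s = cosine(R[user], R[other])
        num += s * R[other, item]
        den += abs(s)
    return num / den if den else 0.0

print(round(predict(0, 2), 2))  # ≈ 2.22 for this toy matrix
```

Ranking all unrated items by such estimated values yields the recommendation list described in the abstract.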

A Study on the Improvement of the Management Reference Tables for Datasets in Administrative Information Systems (행정정보 데이터세트의 관리기준표 개선방안 연구)

  • Lee, Jung-eun;Kim, Ji-Hye;Wang, Ho-sung;Yang, Dongmin
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.22 no.1
    • /
    • pp.177-200
    • /
    • 2022
  • Administrative information datasets are a kind of record produced based on an organization's work performance. A dataset is evidence of the act of recording and contains a great deal of information that can be used for work. Datasets have long been neglected in Korea's records management system. However, with the revision of the law in 2020, the management of administrative information datasets was legislated, and organizations that require such management have already gradually begun records management. The core of managing administrative information datasets is the preparation of the Management Reference Table for datasets. Nevertheless, institutions that work on records management confuse it with the Records Management Reference Table, and the work is difficult because the Management Reference Table for datasets is a new concept. This study examined the problems in dataset records management that appeared at the beginning of this work and suggests a method to establish records management for datasets effectively. To that end, the Management Reference Table was selected as the research subject, the problems discussed so far were summarized, and the items of the current Management Reference Table were analyzed. As a result of the study, we propose simplifying the items in the Management Reference Table, reorganizing its areas, introducing the concept of retention periods, and a preparation process for the Management Reference Table.

A Time Series Graph based Convolutional Neural Network Model for Effective Input Variable Pattern Learning : Application to the Prediction of Stock Market (효과적인 입력변수 패턴 학습을 위한 시계열 그래프 기반 합성곱 신경망 모형: 주식시장 예측에의 응용)

  • Lee, Mo-Se;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.167-181
    • /
    • 2018
  • Over the past decade, deep learning has been in the spotlight among machine learning algorithms. In particular, CNNs (Convolutional Neural Networks), known as an effective solution for recognizing and classifying images or voices, have been popularly applied to classification and prediction problems. In this study, we investigate how to apply a CNN to business problem solving. Specifically, this study proposes to apply a CNN to stock market prediction, one of the most challenging tasks in machine learning research. As mentioned, CNNs have strength in interpreting images; thus, the model proposed in this study adopts a CNN as a binary classifier that predicts the stock market direction (upward or downward) using time series graphs as its inputs. That is, our proposal is to build a machine learning algorithm that mimics the experts called 'technical analysts', who examine graphs of past price movements to predict future price movements. Our proposed model, named CNN-FG (Convolutional Neural Network using Fluctuation Graph), consists of five steps. In the first step, it divides the dataset into intervals of 5 days. In step 2, it creates time series graphs for the divided dataset; each graph is drawn in an image of 40 x 40 pixels, with the graph of each independent variable drawn in a different color. In step 3, the model converts the images into matrices: each image becomes a combination of three matrices expressing the color value on the R (red), G (green), and B (blue) scales. In the next step, it splits the dataset of graph images into training and validation datasets; we used 80% of the total dataset for training and the remaining 20% for validation. In the final step, CNN classifiers are trained using the images of the training dataset.
Regarding the parameters of CNN-FG, we adopted two convolution filters (5 x 5 x 6 and 5 x 5 x 9) in the convolution layer. In the pooling layer, a 2 x 2 max pooling filter was used. The numbers of nodes in the two hidden layers were set to 900 and 32, respectively, and the number of nodes in the output layer was set to 2 (one for the prediction of an upward trend and the other for a downward trend). The activation function for the convolution layer and the hidden layers was set to ReLU (Rectified Linear Unit), and that for the output layer was set to the Softmax function. To validate our model, CNN-FG, we applied it to the prediction of the KOSPI200 over 2,026 days in eight years (from 2009 to 2016). To match the proportions of the two groups of the dependent variable (i.e., tomorrow's stock market movement), we selected 1,950 samples by random sampling. Finally, we built the training dataset from 80% of the total dataset (1,560 samples) and the validation dataset from the remaining 20% (390 samples). The independent variables of the experimental dataset included twelve technical indicators popularly used in previous studies, such as Stochastic %K, Stochastic %D, Momentum, ROC (rate of change), LW %R (Larry Williams' %R), the A/D oscillator (accumulation/distribution oscillator), OSCP (price oscillator), and CCI (commodity channel index). To confirm the superiority of CNN-FG, we compared its prediction accuracy with those of other classification models. Experimental results showed that CNN-FG outperforms LOGIT (logistic regression), ANN (artificial neural network), and SVM (support vector machine) with statistical significance. These empirical results imply that converting time series business data into graphs and building CNN-based classification models on these graphs can be effective from the perspective of prediction accuracy.
Thus, this paper sheds light on how to apply deep learning techniques to the domain of business problem solving.
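The graph-rendering step (dividing the series into 5-day windows and turning each window into a 40 x 40 RGB image) can be sketched with plain NumPy. This is a rough stand-in for the paper's plotting pipeline: the rasterization is crude, the values are synthetic, and no CNN is attached.

```python
# Rough sketch of CNN-FG's steps 1-3: render a 5-day window of one
# indicator as a 40x40x3 (RGB) array. Rasterization is simplified and
# the window values are synthetic, not KOSPI200 data.
import numpy as np

def window_to_image(values, size=40):
    """Rasterize a short series into a size x size x 3 (RGB) array."""
    img = np.ones((size, size, 3))                  # white background
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    xs = np.linspace(0, size - 1, len(values)).astype(int)
    for x, v in zip(xs, values):
        y = int((v - lo) / span * (size - 1))       # scale value to rows
        img[size - 1 - y, x] = (1.0, 0.0, 0.0)      # red pixel for this day
    return img

window = [101.2, 100.7, 102.3, 103.0, 102.1]        # one 5-day interval
img = window_to_image(window)
print(img.shape)  # (40, 40, 3)
```

Stacking such arrays for all windows and indicators yields the image tensors that a CNN classifier like the one described above would consume.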

ENERGY EFFICIENT BUILDING DESIGN THROUGH DATA MINING APPROACH

  • Hyunjoo Kim;Wooyoung Kim
    • International conference on construction engineering and project management
    • /
    • 2009.05a
    • /
    • pp.601-605
    • /
    • 2009
  • The objective of this research is to develop a knowledge discovery framework that can help project teams discover useful patterns to improve energy-efficient building design. This paper utilizes data mining technology to automatically extract concepts, interrelationships, and patterns of interest from a large dataset. By applying data mining technology to the analysis of energy-efficient building designs, one can identify valid, useful, and previously unknown patterns in energy simulation modeling.
