• Title/Abstract/Keywords: datasets

Zero-shot Korean Sentiment Analysis with Large Language Models: Comparison with Pre-trained Language Models

  • Soon-Chan Kwon;Dong-Hee Lee;Beak-Cheol Jang
    • 한국컴퓨터정보학회논문지 (Journal of the Korea Society of Computer and Information) / Vol. 29, No. 2 / pp. 43-50 / 2024
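  • This paper evaluates the Korean sentiment analysis performance of large language models such as GPT-3.5 and GPT-4 using a zero-shot approach through the ChatGPT API, and compares them with pre-trained Korean language models such as KoBERT. The experiments use Korean sentiment analysis datasets from several domains, including movies, games, and shopping, to verify the effectiveness of the models. The LMKor-ELECTRA model achieved the highest performance by F1-score, while GPT-4 recorded high accuracy and F1-score on the movie and shopping datasets in particular. This suggests that zero-shot large language models can perform well on Korean sentiment analysis without prior training on a specific dataset. However, the relatively low performance on some datasets can be regarded as a limitation of the zero-shot approach. The study explores the potential of large language models for Korean sentiment analysis and offers important implications for future research in this area.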
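The abstract above compares models by accuracy and F1-score. As a minimal sketch of how those two metrics are computed for binary sentiment labels (the function names and the toy labels are hypothetical, not taken from the paper):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = positive review, 0 = negative review.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
acc = accuracy(gold, pred)
f1 = f1_score(gold, pred)
```

Accuracy counts every label equally, whereas F1 ignores true negatives, so the two can rank models differently on imbalanced datasets.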

Construction of Text Summarization Corpus in Economics Domain and Baseline Models

  • Sawittree Jumpathong;Akkharawoot Takhom;Prachya Boonkwan;Vipas Sutantayawalee;Peerachet Porkaew;Sitthaa Phaholphinyo;Charun Phrombut;Khemarath Choke-mangmi;Saran Yamasathien;Nattachai Tretasayuth;Kasidis Kanwatchara;Atiwat Aiemleuk;Thepchai Supnithi
    • Journal of information and communication convergence engineering / Vol. 22, No. 1 / pp. 33-43 / 2024
  • Automated text summarization (ATS) systems rely on language resources as datasets. However, creating these datasets is a complex and labor-intensive task that requires linguists to annotate the data extensively. Consequently, publicly accessible ATS datasets for languages such as Thai are scarce compared with those for more widely used languages. The primary objective of ATS is to condense large volumes of text into shorter summaries, thereby reducing the time required to extract information from extensive textual data. This study introduced ThEconSum, an ATS architecture specifically designed for the Thai language, using economy-related data. The evaluation revealed significant remaining tasks and limitations in summarizing the Thai language.

Plurality Rule-based Density and Correlation Coefficient-based Clustering for K-NN

  • Aung, Swe Swe;Nagayama, Itaru;Tamaki, Shiro
    • IEIE Transactions on Smart Processing and Computing / Vol. 6, No. 3 / pp. 183-192 / 2017
  • k-nearest neighbor (K-NN) is a well-known classification algorithm in machine learning that labels a sample according to the nearest training examples in feature space. However, K-NN is a lazy learning method. Therefore, if a K-NN-based system depends on a huge amount of history data to achieve an accurate prediction for a particular task, its processing time gradually degrades. We have noticed that many researchers consider only classification accuracy, but estimation speed also plays an essential role in real-time prediction systems. To compensate for this weakness, this paper proposes correlation coefficient-based clustering (CCC), aimed at improving the processing time of K-NN, and plurality rule-based density (PRD), aimed at improving estimation accuracy. For experiments, we used real datasets (on breast cancer, breast tissue, heart, and the iris) from the University of California, Irvine (UCI) machine learning repository. Moreover, real traffic data collected from Ojana Junction, Route 58, Okinawa, Japan, was also used to demonstrate the efficiency of this method. Using these datasets, we showed better processing-time performance with the new approach than with classical K-NN. In addition, via experiments on real-world datasets, we compared the prediction accuracy of our approach with density peaks clustering based on K-NN and principal component analysis (DPC-KNN-PCA).
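To make the "lazy learning" cost in the abstract above concrete, here is a minimal sketch of classical K-NN with a plurality vote. It is the baseline the paper compares against, not the proposed CCC or PRD method; the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by plurality vote among its k nearest training points."""
    # Lazy learning: nothing is precomputed, so every query scans the
    # whole training set. This is the processing-time cost that grows
    # with the amount of history data.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(points, labels, (0.5, 0.5))
far_corner = knn_predict(points, labels, (5.5, 5.5))
```

Clustering the training data first, as the abstract describes, would let a query be compared against a subset of points rather than all of them.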

DeepCleanNet: Training Deep Convolutional Neural Network with Extremely Noisy Labels

  • Olimov, Bekhzod;Kim, Jeonghong
    • 한국멀티미디어학회논문지 (Journal of Korea Multimedia Society) / Vol. 23, No. 11 / pp. 1349-1360 / 2020
  • In recent years, convolutional neural networks (CNNs) have been successfully applied to a range of computer vision tasks. Because CNNs are supervised learning models, they require large amounts of data to train classifiers, and obtaining correctly labelled data is imperative for state-of-the-art performance. However, labelling datasets is a tedious and expensive process, so real-life datasets often contain incorrect labels. Although the issue of poorly labelled datasets has been studied before, we have noticed that existing methods are very complex and hard to reproduce. Therefore, in this work we propose DeepCleanNet, a considerably simpler system that achieves competitive results compared with existing methods. We use the K-means clustering algorithm to select data with correct labels and train a deep CNN model on the resulting dataset. The technique achieves competitive results in both the training and validation stages. In experiments on the MNIST database of handwritten digits with 50% corrupted labels, we achieved increases of up to 10% and 20% in training and validation accuracy, respectively.
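The abstract above does not say how the 50% label corruption was produced. The sketch below assumes symmetric noise, where each selected label is replaced by a different class chosen uniformly at random; this is a common setup for noisy-label experiments, not necessarily the authors' procedure. The function name and parameters are hypothetical.

```python
import random


def corrupt_labels(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` replaced by wrong classes.

    Assumption: symmetric noise (every wrong class is equally likely).
    """
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = list(labels)
    n_flip = int(round(noise_rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        wrong_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy


# Hypothetical toy labels standing in for the ten MNIST digit classes.
clean = [i % 10 for i in range(100)]
noisy = corrupt_labels(clean, num_classes=10, noise_rate=0.5)
changed = sum(1 for a, b in zip(clean, noisy) if a != b)
```

Because each replacement is drawn only from the wrong classes, exactly the requested fraction of labels ends up incorrect.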

Super Resolution Fusion Scheme for General- and Face Dataset (범용 데이터 셋과 얼굴 데이터 셋에 대한 초해상도 융합 기법)

  • 문준원;김재석
    • 한국멀티미디어학회논문지 (Journal of Korea Multimedia Society) / Vol. 22, No. 11 / pp. 1242-1250 / 2019
  • Super resolution aims to convert a low-resolution image with coarse details into a corresponding high-resolution image with refined details. In the past decades, performance has improved greatly owing to progress in deep learning models. However, a universal solution for various objects is still a challenging issue. We observe that super resolution learned from a general dataset performs poorly on faces. In this paper, we propose a super resolution fusion scheme that works well for both general and face datasets to achieve a more universal solution. In addition, an object-specific feature extractor is employed for better reconstruction performance. In our experiments, we compare our fusion image with super-resolved images from one of the state-of-the-art deep learning models trained with the DIV2K and FFHQ datasets. Quantitative and qualitative evaluations show that our fusion scheme works well for both datasets. We expect our fusion scheme to be effective on other objects with poor performance, leading to universal solutions.

Modal identifiability of a cable-stayed bridge using proper orthogonal decomposition

  • Li, M.;Ni, Y.Q.
    • Smart Structures and Systems / Vol. 17, No. 3 / pp. 413-429 / 2016
  • Recent research on proper orthogonal decomposition (POD) has revealed the linkage between proper orthogonal modes and linear normal modes. This paper presents an investigation into the modal identifiability of an instrumented cable-stayed bridge using an adapted POD technique with a band-pass filtering scheme. The band-pass POD method is applied to the datasets available for this benchmark study, aiming to identify the vibration modes of the bridge and to find the so-called deficient modes, which are unidentifiable under normal excitation conditions. It turns out that the second mode of the bridge cannot be stably identified under weak wind conditions and is therefore regarded as a deficient mode. To judge whether the deficient mode is due to its low contribution to the structural response under weak wind conditions, modal coordinates are derived for different modes by the band-pass POD technique, and an energy participation factor is defined to evaluate the energy participation of each vibration mode under different wind excitation conditions. From the non-blind datasets, it is found that the vibration modes can be reliably identified only when the energy participation factor exceeds a certain threshold value. With the identified threshold value, modal identifiability is then examined using the blind datasets from the same structure.

Approximate k values using Repulsive Force without Domain Knowledge in k-means

  • Kim, Jung-Jae;Ryu, Minwoo;Cha, Si-Ho
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 14, No. 3 / pp. 976-990 / 2020
  • The k-means algorithm is widely used in academia and industry because it is easy and simple to implement and enables fast learning on complex datasets. However, k-means struggles to classify datasets without prior knowledge of specific domains. In a previous study, we proposed the repulsive k-means (RK-means) algorithm to improve k-means using the concept of repulsive force, which allows unnecessary cluster centroids to be deleted. Accordingly, RK-means can classify a dataset without domain knowledge. However, three main problems remain: the RK-means algorithm includes a cluster repulsive force offset for clusters confined in other clusters, which can cause cluster locking; we were unable to prove in the previous study that RK-means provides optimal convergence; and RK-means showed better performance only for certain normalize term and weight values. Therefore, this paper proposes the advanced RK-means (ARK-means) algorithm to resolve these problems. We establish an initialization strategy for deploying cluster centroids and define a metric for the ARK-means algorithm. Finally, we redefine the mass and normalize terms to suit general datasets. We show the feasibility of ARK-means experimentally using blob and iris datasets. The experimental results verify that the proposed ARK-means algorithm provides better performance than k-means, k'-means, and RK-means.
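For context on the abstract above, the following is a minimal sketch of plain k-means (Lloyd's algorithm), not RK-means or ARK-means. It shows why the number of clusters must be fixed in advance: the caller supplies the initial centroids, and the algorithm never adds or removes one. Function names and toy data are hypothetical.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means; k is fixed by the number of initial centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data: two square blobs.
data = [(0, 0), (0, 2), (2, 0), (2, 2), (10, 10), (10, 12), (12, 10), (12, 12)]
centers, cluster_ids = kmeans(data, initial_centroids=[(0, 0), (12, 12)])
```

A method that deletes unnecessary centroids, as the abstract describes, could start from a deliberately large number of centroids instead of requiring the right k up front.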

Disease Prediction Using Ranks of Gene Expressions

  • Kim, Ki-Yeol;Ki, Dong-Hyuk;Chung, Hyun-Cheol;Rha, Sun-Young
    • Genomics & Informatics / Vol. 6, No. 3 / pp. 136-141 / 2008
  • A large number of studies have been performed to identify biomarkers that will allow efficient detection and determination of the precise status of a patient’s disease. The use of microarrays to assess biomarker status is expected to improve prediction accuracies, because a whole-genome approach is used. Despite their potential, however, patient samples can differ with respect to biomarker status when analyzed on different platforms, making it more difficult to make accurate predictions, because bias may exist between any two different experimental conditions. Because of this difficulty in experimental standardization of microarray data, it is currently difficult to utilize microarray-based gene sets in the clinic. To address this problem, we propose a method that predicts disease status using gene expression data that are transformed by their ranks, a concept that is easily applied to two datasets that are obtained using different experimental platforms. NCI and colon cancer datasets, which were assessed using both Affymetrix and cDNA microarray platforms, were used for method validation. Our results demonstrate that the proposed method is able to achieve good predictive performance for datasets that are obtained under different experimental conditions.
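The rank transformation described in the abstract above can be sketched as follows: each sample's expression values are replaced by their ranks within that sample, so measurements on different scales become comparable. Assigning tied values their average rank is an assumption made here; the abstract does not state how ties are handled. The function name and the toy values are hypothetical.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1
    return ranks


# Hypothetical expression values for the same four genes on two platforms.
platform_a = [120.0, 15.0, 560.0, 48.0]  # e.g. intensity units
platform_b = [2.1, 0.3, 7.9, 0.9]        # e.g. ratio units
ranks_a = rank_transform(platform_a)
ranks_b = rank_transform(platform_b)
```

Although the raw scales differ by orders of magnitude, both samples map to the same rank vector, which is what makes a rank-based predictor portable across platforms.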

A Study on Driving Simulation and Efficiency Maps with Nonlinear IPMSM Datasets

  • Kim, Won-Ho;Jang, Ik-Sang;Lee, Ki-Doek;Im, Jong-Bin;Jin, Chang-Sung;Koo, Dae-Hyun;Lee, Ju
    • Journal of Magnetics / Vol. 16, No. 1 / pp. 71-73 / 2011
  • Hybrid electric vehicles have attracted much attention of late, emphasizing the necessity of developing traction motors with a high input current and a wide speed range. Among such traction motors, interior permanent-magnet synchronous motors (IPMSMs), which offer high power density and mechanical solidity, have been widely studied. Owing to the complexity of their parameters, however, with nonlinear motor characteristics and current vector control, it is difficult to accurately estimate the base speed within an actual operating speed range or a voltage limit. Moreover, it is impossible to construct an efficiency map, as the efficiency differs according to the control mode. In this study, a simulation method for operating performance that considers the nonlinearity of the IPMSM is proposed. For this, datasets of various nonlinear parameters were built via the finite-element method and interpolation. Maximum torque-per-ampere and flux-weakening control were accurately simulated using the datasets, and an IPMSM efficiency map was accurately constructed based on the simulation. Lastly, the validity of the simulation was verified through tests.

Incremental Fuzzy Clustering Based on a Fuzzy Scatter Matrix

  • Liu, Yongli;Wang, Hengda;Duan, Tianyi;Chen, Jingli;Chao, Hao
    • Journal of Information Processing Systems / Vol. 15, No. 2 / pp. 359-373 / 2019
  • For clustering large-scale data that cannot be loaded into memory entirely, incremental clustering algorithms are very popular. Usually, these algorithms consider only within-cluster compactness and ignore between-cluster separation. In this paper, we propose two incremental fuzzy compactness and separation (FCS) clustering algorithms, Single-Pass FCS (SPFCS) and Online FCS (OFCS), based on a fuzzy scatter matrix. First, we introduce two incremental clustering methods called the single-pass and online fuzzy C-means algorithms. Then, we combine these two methods separately with the weighted fuzzy C-means algorithm so that they can be applied to the FCS algorithm. Afterwards, we optimize the within-cluster matrix and between-cluster matrix simultaneously to obtain the minimum within-cluster distance and maximum between-cluster distance. Finally, large-scale datasets can be clustered well within limited memory. We conducted experiments on artificial datasets and real datasets separately. The experimental results show that, compared with SPFCM and OFCM, our SPFCS and OFCS are more robust to the value of the fuzzy index m and to noise.
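The fuzzy index m mentioned in the abstract above controls how soft the cluster memberships are. The sketch below shows the standard fuzzy C-means membership update for a single point; it is not the FCS, SPFCS, or OFCS algorithm from the paper. The function name and the toy values are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy C-means memberships of `point` in each cluster (requires m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers: split full
        # membership among those centers to avoid division by zero.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical toy case: a point at distance 1 and 2 from two centers.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=2.0)
u_softer = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=3.0)
```

With m = 2 the nearer center receives membership 0.8; raising m to 3 pulls the memberships toward equality, which is why results can be sensitive to the choice of m.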