• Title/Summary/Keyword: Text Model learning


Automatic Extraction of References for Research Reports using Deep Learning Language Model (딥러닝 언어 모델을 이용한 연구보고서의 참고문헌 자동추출 연구)

  • Yukyung Han;Wonsuk Choi;Minchul Lee
    • Journal of the Korean Society for Information Management
    • /
    • v.40 no.2
    • /
    • pp.115-135
    • /
    • 2023
  • The purpose of this study is to assess the effectiveness of using deep learning language models to extract references automatically and create a reference database for research reports in an efficient manner. Unlike academic journals, research reports present difficulties in automatically extracting references due to variations in formatting across institutions. In this study, we addressed this issue by introducing the task of separating references from non-reference phrases, in addition to the commonly used metadata extraction task for reference extraction. The study employed datasets that included various types of references, such as those from research reports of a particular institution, academic journals, and a combination of academic journal references and non-reference texts. Two deep learning language models, namely RoBERTa+CRF and ChatGPT, were compared to evaluate their performance in automatic extraction. They were used to extract metadata, categorize data types, and separate original text. The research findings showed that the deep learning language models were highly effective, achieving maximum F1-scores of 95.41% for metadata extraction and 98.91% for categorization of data types and separation of the original text. These results provide valuable insights into the use of deep learning language models and different types of datasets for constructing reference databases for research reports including both reference and non-reference texts.
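
As a rough illustration of the metadata extraction task described above, the sketch below tags the tokens of a single reference string with metadata labels using a Hugging Face token-classification model. The checkpoint name, label set, and the omission of the CRF layer are assumptions for illustration, not the configuration reported in the paper, and the model would need fine-tuning on labeled reference strings before the tags become meaningful.

```python
# Minimal sketch: tagging tokens of a reference string with metadata labels
# (author / title / journal / year / pages). The checkpoint and label set are
# illustrative assumptions, and the CRF layer of the RoBERTa+CRF model used in
# the paper is omitted here.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-TITLE", "I-TITLE",
          "B-JOURNAL", "I-JOURNAL", "B-YEAR", "B-PAGES"]

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")   # assumed checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "klue/roberta-base", num_labels=len(LABELS)
)  # would be fine-tuned on labeled reference strings before use

reference = "Han, Y., Choi, W., & Lee, M. (2023). Automatic extraction of references, 40(2), 115-135."
inputs = tokenizer(reference, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, idx in zip(tokens, pred_ids):
    print(f"{tok}\t{LABELS[idx]}")                   # per-token metadata tag
```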

Fine-Grained Mobile Application Clustering Model Using Retrofitted Document Embedding

  • Yoon, Yeo-Chan;Lee, Junwoo;Park, So-Young;Lee, Changki
    • ETRI Journal
    • /
    • v.39 no.4
    • /
    • pp.443-454
    • /
    • 2017
  • In this paper, we propose a fine-grained mobile application clustering model using retrofitted document embedding. To automatically determine the clusters and their numbers with no predefined categories, the proposed model initializes the clusters based on title keywords and then merges similar clusters. For improved clustering performance, the proposed model distinguishes between an accurate clustering step with titles and an expansive clustering step with descriptions. During the accurate clustering step, an automatically tagged set is constructed as a result. This set is utilized to learn a high-performance document vector. During the expansive clustering step, more applications are then classified using this document vector. Experimental results showed that the purity of the proposed model increased by 0.19, and the entropy decreased by 1.18, compared with the K-means algorithm. In addition, the mean average precision improved by more than 0.09 in a comparison with a support vector machine classifier.
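
As an illustration of the cluster-merging step mentioned above, the sketch below greedily merges clusters whose centroid cosine similarity exceeds a threshold. The document vectors, the threshold, and the greedy merge rule are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of the cluster-merging idea: repeatedly merge the pair of
# clusters whose centroids are most similar until no pair exceeds a similarity
# threshold. Embeddings, threshold, and the greedy rule are illustrative
# assumptions, not the paper's procedure.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def merge_similar_clusters(clusters, threshold=0.85):
    """clusters: list of (n_i, dim) arrays of document vectors."""
    clusters = [np.asarray(c) for c in clusters]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        centroids = [c.mean(axis=0) for c in clusters]
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroids[i], centroids[j])
                if sim > best:
                    best, best_pair = sim, (i, j)
        if best_pair is not None:
            i, j = best_pair
            clusters[i] = np.vstack([clusters[i], clusters[j]])  # merge j into i
            del clusters[j]
            merged = True
    return clusters

# toy document vectors grouped into three initial "title keyword" clusters
rng = np.random.default_rng(0)
toy = [rng.normal(0, 1, (5, 16)), rng.normal(0, 1, (4, 16)), rng.normal(5, 1, (3, 16))]
print(len(merge_similar_clusters(toy)))   # similar clusters are merged
```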

Prediction of Highly Pathogenic Avian Influenza (HPAI) Diffusion Path Using LSTM (LSTM을 활용한 고위험성 조류인플루엔자(HPAI) 확산 경로 예측)

  • Choi, Dae-Woo;Lee, Won-Been;Song, Yu-Han;Kang, Tae-Hun;Han, Ye-Ji
    • The Journal of Bigdata
    • /
    • v.5 no.1
    • /
    • pp.1-9
    • /
    • 2020
  • This study was funded by the government (Ministry of Agriculture, Food and Rural Affairs) in 2018, with support from the Agricultural, Food, and Rural Affairs Agency (318069-03-HD040), as part of artificial intelligence-based HPAI spread analysis and patterning. The LSTM (Long Short-Term Memory) model, which builds on a deep learning architecture, is actively used in time series analysis and text mining. It emerged to resolve the long-term dependency problem that occurs during the backpropagation through time (BPTT) process of RNNs, handles variable-length sequence data well in forecasting, and is still widely used. In this study, we used Call Detail Record (CDR) data provided by KT to identify the movement paths of people expected to be closely related to the virus, and we present the results of predicting those movement paths by training an LSTM model on them. The results of this study could be used to predict HPAI propagation routes and to select the routes or areas on which quarantine should focus, thereby reducing HPAI spread.
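
Since the KT call detail records are not public, the following is only a minimal sketch of the general idea: a movement path is treated as a sequence of discrete region IDs and an LSTM is trained to predict the next region. The vocabulary size, dimensions, and toy data are assumptions.

```python
# Minimal sketch of next-location prediction with an LSTM: a visit path is a
# sequence of discrete region IDs and the model predicts the next region.
# Sizes and toy data are illustrative assumptions; the study used KT call
# detail records, which are not public.
import torch
import torch.nn as nn

NUM_REGIONS, EMB, HIDDEN = 50, 32, 64

class NextLocationLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_REGIONS, EMB)
        self.lstm = nn.LSTM(EMB, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_REGIONS)

    def forward(self, paths):                    # paths: (batch, seq_len) region IDs
        h, _ = self.lstm(self.embed(paths))      # (batch, seq_len, HIDDEN)
        return self.head(h[:, -1])               # logits for the next region

model = NextLocationLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# toy training step on random paths; real paths would come from CDR trajectories
paths = torch.randint(0, NUM_REGIONS, (8, 10))
next_region = torch.randint(0, NUM_REGIONS, (8,))
loss = loss_fn(model(paths), next_region)
loss.backward()
optimizer.step()
print(float(loss))
```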

Analysis of deep learning-based deep clustering method (딥러닝 기반의 딥 클러스터링 방법에 대한 분석)

  • Hyun Kwon;Jun Lee
    • Convergence Security Journal
    • /
    • v.23 no.4
    • /
    • pp.61-70
    • /
    • 2023
  • Clustering is an unsupervised learning method that groups data based on features such as distance metrics, using data without known labels or ground-truth values. This method has the advantage of being applicable to various types of data, including images, text, and audio, without the need for labeling. Traditional clustering techniques apply dimensionality reduction methods or extract specific features before clustering. However, with the advancement of deep learning models, research has emerged on deep clustering techniques that represent the input data as latent vectors using models such as autoencoders and generative adversarial networks. In this study, we propose a deep clustering technique based on deep learning: an autoencoder transforms the input data into latent vectors, a vector space is constructed according to the cluster structure, and k-means clustering is performed. We conducted experiments on the MNIST and Fashion-MNIST datasets, with the PyTorch machine learning library as the experimental environment. The model used is a convolutional neural network-based autoencoder. The experimental results show an accuracy of 89.42% for MNIST and 56.64% for Fashion-MNIST when k is set to 10.
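
A minimal sketch of the autoencoder-plus-k-means pipeline described above, on MNIST with k = 10, is given below. The architecture sizes and the short training loop are assumptions for illustration, not the exact model reported in the paper.

```python
# Minimal sketch of autoencoder-based deep clustering on MNIST: encode images
# into latent vectors with a convolutional autoencoder, then run k-means with
# k = 10 on the latents. Sizes and the one-epoch loop are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from sklearn.cluster import KMeans

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=256, shuffle=True)

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

model.train()
for epoch in range(1):                      # a real run would train much longer
    for images, _ in loader:
        recon, _ = model(images)
        loss = loss_fn(recon, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# cluster the latent vectors with k-means, k = 10
model.eval()
latents = []
with torch.no_grad():
    for images, _ in DataLoader(data, batch_size=512):
        latents.append(model.encoder(images))
latents = torch.cat(latents).numpy()
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(latents)
print(clusters[:20])
```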

TextNAS Application to Multivariate Time Series Data and Hand Gesture Recognition (textNAS의 다변수 시계열 데이터로의 적용 및 손동작 인식)

  • Kim, Gi-duk;Kim, Mi-sook;Lee, Hack-man
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2021.10a
    • /
    • pp.518-520
    • /
    • 2021
  • In this paper, we propose a hand gesture recognition method that modifies TextNAS, originally used for text classification, so that it can be applied to multivariate time series data. The approach can be applied to various fields such as behavior recognition, emotion recognition, and hand gesture recognition through multivariate time series classification. In addition, it automatically finds a deep learning model suited to the classification task through training, reducing the burden on users and achieving high classification accuracy. Applying the proposed method to the DHG-14/28 and Shrec'17 hand gesture recognition datasets yielded higher classification accuracy than existing models: 98.72% and 98.16% on the 14-class and 28-class settings of DHG-14/28, and 97.82% and 98.39% on the 14-class and 28-class settings of Shrec'17.
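
The architecture found by TextNAS is not reproduced here; purely as a stand-in illustration of classifying a multivariate time series (e.g. hand-joint coordinates per frame), the sketch below feeds a (features, time) tensor to a small 1D CNN. The input shape and the 28-class setting are assumptions, and this is not the searched network itself.

```python
# Illustration only: a multivariate time series (e.g. hand-joint coordinates
# per frame) is treated as a (channels, time) tensor and classified with a
# small 1D CNN as a stand-in for the architecture found by TextNAS.
import torch
import torch.nn as nn

NUM_FEATURES, SEQ_LEN, NUM_CLASSES = 66, 100, 28   # e.g. 22 joints x 3 coords (assumed)

class GestureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(NUM_FEATURES, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                 # pool over the time axis
            nn.Flatten(),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, x):                            # x: (batch, features, time)
        return self.net(x)

model = GestureCNN()
frames = torch.randn(4, NUM_FEATURES, SEQ_LEN)       # toy batch of gesture sequences
print(model(frames).shape)                           # torch.Size([4, 28])
```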


Fire Accident Analysis of Hazardous Materials Using Data Analytics (Data Analytics를 활용한 위험물 화재사고 분석)

  • Shin, Eun-Ji;Koh, Moon-Soo;Shin, Dongil
    • Journal of the Korean Institute of Gas
    • /
    • v.24 no.5
    • /
    • pp.47-55
    • /
    • 2020
  • Hazardous materials accidents are not limited to the leakage of the material itself; if the early response is not appropriate, they can lead to a fire or an explosion, which increases the scale of the damage. However, even as the 4th industrial revolution and the rise of the big data era are being discussed, systematic analysis of hazardous materials accidents based on new techniques has not been attempted, and only simple statistics have been collected. In this study, we perform a systematic analysis, using machine learning, of the fire accident data for the past 11 years (2008~2018) accumulated by the National Fire Service. The analysis results are visualized and presented through text mining analysis, and the possibility of developing a damage-scale prediction model is explored by applying regression analysis to the main factors present in the hazardous materials fire accident data.
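
As a rough illustration of combining text mining with regression for damage-scale prediction, the sketch below vectorizes free-text accident descriptions with TF-IDF and fits a ridge regression to a toy damage measure. The records, fields, and target are assumptions; the National Fire Service data used in the study are not public.

```python
# Minimal sketch of the text-mining-plus-regression idea: vectorize free-text
# accident descriptions with TF-IDF and fit a regression model to a damage
# measure. The toy records and target values are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

descriptions = [
    "gasoline leak ignited near storage tank",
    "small solvent fire extinguished during early response",
    "explosion after delayed response to chemical spill",
    "electrical spark ignited flammable vapor in warehouse",
]
damage_scale = [8.0, 1.5, 9.5, 6.0]          # toy damage-scale values

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(descriptions, damage_scale)
print(model.predict(["solvent leak with delayed early response"]))
```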

A Study on Text Mining Analysis of Presidential Maritime Concept in KOREA (텍스트마이닝을 이용한 한국 대통령의 해양관에 관한 연구)

  • Kim, Sung-Kuk;Lee, Tae-Hwee
    • Journal of Korea Port Economic Association
    • /
    • v.36 no.3
    • /
    • pp.39-54
    • /
    • 2020
  • In a presidential political system, the words of the president have great influence on the formation of national policy and the decision-making process. Policy priorities are determined according to the president's ideology and core values, and various policies are established and executed according to those priorities. Therefore, this paper analyzes the contents of presidential speeches. Since presidential speeches are unstructured text data, big data analysis is conducted using machine learning and deep learning methods. In this study, presidential speeches at the "National Sea Day" commemoration from 1996 onwards were collected and analyzed using topic modeling. The analysis showed that every president's speech conveyed a view of the ocean consistent with the policy direction of their administration. It also confirmed that the ocean-industry-resource topics, which represent the intrinsic values of the ocean, were consistently emphasized by all presidents without being diluted.
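
A minimal sketch of topic modeling on speech text is shown below, using scikit-learn's LDA as one common topic-modeling choice; the abstract does not specify the exact model. The toy English sentences and the number of topics are assumptions; the study analyzed Korean "National Sea Day" speeches.

```python
# Minimal sketch of topic modeling on speech texts with LDA. The toy corpus
# and the number of topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

speeches = [
    "the ocean is the future growth engine of our maritime industry",
    "we will expand port infrastructure and shipping competitiveness",
    "marine resources and fisheries must be protected for future generations",
    "ocean industry and logistics will drive the national economy",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(speeches)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]   # top words per topic
    print(f"topic {k}: {', '.join(top)}")
```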

Classification Accuracy by Deviation-based Classification Method with the Number of Training Documents (학습문서의 개수에 따른 편차기반 분류방법의 분류 정확도)

  • Lee, Yong-Bae
    • Journal of Digital Convergence
    • /
    • v.12 no.6
    • /
    • pp.325-332
    • /
    • 2014
  • It is generally accepted that classification accuracy is affected by the number of training documents, but few studies show how this influences automatic text classification. This study evaluates the recently developed deviation-based classification model for genre-based classification and compares it to other classification algorithms as the number of training documents changes. Experimental results show that the deviation-based classification model achieves a superior accuracy of 0.8 when categorizing 7 genres with only 21 training documents, exceeding the accuracy of Bayesian and SVM classifiers. The deviation-based classification model retains strong feature selection capability even with a small number of training documents because it learns subject information within each genre, whereas the other methods use a different learning process.
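
The paper's deviation-based model is not fully specified in this abstract, so the sketch below is only one plausible reading offered for illustration: build a mean term-weight profile per genre and assign a document to the genre from whose profile it deviates least. This should not be read as the authors' algorithm.

```python
# Hedged illustration only: one plausible reading of a deviation-based
# classifier builds a mean term-weight profile per genre and assigns a new
# document to the genre from whose profile it deviates least. This is an
# assumption for illustration, not the algorithm defined in the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["stocks rose as markets rallied", "the bank raised interest rates",
              "the team won the championship game", "the striker scored two goals"]
train_genres = ["economy", "economy", "sports", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs).toarray()

# mean term profile per genre
profiles = {g: X[[i for i, y in enumerate(train_genres) if y == g]].mean(axis=0)
            for g in set(train_genres)}

def classify(text):
    v = vectorizer.transform([text]).toarray()[0]
    # smallest total deviation from the genre profile wins
    return min(profiles, key=lambda g: np.abs(v - profiles[g]).sum())

print(classify("the market fell after the rate decision"))
```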

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.105-122
    • /
    • 2019
  • Dimensionality reduction is one of the methods used to handle big data in text mining. For dimensionality reduction, we should consider the density of the data, which has a significant influence on the performance of sentence classification. Data of higher dimensions require many computations, which can lead to high computational cost and overfitting in the model; thus, a dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed, from merely reducing noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. In addition, the representation and selection of text features affect the performance of the classifier in sentence classification, one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words that capture semantic and syntactic information, are also utilized. To improve performance, recent studies have suggested methods that modify the word dictionary according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm marks certain words as unimportant, we assume that words similar to them also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification, which conduct selective word elimination under specific rules and construct word embeddings based on Word2Vec. To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words that have comparatively low information gain values from the raw text and build word embeddings. Second, we additionally select words that are similar to the words with low information gain values and build word embeddings. Finally, the filtered text and word embeddings are fed to deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews from Amazon.com (Kindle), IMDB, and Yelp as datasets, and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were classified as helpful reviews. Because Yelp only shows the number of helpful votes, we randomly sampled 100,000 reviews with more than five helpful votes from among 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared their performance against Word2Vec and GloVe embeddings that use all the words. One of the proposed methods outperformed the embeddings built from all the words: removing unimportant words improved performance, although removing too many words lowered it.
For future research, diverse preprocessing approaches and an in-depth analysis of word co-occurrence should be considered for measuring similarity among words. Also, we applied the proposed method only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo could be combined with the proposed elimination methods, and the possible combinations between embedding and elimination methods could be explored.
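
As a minimal sketch of the selective elimination idea described above, the code below approximates information gain with mutual information, marks low-scoring words for removal, expands the removal set with their Word2Vec nearest neighbors, and filters the text before any embedding or classification step. The toy corpus, thresholds, and the mutual-information approximation are assumptions, not the paper's exact setup.

```python
# Minimal sketch of selective word elimination: score each word's relevance to
# the class label (approximated with mutual information), mark low-scoring
# words for removal, and also remove their Word2Vec nearest neighbors.
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["this kindle review was really helpful and detailed",
        "totally useless review with no real information",
        "helpful detailed notes on battery and screen",
        "no information just random unrelated words"]
labels = [1, 0, 1, 0]                      # 1 = helpful review, 0 = not helpful

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
words = vectorizer.get_feature_names_out()
scores = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

low_score = {w for w, s in zip(words, scores) if s < 0.1}   # assumed threshold

# expand the removal set with words similar to the low-scoring words
w2v = Word2Vec([d.split() for d in docs], vector_size=32, min_count=1, seed=0)
to_remove = set(low_score)
for w in low_score:
    if w in w2v.wv:
        to_remove.update(similar for similar, _ in w2v.wv.most_similar(w, topn=2))

filtered_docs = [" ".join(t for t in d.split() if t not in to_remove) for d in docs]
print(filtered_docs)   # these filtered texts would then feed the CNN / BiLSTM
```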

The Detection of Online Manipulated Reviews Using Machine Learning and GPT-3 (기계학습과 GPT-3를 사용한 조작된 리뷰의 탐지)

  • Chernyaeva, Olga;Hong, Taeho
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.4
    • /
    • pp.347-364
    • /
    • 2022
  • Fraudulent companies or sellers strategically manipulate reviews to influence customers' purchase decisions; therefore, the reliability of reviews has become crucial for customer decision-making. Since customers increasingly rely on online reviews to search for detailed information about products or services before purchasing, many researchers focus on detecting manipulated reviews. The main problem in detecting manipulated reviews, however, is the difficulty of obtaining enough manipulated-review data to train machine learning techniques. In addition, the number of manipulated reviews is small compared with the number of non-manipulated reviews, so a class imbalance problem occurs: the under-represented class can hamper a model's accuracy, and solving the imbalance is important for building an accurate model for detecting manipulated reviews. Thus, we propose an OpenAI-based review generation model to solve the manipulated-review imbalance problem and thereby enhance the accuracy of manipulated-review detection. In this research, we applied the autoregressive language model GPT-3 to generate reviews based on manipulated reviews. We found that using GPT-3 to oversample manipulated reviews recovers a satisfactory portion of the performance losses and yields better classification performance (with logit, decision tree, and neural network classifiers) than traditional oversampling models such as random oversampling and SMOTE.
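
For comparison purposes only, the sketch below sets up the class-imbalance baseline with SMOTE from imbalanced-learn on TF-IDF features; the GPT-3 generation step is represented by a clearly hypothetical placeholder function rather than an actual API call, since the prompt design and API usage are not detailed in the abstract.

```python
# Sketch of the class-imbalance setup the paper addresses: a SMOTE baseline on
# TF-IDF features, plus a placeholder (hypothetical, not an actual GPT-3 call)
# where language-model-generated manipulated reviews would be appended instead.
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

reviews = ["great product fast shipping", "best ever buy now amazing deal",
           "works as described", "decent quality for the price",
           "honestly satisfied with the purchase", "arrived late but works fine"]
labels = [1, 1, 0, 0, 0, 0]               # 1 = manipulated (minority class)

X = TfidfVectorizer().fit_transform(reviews)
X_res, y_res = SMOTE(k_neighbors=1, random_state=0).fit_resample(X, labels)
print(X_res.shape, sum(y_res))            # classes are balanced after SMOTE

def generate_synthetic_reviews(seed_reviews, n):
    """Hypothetical stand-in: in the paper, GPT-3 generates new manipulated
    reviews conditioned on real ones; an actual implementation would call a
    language-model API here instead of duplicating the seeds."""
    return (seed_reviews * (n // len(seed_reviews) + 1))[:n]
```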