• Title/Summary/Keyword: 자연어 처리 기법 (natural language processing techniques)

Comparison between Word Embedding Techniques in Traditional Korean Medicine for Data Analysis: Implementation of a Natural Language Processing Method (한의학 고문헌 데이터 분석을 위한 단어 임베딩 기법 비교: 자연어처리 방법을 적용하여)

  • Oh, Junho
    • Journal of Korean Medical classics / v.32 no.1 / pp.61-74 / 2019
  • Objectives: The purpose of this study is to help select an appropriate word embedding method when analyzing East Asian traditional medicine texts as data. Methods: Based on prescription data that imply traditional methods in traditional East Asian medicine, we examined four count-based and two prediction-based word embedding methods. To compare these methods intuitively, we proposed a "prescription generating game" and compared its results with those obtained from the six methods. Results: When adjacent vectors were extracted, the count-based methods derived the main herbs that are frequently used in conjunction with each other, whereas the prediction-based methods derived synonyms of the herbs. Conclusions: Count-based word embedding methods seem to be more effective than prediction-based methods for analyzing the use of domesticated herbs. Among count-based methods, the TF vector tends to exaggerate the frequency effect, so the TF-IDF vector or co-word vector may be a more reasonable choice. The t-score vector may be recommended when searching for unusual information that frequency alone does not reveal. Prediction-based embedding, on the other hand, seems to be effective for deriving the bases of similar meanings in context.
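
The count-based idea in this abstract can be illustrated with a minimal sketch. It assumes each herb is represented by its occurrence counts across prescriptions and that "adjacent vectors" are the nearest neighbors by cosine similarity. The prescriptions, herb names, and function names below are made up for illustration; the paper's actual TF, TF-IDF, co-word, and t-score constructions may differ.

```python
import math

# Toy prescriptions (made-up data for illustration, not taken from the paper).
prescriptions = [
    {"ginseng", "licorice", "ginger"},
    {"ginseng", "licorice", "jujube"},
    {"ephedra", "cinnamon", "licorice"},
    {"ephedra", "cinnamon", "apricot_kernel"},
]

def count_vectors(docs):
    """Represent each herb by its occurrence count in every prescription."""
    vocab = sorted(set().union(*docs))
    return {herb: [1 if herb in doc else 0 for doc in docs] for herb in vocab}

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest(herb, vectors, k=1):
    """Return the k herbs whose vectors are most similar to the given herb."""
    scored = [
        (cosine(vectors[herb], vec), other)
        for other, vec in vectors.items()
        if other != herb
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

vectors = count_vectors(prescriptions)
print(nearest("ephedra", vectors))
print(nearest("ginseng", vectors))
```

In this toy data the nearest neighbor of `ephedra` is `cinnamon`, and that of `ginseng` is `licorice`. Each pair shares two prescriptions, which matches the abstract's observation that count-based vectors surface herbs that are used together.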

Structuring Risk Factors of Industrial Incidents Using Natural Language Process (자연어 처리 기법을 활용한 산업재해 위험요인 구조화)

  • Kang, Sungsik;Chang, Seong Rok;Lee, Jongbin;Suh, Yongyoon
    • Journal of the Korean Society of Safety / v.36 no.1 / pp.56-63 / 2021
  • The narrative texts of industrial accident reports help to identify accident risk factors. They relate the accident triggers to the sequence of events and the outcomes of an accident. In particular, a set of related keywords in the context of the narrative can represent how the accident proceeded. Previous studies on text analytics for structuring accident reports have been limited to extracting individual keywords without context. To remedy this shortcoming, we propose a context-based analysis using a Natural Language Processing (NLP) algorithm. This study applies Word2Vec, a word embedding algorithm in NLP, to extract adjacent keywords using a neural network based on supervised learning. During processing, Word2Vec takes adjacent keywords in the narrative texts as inputs for its supervised learning; the resulting keyword weights are vectors that represent the degree of neighboring among keywords. Similar keyword weights mean that the keywords are closely arranged within sentences in the narrative text. Consequently, a set of keywords that have similar weights represents similar accidents. We extracted ten accident processes containing related keywords and used them to understand the risk factors determining how an accident proceeds. This information helps identify how a checklist for an accident report should be structured.
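
How adjacent keywords become training inputs can be sketched as follows. The sketch assumes the skip-gram variant of Word2Vec (the abstract does not say whether skip-gram or CBOW was used). The sentence, window size, and the function name `skipgram_pairs` are hypothetical. The training targets come from the text itself: each word is paired with its neighbors, which is the sense in which the abstract speaks of supervised learning.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with the tokens at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Made-up accident narrative, already tokenized.
tokens = ["worker", "slipped", "on", "wet", "scaffold"]
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A network trained to predict the context word from the center word ends up assigning similar weight vectors to keywords that share neighbors. That is the property the study relies on to group keywords into accident processes.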

Data Augmentation for Generating Counter Narratives against Hate Speech (혐오 표현에 대한 대응 발화 생성을 위한 데이터 증강 기법)

  • Seungyoon Lee;Suhyune Son;Dahyun Jung;Chanjun Park;Aram So;Heuiseok Lim
    • Annual Conference on Human and Language Technology / 2022.10a / pp.10-15 / 2022
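  • Hate speech occurring online is one of the major problems facing society. In response, recent studies have used artificial intelligence to mitigate hate speech in practice through paired counter narratives intended to reform the original utterance. However, building counter narratives suited to each instance of hate speech requires many experts, so constructing the data takes considerable time and cost, and generating counter narratives is itself regarded as a difficult problem. To alleviate this problem, this paper proposes a simple and fast data augmentation method that automatically augments data by applying semantic search to previously constructed hate speech data. To verify the validity of the proposed process and the effect of the augmented sentences, we conducted comparative experiments based on pretrained models. The results show that applying the proposed process yields a large performance improvement over models that do not use it.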

Product Planning using Sentiment Analysis Technique Based on CNN-LSTM Model (CNN-LSTM 모델 기반의 감성분석을 이용한 상품기획 모델)

  • Kim, Do-Yeon;Jung, Jin-Young;Park, Won-Cheol;Park, Koo-Rack
    • Proceedings of the Korean Society of Computer Information Conference / 2021.07a / pp.427-428 / 2021
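  • With the development of information and communication technology, e-commerce has grown and consumers actively share their experience and knowledge of products, so consumers now collect and use such information before making a purchase. Accordingly, to gain an advantage in a market where products with diverse features compete fiercely, companies are pursuing research that uses text mining and deep learning to analyze consumer reviews, identify consumers' precise requirements, and reflect them in the product planning process. The dataset underlying this paper is built by web-crawling consumer reviews from a portal site's shopping service and from open-market sites, then applying natural language processing. For sentiment analysis, we implement a model that combines a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory), two deep learning techniques. As a deep-learning-based product planning process, this is expected to have positive effects such as reflecting consumer requirements, improving cost efficiency, and shortening product planning time.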

Understanding the Categories and Characteristics of Depressive Moods in Chatbot Data (챗봇 데이터에 나타난 우울 담론의 범주와 특성의 이해)

  • Chin, HyoJin;Jung, Chani;Baek, Gumhee;Cha, Chiyoung;Choi, Jeonghoi;Cha, Meeyoung
    • KIPS Transactions on Software and Data Engineering / v.11 no.9 / pp.381-390 / 2022
  • Influenced by a culture that prefers non-face-to-face activity during the COVID-19 pandemic, chatbot usage is accelerating. Chatbots have been used for various purposes, not only for customer service in businesses and social conversations for fun but also for mental health. Chatbots are a platform where users can easily talk about their depressed moods because anonymity is guaranteed. However, most relevant research has been on social media data, especially Twitter data, and few studies have analyzed data from commercially used chatbots. In this study, we identified the characteristics of depressive discourse in user-chatbot interaction data by analyzing chats that include the word 'depress,' using a topic modeling algorithm and a text-mining technique. Moreover, we compared these characteristics with those of depressive moods in Twitter data. Finally, we draw several design guidelines and suggest avenues for future research based on the study findings.

A Survey on the Latest Research Trends in Retrieval-Augmented Generation (검색 증강 생성(RAG) 기술의 최신 연구 동향에 대한 조사)

  • Eunbin Lee;Ho Bae
    • The Transactions of the Korea Information Processing Society / v.13 no.9 / pp.429-436 / 2024
  • As Large Language Models (LLMs) continue to advance, effectively harnessing their potential has become increasingly important. LLMs, trained on vast datasets, are capable of generating text across a wide range of topics, making them useful in applications such as content creation, machine translation, and chatbots. However, they often face challenges in generalization due to gaps in specific or specialized knowledge, and updating these models with the latest information post-training remains a significant hurdle. To address these issues, Retrieval-Augmented Generation (RAG) models have been introduced. These models enhance response generation by retrieving information from continuously updated external databases, thereby reducing the hallucination phenomenon often seen in LLMs while improving efficiency and accuracy. This paper presents the foundational architecture of RAG, reviews recent research trends aimed at enhancing the retrieval capabilities of LLMs through RAG, and discusses evaluation techniques. Additionally, it explores performance optimization and real-world applications of RAG in various industries. Through this analysis, the paper aims to propose future research directions for the continued development of RAG models.
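
The retrieve-then-generate flow described in this abstract can be sketched in a few lines. Everything here is an assumption for illustration: the documents are invented, the token-overlap scorer stands in for whatever sparse or dense retriever a real RAG system would use, and the final call to a language model is left out because the abstract does not name one.

```python
import re

# Stand-in for a continuously updated external database (invented content).
documents = [
    "The library was updated to version 2.0 in March.",
    "Cats sleep for most of the day.",
    "Version 2.0 adds a streaming API to the library.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=2):
    """Rank documents by token overlap with the query; ties keep input order."""
    q = tokenize(query)
    ranked = sorted(
        range(len(docs)),
        key=lambda i: (-len(q & tokenize(docs[i])), i),
    )
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, passages):
    """Prepend the retrieved passages so a generator can ground its answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}"

query = "What does version 2.0 of the library add?"
passages = retrieve(query, documents)
prompt = build_prompt(query, passages)
print(prompt)  # this prompt would then be passed to an LLM of choice
```

Because the knowledge lives in `documents` rather than in model weights, updating the database changes what the generator sees without retraining. This is the property the abstract credits with reducing hallucination.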

Exploratory Research on Automating the Analysis of Scientific Argumentation Using Machine Learning (머신 러닝을 활용한 과학 논변 구성 요소 코딩 자동화 가능성 탐색 연구)

  • Lee, Gyeong-Geon;Ha, Heesoo;Hong, Hun-Gi;Kim, Heui-Baik
    • Journal of The Korean Association For Science Education / v.38 no.2 / pp.219-234 / 2018
  • In this study, we explored the possibility of automating the analysis of elements of scientific argument in the context of a Korean classroom. To gather training data, we collected 990 sentences from science education journals that illustrate the results of coding elements of argumentation according to Toulmin's argumentation structure framework. We extracted 483 sentences as a test data set from transcriptions of students' discourse in scientific argumentation activities. The words and morphemes of each argument were analyzed using the Python 'KoNLPy' package and the 'Kkma' module for Korean Natural Language Processing. After constructing the 'argument-morpheme:class' matrix for the 1,473 sentences, five machine learning techniques were applied to generate predictive models relating each sentence to the element of argument to which it corresponded. The accuracy of the predictive models was investigated by comparing their output with the researchers' pre-coding and confirming the degree of agreement. The predictive model generated by the k-nearest neighbor algorithm (KNN) demonstrated the highest degree of agreement [54.04% ($\kappa=0.22$)] when machine learning was performed on the morphemes of each sentence. The KNN model exhibited higher agreement [55.07% ($\kappa=0.24$)] when the coding results of the previous sentence were added to the prediction process. The results thus indicate the importance of considering the context of discourse by reflecting the codes of previous sentences in the analysis. The results are significant in that they show the possibility of automating the analysis of students' argumentation activities in Korean by applying machine learning.
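
The abstract reports both a raw agreement percentage and Cohen's kappa. The sketch below shows how the two are related, using the standard kappa formula and a made-up set of six codes. The labels and values are illustrative only and are not the study's data.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which the two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for six sentences (Toulmin-style labels).
researcher = ["claim", "claim", "data", "data", "warrant", "claim"]
model = ["claim", "data", "data", "data", "claim", "claim"]

agreement = percent_agreement(researcher, model)
kappa = cohens_kappa(researcher, model)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```

Here four of six codes match (about 67%), yet kappa is only about 0.43 because a good share of that agreement would be expected by chance. The same correction explains how 54.04% agreement can correspond to $\kappa=0.22$.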

Stock prediction using combination of BERT sentiment Analysis and Macro economy index

  • Jang, Euna;Choi, HoeRyeon;Lee, HongChul
    • Journal of the Korea Society of Computer and Information / v.25 no.5 / pp.47-56 / 2020
  • The stock index is used not only as an economic indicator for a country but also as an indicator for investment judgment, which is why research into predicting the stock index is ongoing. Predicting the stock price index involves technical, basic, and psychological factors, and complex factors must be considered for prediction accuracy. It is therefore necessary to study models that predict the stock price index by selecting and reflecting the technical and auxiliary factors that affect stock price fluctuations. Most existing studies either forecast using news information or macroeconomic indicators that create market fluctuations, or reflect only a few combinations of indicators. In this paper, we propose an effective combination of news sentiment analysis and various macroeconomic indicators to predict the US Dow Jones Index. After crawling more than 93,000 business news articles from the New York Times over a two-year period, we combined the sentiment results, analyzed using the latest natural language processing techniques BERT and NLTK, with five macroeconomic indicators, gold prices, oil prices, and five foreign exchange rates affecting the US economy, and applied the combination to the LSTM prediction algorithm, which is known to be the most suitable for combining numeric and text information. After experimenting with various combinations, the combination of DJI, NLTK, BERT, OIL, GOLD, and EURUSD yielded the smallest MSE value in DJI index prediction.

Understanding Sexual Identity-related Concerns through the Analysis of Questions on a Social Q&A Site (소셜 Q&A 사이트의 질문 분석을 통한 청소년의 성 정체성(sexual identity) 고민에 대한 이해)

  • Zhu, Yongjun;Nam, Seojin;Yi, Dajeong;Yi, Yong Jeong
    • Journal of Korean Library and Information Science Society / v.51 no.4 / pp.101-119 / 2020
  • The study aims to understand major topics and concerns of gender identity-related questions expressed by the users of the NAVER social Q&A site. To achieve this goal, we analyzed 2,120 questions created from 2010 to 2018 using natural language- and information retrieval-based methods. Results indicated that the major topics discussed by the users include interpersonal relationships, doubts about gender identity, sexual orientation, feelings and relationships, and concerns about gender identity. In addition, users mainly expressed concerns regarding general issues of gender identity; sexual orientation; negative cognition about gender identity; confession, coming-out, homosexuality; future, heterosexual relationships, military enlistment; and causes of gender identity confusion. The present study effectively derives information needs from real-world concerns about sexual identity by employing topic modeling techniques and, by comparing the advantages of exact match and tf-idf-based information retrieval methods, extends the methodology of Library and Information Science. Further, it contributes to the academic maturity of the study of information behavior by observing the information needs and information-seeking behaviors of online community users with specific interests.

Multi-source information integration framework using self-supervised learning-based language model (자기 지도 학습 기반의 언어 모델을 활용한 다출처 정보 통합 프레임워크)

  • Kim, Hanmin;Lee, Jeongbin;Park, Gyudong;Sohn, Mye
    • Journal of Internet Computing and Services / v.22 no.6 / pp.141-150 / 2021
  • With advances in Artificial Intelligence technology, AI-enabled warfare is expected to become the main issue in future warfare. Natural language processing is a core AI technology, and it can significantly reduce the information burden that commanders and staff face in understanding reports, information objects, and intelligence written in natural language. In this paper, we propose a Language model-based Multi-source Information Integration (LAMII) framework to reduce the information overload of commanders and support rapid decision-making. The proposed LAMII framework consists of two key steps: representation learning based on language models in a self-supervised way, and document integration using autoencoders. In the first step, representation learning that can identify the similarity relationship between two heterogeneous sentences is performed using a self-supervised learning technique. In the second step, the learned model is used to find and integrate documents from multiple sources that imply similar contents or topics. At this stage, the autoencoder is used to measure the information redundancy of the sentences in order to remove duplicate sentences. To demonstrate the superiority of the proposed framework, we conducted comparison experiments using language models and the benchmark sets used to evaluate their performance. The experiments demonstrated that the proposed LAMII framework predicts the similarity relationship between heterogeneous sentences more effectively than other language models.