A Bidirectional Korean-Japanese Statistical Machine Translation System by Using MOSES (MOSES를 이용한 한/일 양방향 통계기반 자동 번역 시스템)

  • Lee, Kong-Joo;Lee, Song-Wook;Kim, Jee-Eun
    • Journal of Advanced Marine Engineering and Technology / v.36 no.5 / pp.683-693 / 2012
  • Recently, statistical machine translation (SMT) has received much attention owing to its ease of implementation and maintenance. The goal of our work is to build a bidirectional Korean-Japanese SMT system using the MOSES [1] system. We use a sentence-aligned Korean-Japanese bilingual corpus to train the translation model and a large raw corpus in each language to train each language model. The proposed system shows results comparable to those of a rule-based machine translation system. Most errors are caused by noise introduced at each processing stage.

The method of using database technology to process rules of Rule-Based System

  • Zheng, Baowei;Yeo, Jeong-Mo
    • Journal of information and communication convergence engineering / v.8 no.1 / pp.89-94 / 2010
  • The most important part of a rule-based system is its knowledge base, which determines the power of the system. A key aspect of this knowledge is how the various kinds of rules are described. Rule-Based Systems (RBS) have been used in many fields where management systems need to reflect changes in business rules quickly. So far, developing a Rule-Based System has required building a rule engine in a general-purpose language. This development method has three disadvantages. First, when the system must process a large amount of data, processing becomes unacceptably slow. Second, the current system cannot be changed quickly enough to adapt to changes in business rules. Third, large data volumes make the rule engine very complex. Therefore, in this paper, we propose two methods for raising the efficiency of a Rule-Based System. The first uses relational database technology to process the rules of the Rule-Based System; the second is an algorithm that compresses the rows of the rule table according to the Quine-McCluskey method. Because existing rule expression languages still have many problems, we also introduce a new expression language, the Rule-Base Data Model (RBDM).

Recognition of Continuous Spoken Korean Language using HMM and Level Building (은닉 마르코프 모델과 레벨 빌딩을 이용한 한국어 연속 음성 인식)

  • 김경현;김상균;김항준
    • Journal of the Korean Institute of Telematics and Electronics C / v.35C no.11 / pp.63-75 / 1998
  • Because many co-articulation problems occur in continuous spoken Korean, several studies use words as the basic recognition unit. Although the word unit can solve this problem, it requires much memory and makes it difficult to fit input speech to a word list. In this paper, we propose a hidden Markov model (HMM) based recognition model that is an interconnection network of word HMMs representing the syntax of sentences. To match the input sentence to the continuous word list in the network, we use a level building search algorithm. This system represents a large sentence set with relatively little memory and also has good extensibility. Experimental results on an airplane reservation system show that it is a suitable method for a practical recognition system.

A Study on an Automatic Model Creation Tool for Applying Structured UML Models in Software Development (소프트웨어 개발에 구조화된 UML 모델을 적용하기 위한 자동 모델 생성 도구에 관한 연구)

  • Seungmo Jung;Woojin Lee
    • The Transactions of the Korea Information Processing Society / v.13 no.12 / pp.683-690 / 2024
  • Recently, large-scale software development has been using the highly readable Unified Modeling Language (UML) development method. The use of standardized UML models in software development improves software quality by resolving unclear communication. However, existing software development applies a code-centric method rather than a model-centric one. As a result, when model-centric software is developed from existing software, UML models must be handled manually, which increases model creation time. In addition, the time required to create the model increases further depending on the developer's understanding of the model, the ability to use modeling tools, and the complexity of the software structure. The increase in model creation time in turn increases the overall software development time. Therefore, this paper proposes an automatic model creation tool for structurally applying UML models to the development of naval combat system software. The Automatic Model Creation Tool provides features that automatically generate the model structures and UML models needed for modeling tasks. The proposed method applies UML models structurally through its automation functions and reduces model creation time.

A Knowledge Graph-based Chatbot to Prevent the Leakage of LLM User's Sensitive Information (LLM 사용자의 민감정보 유출 방지를 위한 지식그래프 기반 챗봇)

  • Keedong Yoo
    • Knowledge Management Research / v.25 no.2 / pp.1-18 / 2024
  • With the increasing demand for and utilization of large language models (LLMs), the risk of user sensitive information being inputted and leaked during the use of LLMs also escalates. Typically recognized as a tool for mitigating the hallucination issues of LLMs, knowledge graphs, constructed independently from LLMs, can store and manage sensitive user information separately, thereby minimizing the potential for data breaches. This study, therefore, presents a knowledge graph-based chatbot that transforms user-inputted natural language questions into queries appropriate for the knowledge graph using LLMs, subsequently executing these queries and extracting the results. Furthermore, to evaluate the functional validity of the developed knowledge graph-based chatbot, performance tests are conducted to assess the comprehension and adaptability to existing knowledge graphs, the capability to create new entity classes, and the accessibility of LLMs to the knowledge graph content.

On the Development of a Large-Vocabulary Continuous Speech Recognition System for the Korean Language (대용량 한국어 연속음성인식 시스템 개발)

  • Choi, In-Jeong;Kwon, Oh-Wook;Park, Jong-Ryeal;Park, Yong-Kyu;Kim, Do-Yeong;Jeong, Ho-Young;Un, Chong-Kwan
    • The Journal of the Acoustical Society of Korea / v.14 no.5 / pp.44-50 / 1995
  • This paper describes a large-vocabulary continuous speech recognition system using continuous hidden Markov models for the Korean language. To improve the performance of the system, we study the selection of speech modeling units, inter-word modeling, the search algorithm, and grammars. We use triphones as basic speech modeling units; generalized triphones and function word-dependent phones are used to improve the trainability of speech units and to reduce errors in function words. Silence between words is optionally inserted by using a silence model and a null transition. A word pair grammar and a bigram model based on word classes are used. We also implement a search algorithm to find N-best candidate sentences. A postprocessor reorders the N-best sentences using a word triple grammar, selects the most likely sentence as the final recognition result, and finally corrects trivial errors related to postpositions. In recognition tests using a 3,000-word continuous speech database, the system attained 93.1% word recognition accuracy and 73.8% sentence recognition accuracy using the word triple grammar in postprocessing.

LLM-based Question Generation Learning System for Improve Users' Literacy Skills (사용자의 문해력 향상을 위한 LLM기반 문제 생성 시스템)

  • Ji-Sung Park;Seung-Min Park
    • The Journal of the Korea institute of electronic communication sciences / v.19 no.6 / pp.1243-1248 / 2024
  • Due to the recent development of video media and the popularity of short and stimulating content, the frequency of exposure to text media such as books and newspapers has decreased. As a result, literacy skills among consumers in their teens and twenties have significantly declined, leading to various social problems. In this study, we propose a solution by developing a problem generation system based on a Large Language Model (LLM) aimed at improving users' literacy skills. The system allows users to select a question type (reading, speaking, listening) and input part of a desired text (e.g., news articles, literary/non-literary passages, research papers), based on which it automatically generates the selected type of question. Additionally, when the user inputs an answer, the system generates feedback on the correctness of the response along with a detailed explanation. The system's high accessibility and personalized problem generation distinguish it from existing literacy education methods, and it is expected to present a new direction for literacy learning.

Generative LLM-based Automatic Classification and Annotation of the Weather environment in Image datasets (생성형 대형 언어 모델(LLM) 활용 영상 데이터의 날씨 환경 자동 인식 및 분류 방법)

  • Hyeongjin Ju;Hanbin Song;Shiho Kim
    • Journal of Platform Technology / v.12 no.5 / pp.71-78 / 2024
  • We proposed a method for generating image-text pair data of image data containing climatic conditions by utilizing a generative large language model (LLM). We collected images representing various severe weather conditions and implemented a method to describe the climatic conditions of the images with text. The proposed technique supports accurate classification of the severe weather conditions necessary to determine autonomous driving conditions by analyzing the frequency of severe weather and its impact on autonomous vehicles. This demonstrates that the quality of weather data can be improved, and the performance of weather prediction systems can be enhanced, significantly increasing the safety and reliability of autonomous driving technology. Experimental results showed that the proposed method achieved the highest performance across three quantitative metrics: Precision, Recall, and F1-score. The generation of image-text pair data using LLM is expected to greatly improve the quality of weather data and the safety of autonomous vehicles, playing a crucial role in the advancement of autonomous driving technology.

Building robust Korean speech recognition model by fine-tuning large pretrained model (대형 사전훈련 모델의 파인튜닝을 통한 강건한 한국어 음성인식 모델 구축)

  • Changhan Oh;Cheongbin Kim;Kiyoung Park
    • Phonetics and Speech Sciences / v.15 no.3 / pp.75-82 / 2023
  • Automatic speech recognition (ASR) has been revolutionized by deep learning-based approaches, among which self-supervised learning methods have proven to be particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system, on the Korean language. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it faces challenges in recognizing languages such as Korean, which was not a major language in its training data. We address this issue by fine-tuning the Whisper model with an additional dataset comprising about 1,000 hours of Korean speech. We also compare its performance against a Transformer model that was trained from scratch using the same dataset. Our results indicate that fine-tuning the Whisper model significantly improved its Korean speech recognition capabilities in terms of character error rate (CER). Specifically, the performance improved with increasing model size. However, the Whisper model's performance on English deteriorated after fine-tuning, emphasizing the need for further research to develop robust multilingual models. Our study demonstrates the potential of utilizing a fine-tuned Whisper model for Korean ASR applications. Future work will focus on multilingual recognition and optimization for real-time inference.
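
The character error rate (CER) named above is conventionally the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below assumes that characters are Unicode code points (so a precomposed Hangul syllable counts as one character) and that whitespace is kept. The abstract does not state the paper's exact text normalization, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, `cer("안녕하세요", "안녕하세오")` is 0.2, because one of the five reference characters is substituted. CER can exceed 1.0 when the hypothesis is much longer than the reference.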

Deletion-Based Sentence Compression Using Sentence Scoring Reflecting Linguistic Information (언어 정보가 반영된 문장 점수를 활용하는 삭제 기반 문장 압축)

  • Lee, Jun-Beom;Kim, So-Eon;Park, Seong-Bae
    • KIPS Transactions on Software and Data Engineering / v.11 no.3 / pp.125-132 / 2022
  • Sentence compression is a natural language processing task that generates concise sentences that preserve the important meaning of the original sentence. For grammatically appropriate sentence compression, early studies utilized human-defined linguistic rules. Furthermore, as sequence-to-sequence models perform well on various natural language processing tasks, such as machine translation, there have been studies that utilize them for sentence compression. However, in the linguistic rule-based studies all rules have to be defined by humans, and the sequence-to-sequence model based studies require a large amount of parallel data for model training. To address these challenges, Deleter, a sentence compression model that leverages the pre-trained language model BERT, was proposed. Because Deleter uses a perplexity-based score computed with BERT to compress sentences, neither linguistic rules nor a parallel dataset is required for sentence compression. However, because Deleter compresses sentences considering only perplexity, it does not reflect the linguistic information of the words in the sentences. Furthermore, since the dataset used for pre-training BERT is far from compressed sentences, this can lead to incorrect sentence compression. To address these problems, this paper proposes a method to quantify the importance of linguistic information and reflect it in perplexity-based sentence scoring. Furthermore, by fine-tuning BERT with a corpus of news articles, which often contain proper nouns and often omit unnecessary modifiers, we allow BERT to measure the perplexity appropriate for sentence compression. Evaluations on English and Korean datasets confirm that the sentence compression performance of sentence-scoring based models can be improved by utilizing the proposed method.