• Title/Summary/Keyword: Korean text classification

Fast offline transformer-based end-to-end automatic speech recognition for real-world applications

  • Oh, Yoo Rhee;Park, Kiyoung;Park, Jeon Gue
    • ETRI Journal / v.44 no.3 / pp.476-490 / 2022
  • With recent advances in technology, automatic speech recognition (ASR) has been widely used in real-world applications. Converting large amounts of speech into text accurately and efficiently with limited resources has become more vital than ever. In this study, we propose a method to rapidly recognize a large speech database via a transformer-based end-to-end model. Transformers have improved the state-of-the-art performance in many fields. However, they are not easy to use for long sequences. We therefore propose and test various techniques to accelerate the recognition of real-world speech, including decoding via multiple-utterance-batched beam search, detecting the end of speech based on connectionist temporal classification (CTC), restricting the CTC-prefix score, and splitting long speech into short segments. Experiments are conducted with the Librispeech dataset and real-world Korean ASR tasks to verify the proposed methods. In the experiments, the proposed system converts 8 h of speech recorded at real-world meetings into text in less than 3 min with a character error rate of 10.73%, which is a 27.1% relative reduction compared with conventional systems.
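
The character error rate and the relative reduction quoted in this abstract can be illustrated with a minimal sketch. It assumes the usual definition of CER as character-level edit distance divided by reference length, and it counts whitespace as characters; the abstract does not say how the authors normalize text. The function names are hypothetical, and the baseline value is back-computed from the two reported percentages purely for illustration, since the abstract does not report it.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference character
                            curr[j - 1] + 1,      # insertion of a hypothesis character
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative error reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


# One substituted character out of six reference characters.
cer = character_error_rate("speech", "speach")

# Baseline CER implied by "10.73%, 27.1% relatively lower" (assumption:
# back-computed here, not a figure stated in the abstract).
implied_baseline = 10.73 / (1 - 0.271)
```

Under these assumptions the implied conventional-system CER is roughly 14.7%.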

Transformation-based Learning for Korean Comparative Sentence Classification (한국어 비교 문장 유형 분류를 위한 변환 기반 학습 기법)

  • Yang, Seon;Ko, Young-Joong
    • Journal of KIISE:Software and Applications / v.37 no.2 / pp.155-160 / 2010
  • This paper proposes a method for Korean comparative sentence classification, which is a part of comparison mining. Comparison mining, an area of text mining, analyzes comparative relations in large volumes of text documents. It requires a three-step process: 1) identifying comparative sentences in the text documents, 2) classifying those sentences into several classes, and 3) analyzing the comparative relations for each comparative class. This paper addresses the second task. We use transformation-based learning (TBL), a well-known learning method in natural language processing. In our experiment, we classify comparative sentences into seven classes using TBL and achieve an accuracy of 80.01%.
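
The general idea of transformation-based learning can be sketched as follows: start from a default label for every sentence, then repeatedly learn the single rule that corrects the most remaining errors and apply it. This is a toy illustration only. The rule template (one trigger word that rewrites one label into another), the English example sentences, and the four class names are assumptions made for the sketch; the abstract does not describe the paper's rule templates or name its seven classes.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, str, str]  # (trigger word, label to rewrite, new label)


def learn_rules(sentences: List[Set[str]], gold: List[str],
                default: str, max_rules: int = 10) -> List[Rule]:
    """Greedily learn an ordered list of label-rewriting rules."""
    labels = [default] * len(sentences)
    classes = sorted(set(gold) | {default})
    vocab = sorted(set().union(*sentences))
    rules: List[Rule] = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for word in vocab:
            for src in classes:
                for dst in classes:
                    if src == dst:
                        continue
                    gain = 0  # errors fixed minus errors introduced
                    for sent, lab, g in zip(sentences, labels, gold):
                        if lab == src and word in sent:
                            gain += (g == dst) - (g == src)
                    if gain > best_gain:
                        best_rule, best_gain = (word, src, dst), gain
        if best_rule is None:  # no rule improves the labeling any further
            break
        word, src, dst = best_rule
        labels = [dst if lab == src and word in sent else lab
                  for sent, lab in zip(sentences, labels)]
        rules.append(best_rule)
    return rules


def classify(sentence: Set[str], rules: List[Rule], default: str) -> str:
    """Apply the learned rules in the order they were learned."""
    label = default
    for word, src, dst in rules:
        if label == src and word in sentence:
            label = dst
    return label


# Hypothetical toy data: each sentence is a set of tokens.
train = [{"x", "faster", "than", "y"}, {"x", "cheaper", "than", "y"},
         {"x", "is", "the", "most", "popular"}, {"x", "same", "as", "y"},
         {"x", "is", "fine"}]
gold = ["comparative", "comparative", "superlative", "equality", "other"]
rules = learn_rules(train, gold, default="other")
```

The ordered rule list is the whole model, which is why TBL output is easy to inspect by hand.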

Classification Techniques for XML Document Using Text Mining (텍스트 마이닝을 이용한 XML 문서 분류 기술)

  • Kim Cheon-Shik;Hong You-Sik
    • Journal of the Korea Society of Computer and Information / v.11 no.2 s.40 / pp.15-23 / 2006
  • Millions of documents are already on the Internet, and new documents are being created all the time. Classifying these documents by the most suitable means is therefore an important problem for document management and querying. However, most users rely on keyword-based document classification. This method does not classify documents efficiently, and it is weak at categorizing documents according to their meaning. Manual classification by a person can be very accurate and is often required. In this paper, we therefore classify documents using a neural network algorithm and the C4.5 algorithm. We used resume data in XML format for the document classification experiment. The results showed excellent potential for document categorization. We therefore expect the approach to be applicable to various document classification problems.

Study on Model Case of Ideal Digitization of Korean Ancient Books (국학고전자료의 디지털화를 위한 모범적인 방안 연구)

  • Lee, Hee-Jae
    • Journal of the Korean Society for Information Management / v.22 no.1 s.55 / pp.105-123 / 2005
  • Above all, this study seeks an ideal method for developing a digital library system for Korean ancient books, both to preserve them safely and to make them readable beyond the limits of time and space. First, the system should offer various access points, such as the traditional oriental four-part classification of the classics, current subject classification, and index keywords. Second, it should be built using MARC or XML, with as many bibliographic descriptive elements as possible. Third, it should provide more accessible annotated bibliographies and indexes to improve users' comprehension. Finally, it should include an original-text database for practical reading, so that the original texts are not damaged. This type of digital library for Korean ancient books can develop toward true international bibliographic control by networking with similar organizations at home and abroad.

Verification of educational goal of reading area in Korean SAT through natural language processing techniques (대학수학능력시험 독서 영역의 교육 목표를 위한 자연어처리 기법을 통한 검증)

  • Lee, Soomin;Kim, Gyeongmin;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.13 no.1 / pp.81-88 / 2022
  • The major educational goal of the reading section, which makes up an important portion of the Korean language subject in the Korean SAT, is to evaluate whether a given text can be fully understood. Therefore, the questions in the exam must be solvable solely from the given text. In this paper, we developed a dataset based on the Korean SAT's reading section in order to evaluate whether a deep learning language model can classify a given question as true or false, which is a binary classification task in NLP. As a result, when the language models relied solely on the passages in the dataset, most of them outperformed the human performance of 59.2% in F1 score; KoELECTRA scored 62.49% in our experiment. We also showed that the structural limits of language models can be eased by adjusting data preprocessing.
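
For reference, the F1 score used to compare the models with human performance can be computed as in the minimal sketch below. It assumes that "true" is treated as the positive class and that F1 is computed for that class only; the abstract does not state the positive class or whether the score is averaged over both classes. The function name and the example labels are hypothetical.

```python
from typing import Sequence


def binary_f1(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 of the positive (True) class: harmonic mean of precision and recall."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
gold_labels = [True, True, True, False, False]
model_output = [True, True, False, True, False]
score = binary_f1(gold_labels, model_output)
```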

An effective approach to generate Wikipedia infobox of movie domain using semi-structured data

  • Bhuiyan, Hanif;Oh, Kyeong-Jin;Hong, Myung-Duk;Jo, Geun-Sik
    • Journal of Internet Computing and Services / v.18 no.3 / pp.49-61 / 2017
  • Wikipedia infoboxes have emerged as an important structured information source on the web. Composing an infobox for an article requires a considerable amount of manual effort from the author. Because of this manual involvement, infoboxes suffer from inconsistency, data heterogeneity, incompleteness, schema drift, and similar problems. Prior work attempted to solve these problems by generating infoboxes automatically from the corresponding article text. However, many Wikipedia articles do not have enough text content to generate an infobox. In this paper, we present an automated approach to generating infoboxes for the movie domain of Wikipedia by extracting information from several web sources instead of relying on article text only. The proposed methodology has been developed using semantic relations of article content and semi-structured information available on the web. It processes the article text through several classification steps to identify the template from a large pool of templates. Finally, it extracts the information for the corresponding template attributes from the web and thus generates the infobox. A comprehensive experimental evaluation demonstrated that the proposed scheme is an effective and efficient approach to generating Wikipedia infoboxes.

Privacy-Preserving Language Model Fine-Tuning Using Offsite Tuning (프라이버시 보호를 위한 오프사이트 튜닝 기반 언어모델 미세 조정 방법론)

  • Jinmyung Jeong;Namgyu Kim
    • Journal of Intelligence and Information Systems / v.29 no.4 / pp.165-184 / 2023
  • Recently, deep learning analysis of unstructured text data using language models such as Google's BERT and OpenAI's GPT has shown remarkable results in various applications. Most language models learn generalized linguistic information from pre-training data and then update their weights for downstream tasks through a fine-tuning process. However, concerns have been raised that privacy may be violated when these language models are used: data privacy may be violated when the data owner provides large amounts of data to the model owner to fine-tune the language model. Conversely, when the model owner discloses the entire model to the data owner, the structure and weights of the model are disclosed, which may violate the privacy of the model. The concept of offsite tuning has recently been proposed to fine-tune language models while protecting privacy in such situations. However, that study does not provide a concrete way to apply the proposed methodology to text classification models. In this study, we propose a concrete method for applying offsite tuning with an additional classifier to protect the privacy of both the model and the data when performing multi-class classification fine-tuning on Korean documents. To evaluate the proposed methodology, we conducted experiments on about 200,000 Korean documents from five major fields (ICT, electrical, electronic, mechanical, and medical) provided by AIHub, and found that the proposed plug-in model outperforms the zero-shot model and the offsite model in terms of classification accuracy.

Self-Supervised Document Representation Method

  • Yun, Yeoil;Kim, Namgyu
    • Journal of the Korea Society of Computer and Information / v.25 no.5 / pp.187-197 / 2020
  • Recently, various text embedding methods using deep learning algorithms have been proposed. In particular, pre-trained language models, which are trained on a tremendous amount of text data, are mainly used to embed new text data. However, traditional pre-trained language models have difficulty capturing the unique context of new text data when the text has too many tokens. In this paper, we propose a self-supervised learning-based fine-tuning method for pre-trained language models to infer vectors of long texts. We applied our method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirmed that the vectors generated by the proposed model express the inherent characteristics of the documents more accurately than the vectors generated by the traditional models.

A Classification and Selection Method of Emotion Based on Classifying Emotion Terms by Users (사용자의 정서 단어 분류에 기반한 정서 분류와 선택 방법)

  • Rhee, Shin-Young;Ham, Jun-Seok;Ko, Il-Ju
    • Science of Emotion and Sensibility / v.15 no.1 / pp.97-104 / 2012
  • Recently, as users have produced large amounts of text data, opinion mining, which analyzes users' information and opinions, has become a hot issue. Within opinion mining, sentiment analysis in particular studies emotions such as positive, negative, happiness, and sadness by analyzing personal opinions or emotions about commercial products, social issues, and politicians' opinions. For sentiment analysis, previous studies used a mapping method that sets up a distribution of emotions over two dimensions, valence and arousal. However, those studies set up the distribution of emotions arbitrarily. To solve this problem, we constructed a distribution of 12 emotions by conducting a survey using a list of Korean emotion words. In addition, because certain emotional states in the two-dimensional space overlap multiple emotions, we proposed a selection method based on the roulette wheel method with selection probabilities. The proposed method classifies a text into an emotion by extracting emotion terms from the text.
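
Roulette wheel selection, which the abstract uses to pick one emotion when a point in the valence-arousal space overlaps several, can be sketched as below. Each candidate gets a slice of the wheel proportional to its selection probability, and a random spin decides the outcome. The emotion names, the probabilities, and the function name are hypothetical; the abstract does not give the survey-derived values or the 12 emotions.

```python
import random
from typing import Dict, Optional


def roulette_select(probabilities: Dict[str, float],
                    rng: Optional[random.Random] = None) -> str:
    """Pick one key with chance proportional to its non-negative weight."""
    if any(w < 0 for w in probabilities.values()):
        raise ValueError("weights must be non-negative")
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("at least one emotion needs a positive weight")
    rng = rng or random.Random()
    spin = rng.random() * total  # position of the pointer on the wheel
    cumulative = 0.0
    chosen = None
    for emotion, weight in probabilities.items():
        if weight <= 0:
            continue  # zero-width slices can never be hit
        cumulative += weight
        chosen = emotion
        if spin < cumulative:
            break
    return chosen  # last positive slice also covers floating-point round-off


# Hypothetical overlap region: three candidate emotions for one text.
overlap = {"joy": 0.7, "surprise": 0.3, "anger": 0.0}
rng = random.Random(0)  # fixed seed so the example is repeatable
draws = [roulette_select(overlap, rng) for _ in range(10_000)]
```

Over many draws the selection frequencies approach the given probabilities, so an emotion with a larger share of the overlap is chosen more often without the smaller one being excluded.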

Organization and use of theses collections in university libraries (학위논문의 정리와 이용)

  • 최달현;변우열
    • Journal of Korean Library and Information Science Society / v.12 / pp.161-198 / 1985
  • This paper is a study of the organization and use of theses collections in university libraries of Korea. A questionnaire consisting of 31 questions on 6 items was sent to 44 university libraries, of which 40 libraries responded. Results of the study can be summarized as follows: 1. Figures concerning registration of theses can be tabulated as follows. 2. In differentiating oriental and occidental theses, 20 libraries (50%) rely on the text language. 3. Thirty-four libraries (85%) classify the theses, and 27 (80%) of them use the same tables as the book classification schedules. For classification level, 17 libraries (48.6%) classify them in section numbers whereas 13 (37.1%) classify them in sub-sections. 4. Catalog or index cards of theses are made in 35 libraries (87.5%), of which 20 libraries use the second level of bibliographic description. 5. Roman alphabets in a title are described as such in 27 libraries (67.5%). 6. Most respondents prepare author, title, and classified catalog cards for users. The research reveals that only 8 libraries assign subject headings to the theses. 7. Twenty-three libraries (63.9%) keep theses catalogs separate from their book catalogs. 8. The most helpful bibliographic elements in an entry for users are reported to be author, title, date, and notes. In general, theses collections differ from book materials in many respects. Therefore, it is desirable to process the former differently from the latter. Firstly, it would be more convenient to register theses in a register separate from the book register. Secondly, minute classification of theses would be necessary for their users. Thirdly, text language is the common basis for distinguishing oriental materials from occidental ones. Fourthly, a simple catalog would be good enough for using theses collections, because the most helpful elements in an entry are limited to author, title, date, and notes. Fifthly, it is strongly recommended to transcribe all the roman alphabets in the titles into Korean alphabets. Sixthly, the research revealed that our libraries need to develop subject heading work, which is far behind other library work.
