• Title/Summary/Keyword: Multilingual

KorQuAD 2.0: Korean QA Dataset for Web Document Machine Comprehension (KorQuAD 2.0: 웹문서 기계독해를 위한 한국어 질의응답 데이터셋)

  • Kim, Youngmin;Lim, Seungyoung;Lee, Hyunjeong;Park, Soyoon;Kim, Myungji
    • Annual Conference on Human and Language Technology / 2019.10a / pp.97-102 / 2019
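  • KorQuAD 2.0 is a Korean question answering dataset consisting of more than 100,000 question-answer pairs. It differs from KorQuAD 1.0, the existing standard question answering dataset, in three main ways. First, the given passage is an entire Wikipedia page rather than one or two paragraphs. Second, because passages also contain tables and lists, the task requires understanding documents structured with HTML tags. Finally, an answer may be not only a word or phrase but also a long span covering an entire paragraph, table, or list. In experiments with a baseline model built on BERT Multilingual, which Google released as open source, we obtained an F1 score of 46.0%. This is far below the human F1 score of 85.7%, which shows that the dataset poses a challenging task. By releasing this dataset, we aim to extend question answering, which has so far been limited to plain text, to real-world tasks with varied lengths and formats.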

Korean Text to Gloss: Self-Supervised Learning approach

  • Thanh-Vu Dang;Gwang-hyun Yu;Ji-yong Kim;Young-hwan Park;Chil-woo Lee;Jin-Young Kim
    • Smart Media Journal / v.12 no.1 / pp.32-46 / 2023
  • Natural Language Processing (NLP) has grown tremendously in recent years. Bilingual and multilingual translation models in particular have been widely deployed in machine translation and have gained considerable attention from the research community. By contrast, few studies have focused on translating between spoken and sign languages, especially for non-English languages. Prior work on Sign Language Translation (SLT) has shown that a mid-level sign gloss representation enhances translation performance. This study therefore presents a new large-scale Korean sign language dataset, the Museum-Commentary Korean Sign Gloss (MCKSG) dataset, comprising 3828 pairs of Korean sentences and their corresponding sign glosses used in museum-commentary contexts. In addition, we propose a translation framework based on self-supervised learning: the pretext task is text-to-text translation from a Korean sentence to its back-translated versions, after which the pre-trained network is fine-tuned on the MCKSG dataset. Self-supervised learning helps to overcome the shortage of sign language data. In our experiments, the proposed model outperforms a baseline BERT model by 6.22%.

Handwritten Indic Digit Recognition using Deep Hybrid Capsule Network

  • Mohammad Reduanul Haque;Rubaiya Hafiz;Mohammad Zahidul Islam;Mohammad Shorif Uddin
    • International Journal of Computer Science & Network Security / v.24 no.2 / pp.89-94 / 2024
  • The Indian subcontinent is home to a multilingual population, and documents such as job application forms, passports, and number plates contain text written in different languages and scripts. A single document page may contain several different Indic numeral scripts. A generic recognizer capable of recognizing handwritten Indic digits written by diverse writers is therefore needed. Much work has been done on non-Indic numerals, particularly Roman ones, but research on Indic digits is limited. Moreover, most research focuses only on MNIST or on a single dataset, either because of time constraints or because the model is tailored to a specific task. In this work, a hybrid model is proposed to recognize all available Indic handwritten digit images using existing benchmark datasets. The proposed method combines the automatically learnt features of a Capsule Network with the hand-crafted Bag of Feature (BoF) extraction method. Along the way, we (1) analyze the successes and (2) explore whether the method performs well under more difficult conditions, i.e., noise, color, affine transformations, intra-class variation, and natural scenes. Experimental results show that the hybrid method gives better accuracy than the Capsule Network alone.

Current Status of Archiving Activities of Multicultural Service Agencies and Organizations in Dae Gu Metropolitan City (대구지역 다문화 유관 기관의 아카이빙 활동 현황에 관한 연구)

  • Cho, Yong Wan
    • Journal of Korean Library and Information Science Society / v.47 no.2 / pp.125-155 / 2016
  • The aim of this study is to investigate the current status of archiving activities, namely producing, collecting, and managing information resources, at multicultural agencies and organizations in Dae Gu. To this end, 12 agencies and organizations were visited, including the Multicultural Team of Dae Gu, multicultural family support centers, a foreign worker support center, NGOs for immigrants, and public libraries. The results show that these agencies and organizations have worked to produce information resources online and offline, to collect information resources from external bodies, and to manage information resources such as official documents, counseling reports, multicultural books, and artifacts. However, there were problems in archiving these resources. To solve them, first, multicultural agencies and organizations should take greater responsibility for producing, collecting, and managing information resources. Second, public libraries should actively try to collect and organize information resources from these agencies and organizations. Finally, cooperative archiving activities between multicultural agencies and organizations and public libraries are needed.

The Method of the Evaluation of Verbal Lexical-Semantic Network Using the Automatic Word Clustering System (단어클러스터링 시스템을 이용한 어휘의미망의 활용평가 방안)

  • Kim, Hae-Gyung;Song, Mi-Young
    • Korean Journal of Oriental Medicine / v.12 no.3 s.18 / pp.1-15 / 2006
  • In recent years there has been much interest in lexical-semantic networks. However, it is very difficult to evaluate their effectiveness and correctness and to devise methods for applying them to various problem domains. To offer fundamental ideas about how to evaluate and utilize lexical-semantic networks, we developed two automatic word clustering systems, called system A and system B. Both systems were trained on 68,455,856 words. We compared the clustering results of system A with those of system B, which is extended with the lexical-semantic network. System B is extended by reconstructing its feature vectors using elements of the lexical-semantic network of 3,656 '-ha' verbs. The target data is the 'multilingual Word Net-CoreNet'. Comparing the accuracy of the two systems, we found that system B achieved 46.6%, which is higher than system A's 45.3%.

Simulation of Disaster Broadcast Service Using Terrestrial UHD Additional Data (지상파 UHD 부가 데이터를 활용한 재난방송 서비스 시뮬레이션)

  • Kwak, Chunsub;Lee, Man-Kyu;Lee, Hyun-Ji
    • The Journal of the Korea Contents Association / v.20 no.5 / pp.58-68 / 2020
  • In this paper, we simulated a disaster broadcast service using terrestrial UHD additional data. Referring to the ATSC 3.0 standards, we examined nine functions (warning alarm, location-based information, multilingual, re-viewing the title of breaking news, evacuation tips video, CCTV, clip video, wake up, automatic channel switching). To analyze the advantages and problems of the proposed disaster broadcast, expert interviews were conducted after a demonstration and explanation of the terrestrial UHD disaster broadcast. The experts identified the following problems with the technology: difficulty of use, obstruction of viewing, content creation problems, and problems with advertisers. Given the lack of simulation-based research, this study provides insight into the technology by demonstrating it visibly, which helped to clarify its advantages and problems.

A Study on the Multilingual Speech Recognition for On-line International Game (온라인 다국적 게임을 위한 다국어 혼합 음성 인식에 관한 연구)

  • Kim, Suk-Dong;Kang, Heung-Soon;Woo, In-Sung;Shin, Chwa-Cheul;Yoon, Chun-Duk
    • Journal of Korea Game Society / v.8 no.4 / pp.107-114 / 2008
  • Demand for multilingual speech recognition has been growing in the game industry, as has the need for a multilingual system that represents the phonetics of many different languages with a single phonetic model. Research is therefore needed on a multilingual system that can represent speech composed of various languages with a single lexical model. This paper presents basic research toward an integrated system built on a multilingual lexical model, and describes a system that recognizes Korean and English speech using the IPA (International Phonetic Alphabet). We focused on finding an IPA model that fits Korean and English phonemes simultaneously. As a result, we obtained a speech recognition rate of 90.62% for Korean and 91.71% for English.

Design and Implementation of Conversion System Between ISO/IEC 10646 and Multi-Byte Code Set (ISO/IEC 10646과 멀티바이트 코드 세트간의 변환시스템의 설계 및 구현)

  • Kim, Chul
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.4 / pp.319-324 / 2018
  • In this paper, we designed and implemented a code conversion method between ISO/IEC 10646 and multi-byte code sets. The Universal Multiple-Octet Coded Character Set (UCS) provides codes for more than 65,000 characters, a huge increase over ASCII's capacity of 128 characters. It is applicable to the representation, transmission, interchange, processing, storage, input, and presentation of the written forms of the world's languages. It is therefore important to give customers guidance on code conversion methods when their systems are migrated to an environment that uses the UCS code system, or when current code systems, i.e., ASCII PC code and EBCDIC host code, are used together with the UCS. A code conversion utility, including the mapping table between the UCS and the IBM new host code, is presented to explain the code conversion algorithm and its implementation in the system. The programs run successfully in real system environments and can therefore be delivered to customers during migration from the UCS to the current IBM code system and vice versa.
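
The abstract above describes converting between the UCS and legacy code sets through mapping tables. A minimal sketch of the same idea follows, assuming Python's built-in codecs as stand-ins for those tables. The code pages `cp037` (an EBCDIC code page) and `cp949` (a Korean multi-byte PC code page) are illustrative choices and are not the code sets used in the paper, and the `convert` helper is hypothetical.

```python
# Illustrative only: Python's built-in codecs stand in for the paper's
# mapping tables. "cp037" and "cp949" are assumed example code pages,
# not the IBM host code described in the abstract.

def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between code sets by pivoting through Unicode (UCS)."""
    text = data.decode(source)   # source code set -> Unicode code points
    return text.encode(target)   # Unicode code points -> target code set


# Single-byte case: EBCDIC host code to ASCII PC code.
ebcdic = "HELLO".encode("cp037")
ascii_bytes = convert(ebcdic, "cp037", "ascii")

# Multi-byte case: UCS (here as UTF-16, big-endian) to a Korean code page
# and back again.
ucs = "한글".encode("utf-16-be")
multi_byte = convert(ucs, "utf-16-be", "cp949")
round_trip = convert(multi_byte, "cp949", "utf-16-be")

print(ebcdic.hex(), ascii_bytes, len(multi_byte), round_trip == ucs)
```

Pivoting through Unicode means each legacy code set needs only one mapping table, rather than one table per pair of code sets.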

Classification and Evaluation of Service Requirements in Mobile Tourism Application Using Kano Model and AHP

  • Choedon, Tenzin;Lee, Young-Chan
    • The Journal of Information Systems / v.27 no.1 / pp.43-65 / 2018
  • Purpose: The emergence of mobile applications has simplified our lives in various ways. In tourism, mobile applications already provide personalized tourism-related information efficiently and are very effective for booking hotels, flights, etc. However, very few studies have classified the actual service requirements of mobile tourism applications or examined how to improve customer satisfaction with them. The purpose of this study is to implement a practical mobile tourism application. To this end, we classify and categorize the service requirements of mobile tourism applications in Korea using the Kano model and the analytic hierarchy process (AHP). Specifically, we conducted a focus group study to identify the service requirements of mobile tourism applications. Design/methodology/approach: The data for this study were collected from Koreans and foreigners who have experience using mobile tourism applications. Participants needed to be familiar with mobile tourism applications because such users are likely to be more aware of the services these applications offer. We analyzed 147 valid responses using the Kano model and conducted an AHP analysis with five experts in the field of tourism using Expert Choice software. Findings: We identified 17 service quality requirements for mobile tourism applications. The results reveal that service requirements such as a geo-location map, a multilingual option, and compatibility with different operating systems are indispensable services whose absence leads to dissatisfaction. Based on the results of the integrated application of the Kano model and AHP analysis, this study provides specific implications for improving the service quality of mobile tourism applications in Korea.
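
To illustrate the AHP step mentioned in the abstract above, the sketch below derives priority weights from a pairwise comparison matrix using the geometric-mean (row) approximation. This is a common textbook approximation and not necessarily the calculation performed by Expert Choice. The three criteria are taken from the abstract, but the comparison values are invented for illustration and the `ahp_weights` helper is hypothetical.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights with the geometric-mean method.

    matrix[i][j] states how much more important criterion i is than
    criterion j, so matrix[j][i] is expected to be 1 / matrix[i][j].
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical expert judgments (made up for this example).
criteria = ["Geo-location map", "Multilingual option", "OS compatibility"]
comparisons = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

weights = ahp_weights(comparisons)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
```

Because the example matrix is perfectly consistent, the weights come out in the ratio 4:2:1. Real expert judgments are rarely this consistent, which is why AHP tools also report a consistency ratio.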

Generating a Korean Sentiment Lexicon Through Sentiment Score Propagation (감정점수의 전파를 통한 한국어 감정사전 생성)

  • Park, Ho-Min;Kim, Chang-Hyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering / v.9 no.2 / pp.53-60 / 2020
  • Sentiment analysis is the automated process of understanding attitudes and opinions about a given topic from written or spoken text. One approach to sentiment analysis is dictionary-based, in which a sentiment dictionary plays a very important role. In this paper, we propose a method to automatically generate a Korean sentiment lexicon from the well-known English sentiment lexicon VADER (Valence Aware Dictionary and sEntiment Reasoner). The proposed method consists of three steps. The first step is to build a Korean-English bilingual lexicon using a Korean-English parallel corpus. The bilingual lexicon is a set of pairs of VADER sentiment words and Korean morphemes, the latter being candidate Korean sentiment words. The second step is to construct a bilingual word graph from the bilingual lexicon. The third step is to run the label propagation algorithm over the bilingual graph. Finally, a new Korean sentiment lexicon is generated by repeatedly applying the propagation algorithm until the values of all vertices converge. Empirically, the dictionary-based sentiment classifier using the Korean sentiment lexicon outperforms machine learning-based approaches on the KMU sentiment corpus and the Naver sentiment corpus. In the future, we will apply the proposed approach to generate multilingual sentiment lexica.
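
The graph construction and propagation steps described above can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it assumes English seed words keep fixed scores while each Korean morpheme repeatedly takes the weighted average of its neighbors until the largest change falls below a tolerance. The edge weights, the seed scores, and the `propagate` helper are all invented for the example.

```python
def propagate(edges, seeds, tol=1e-6, max_iter=1000):
    """Spread seed sentiment scores over an undirected weighted graph.

    edges: iterable of (node_a, node_b, weight) tuples
    seeds: dict of node -> fixed score (e.g. English lexicon entries)
    """
    neighbors = {}
    for a, b, weight in edges:
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))

    # Unlabeled nodes start at a neutral score of 0.0.
    scores = {node: seeds.get(node, 0.0) for node in neighbors}

    for _ in range(max_iter):
        largest_change = 0.0
        updated = {}
        for node, linked in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]  # seed scores stay clamped
                continue
            total_weight = sum(w for _, w in linked)
            value = sum(scores[n] * w for n, w in linked) / total_weight
            largest_change = max(largest_change, abs(value - scores[node]))
            updated[node] = value
        scores = updated
        if largest_change < tol:  # all vertices have converged
            break
    return scores


# Hypothetical bilingual lexicon pairs and seed scores.
edges = [("good", "좋다", 1.0), ("great", "좋다", 1.0), ("bad", "나쁘다", 1.0)]
seeds = {"good": 1.9, "great": 3.1, "bad": -2.5}

scores = propagate(edges, seeds)
korean_lexicon = {w: s for w, s in scores.items() if w not in seeds}
print(korean_lexicon)
```

In this toy graph each Korean morpheme is linked only to seed words, so the scores settle after two passes. With edges between unlabeled nodes, scores would spread over several iterations before converging.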