• 제목/요약/키워드: corpus-based task

검색결과 37건 처리시간 0.023초

A New Speaker Adaptation Technique using Maximum Model Distance

  • Lee, Man-Hyung;Hong, Suh-Il
    • 제어로봇시스템학회:학술대회논문집
    • /
    • 제어로봇시스템학회 2001년도 ICCAS
    • /
    • pp.99.1-99
    • /
    • 2001
  • This paper presented an adaptation approach based on maximum model distance (MMD) method. This method shares the same framework as they are used for training speech recognizers with abundant training data. The MMD method could adapt to all the models with or without adaptation data. If large amount of adaptation data is available, these methods could gradually approximate the speaker-dependent ones. The approach is evaluated through the phoneme recognition task on the TIMIT corpus. On the speaker adaptation experiments, up to 65.55% phoneme error reduction is achieved. The MMD could reduce phoneme error by 16.91% even when only one adaptation utterance is used.

  • PDF

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제15권11호
    • /
    • pp.3991-4010
    • /
    • 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q & A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when the word boundary features and local context information are ignored when the general method embeds sentences. Hence, these factors make the task challenging. To solve the above problems and tackle the Question Similarity Measurement task in this work, a corpus on Chinese crop diseases and insect pests(CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, namely, the AgrCQS. This model is based on mixed information extraction. Specifically, the hybrid embedding layer can enrich character information and improve the recognition ability of the model on the word boundary. The multi-scale local information can be extracted by multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism can enhance the fusion ability of the model on global information. In this research, the performance of the AgrCQS on the CCDIP is verified, and three benchmark datasets, namely, AFQMC, LCQMC, and BQ, are used. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than that of baseline systems without using any external knowledge. Additionally, the proposed method module can be extracted separately and applied to other models, thus providing reference for related research.

Feature Extraction Based on Speech Attractors in the Reconstructed Phase Space for Automatic Speech Recognition Systems

  • Shekofteh, Yasser;Almasganj, Farshad
    • ETRI Journal
    • /
    • 제35권1호
    • /
    • pp.100-108
    • /
    • 2013
  • In this paper, a feature extraction (FE) method is proposed that is comparable to the traditional FE methods used in automatic speech recognition systems. Unlike the conventional spectral-based FE methods, the proposed method evaluates the similarities between an embedded speech signal and a set of predefined speech attractor models in the reconstructed phase space (RPS) domain. In the first step, a set of Gaussian mixture models is trained to represent the speech attractors in the RPS. Next, for a new input speech frame, a posterior-probability-based feature vector is evaluated, which represents the similarity between the embedded frame and the learned speech attractors. We conduct experiments for a speech recognition task utilizing a toolkit based on hidden Markov models, over FARSDAT, a well-known Persian speech corpus. Through the proposed FE method, we gain 3.11% absolute phoneme error rate improvement in comparison to the baseline system, which exploits the mel-frequency cepstral coefficient FE method.

The Sequence Labeling Approach for Text Alignment of Plagiarism Detection

  • Kong, Leilei;Han, Zhongyuan;Qi, Haoliang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제13권9호
    • /
    • pp.4814-4832
    • /
    • 2019
  • Plagiarism detection is increasingly exploiting text alignment. Text alignment involves extracting the plagiarism passages in a pair of the suspicious document and its source document. The heuristics have achieved excellent performance in text alignment. However, the further improvements of the heuristic methods mainly depends more on the experiences of experts, which makes the heuristics lack of the abilities for continuous improvements. To address this problem, machine learning maybe a proper way. Considering the position relations and the context of text segments pairs, we formalize the text alignment task as a problem of sequence labeling, improving the current methods at the model level. Especially, this paper proposes to use the probabilistic graphical model to tag the observed sequence of pairs of text segments. Hence we present the sequence labeling approach for text alignment in plagiarism detection based on Conditional Random Fields. The proposed approach is evaluated on the PAN@CLEF 2012 artificial high obfuscation plagiarism corpus and the simulated paraphrase plagiarism corpus, and compared with the methods achieved the best performance in PAN@CLEF 2012, 2013 and 2014. Experimental results demonstrate that the proposed approach significantly outperforms the state of the art methods.

다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석 (Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning)

  • 김경민;한승규;오동석;임희석
    • 디지털융복합연구
    • /
    • 제17권12호
    • /
    • pp.243-248
    • /
    • 2019
  • 다중작업학습(Multi-Task Learning, MTL) 기법은 하나의 신경망을 통해 다양한 작업을 동시에 수행하고 각 작업 간에 상호적으로 영향을 미치면서 학습하는 방식을 말한다. 본 연구에서는 전통문화 말뭉치를 직접 구축 및 학습데이터로 활용하여 다중작업학습 기법을 적용한 개체명 인식 모델에 대해 성능 비교 분석을 진행한다. 학습 과정에서 각각의 품사 태깅(Part-of-Speech tagging, POS-tagging) 과 개체명 인식(Named Entity Recognition, NER) 학습 파라미터에 대해 Bi-LSTM 계층을 통과시킨 후 각각의 Bi-LSTM을 계층을 통해 최종적으로 두 loss의 joint loss를 구한다. 결과적으로, Bi-LSTM 모델을 활용하여 단일 Bi-LSTM 모델보다 MTL 기법을 적용한 모델에서 1.1%~4.6%의 성능 향상이 있음을 보인다.

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa;Nain, Neeta;Nehra, Maninder
    • Journal of Multimedia Information System
    • /
    • 제5권3호
    • /
    • pp.147-154
    • /
    • 2018
  • Natural language processing (NLP) is an emerging research area in which we study how machines can be used to perceive and alter the text written in natural languages. We can perform different tasks on natural languages by analyzing them through various annotational tasks like parsing, chunking, part-of-speech tagging and lexical analysis etc. These annotational tasks depend on morphological structure of a particular natural language. The focus of this work is part-of-speech tagging (POS tagging) on Hindi language. Part-of-speech tagging also known as grammatical tagging is a process of assigning different grammatical categories to each word of a given text. These grammatical categories can be noun, verb, time, date, number etc. Hindi is the most widely used and official language of India. It is also among the top five most spoken languages of the world. For English and other languages, a diverse range of POS taggers are available, but these POS taggers can not be applied on the Hindi language as Hindi is one of the most morphologically rich language. Furthermore there is a significant difference between the morphological structures of these languages. Thus in this work, a POS tagger system is presented for the Hindi language. For Hindi POS tagging a hybrid approach is presented in this paper which combines "Probability-based and Rule-based" approaches. For known word tagging a Unigram model of probability class is used, whereas for tagging unknown words various lexical and contextual features are used. Various finite state machine automata are constructed for demonstrating different rules and then regular expressions are used to implement these rules. A tagset is also prepared for this task, which contains 29 standard part-of-speech tags. The tagset also includes two unique tags, i.e., date tag and time tag. These date and time tags support all possible formats. Regular expressions are used to implement all pattern based tags like time, date, number and special symbols. The aim of the presented approach is to increase the correctness of an automatic Hindi POS tagging while bounding the requirement of a large human-made corpus. This hybrid approach uses a probability-based model to increase automatic tagging and a rule-based model to bound the requirement of an already trained corpus. This approach is based on very small labeled training set (around 9,000 words) and yields 96.54% of best precision and 95.08% of average precision. The approach also yields best accuracy of 91.39% and an average accuracy of 88.15%.

A Multi-task Self-attention Model Using Pre-trained Language Models on Universal Dependency Annotations

  • Kim, Euhee
    • 한국컴퓨터정보학회논문지
    • /
    • 제27권11호
    • /
    • pp.39-46
    • /
    • 2022
  • 본 논문에서는 UD Korean Kaist v2.3 코퍼스를 이용하여 범용 품사 태깅, 표제어추출 그리고 의존 구문분석을 동시에 예측할 수 있는 보편적 다중 작업 모델을 제안하였다. 제안 모델은 사전학습 언어모델인 다국어 BERT (Multilingual BERT)와 한국어 BERT (KR-BERT와 KoBERT)을 대상으로 추가학습 (fine-tuning)을 수행하여 BERT 모델의 자가-집중 (self-attention) 기법과 그래프 기반 Biaffine attention 기법을 적용하여 제안 모델의 성능을 비교 분석하였다.

Using the PubAnnotation ecosystem to perform agile text mining on Genomics & Informatics: a tutorial review

  • Nam, Hee-Jo;Yamada, Ryota;Park, Hyun-Seok
    • Genomics & Informatics
    • /
    • 제18권2호
    • /
    • pp.13.1-13.6
    • /
    • 2020
  • The prototype version of the full-text corpus of Genomics & Informatics has recently been archived in a GitHub repository. The full-text publications of volumes 10 through 17 are also directly downloadable from PubMed Central (PMC) as XML files. During the Biomedical Linked Annotation Hackathon 6 (BLAH6), we experimented with converting, annotating, and updating 301 PMC full-text articles of Genomics & Informatics using PubAnnotation, a system that provides a convenient way to add PMC publications based on PMCID. Thus, this review aims to provide a tutorial overview of practicing the iterative task of named entity recognition with the PubAnnotation/PubDictionaries/TextAE ecosystem. We also describe developing a conversion tool between the Genia tagger output and the JSON format of PubAnnotation during the hackathon.

번역 품질 예측을 위한 HTER 분포 평준화 기반 인조 번역 품질 말뭉치 구축 방법 (Construction of an Artificial Training Corpus for The Quality Estimation Task based on HTER Distribution Equalization)

  • 박준수;이원기;신재훈;한효정;이종혁
    • 한국정보과학회 언어공학연구회:학술대회논문집(한글 및 한국어 정보처리)
    • /
    • 한국정보과학회언어공학연구회 2019년도 제31회 한글 및 한국어 정보처리 학술대회
    • /
    • pp.460-464
    • /
    • 2019
  • 번역 품질 예측은 기계번역 시스템이 생성한 번역문의 품질을 정답 번역문을 참고하지 않고 예측하는 과정으로, 번역문의 사후 교정을 위한 번역 오류 검출의 역할을 담당하는 중요한 연구이다. 본 논문은 문장 수준의 번역 품질 예측 문제를 HTER 구간의 분류 문제로 간주하여, 번역 품질 말뭉치의 HTER 분포 불균형으로 인한 성능 제약을 완화하기 위해 인조 사후 교정 말뭉치를 이용하는 방법을 제안하였다. 결과적으로 HTER 분포를 균등하게 조정한 학습 말뭉치가 그렇지 않은 쪽에 비해 번역 품질 예측에 더 효과적인 것을 보였다.

  • PDF

A review of Chinese named entity recognition

  • Cheng, Jieren;Liu, Jingxin;Xu, Xinbin;Xia, Dongwan;Liu, Le;Sheng, Victor S.
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • 제15권6호
    • /
    • pp.2012-2030
    • /
    • 2021
  • Named Entity Recognition (NER) is used to identify entity nouns in the corpus such as Location, Person and Organization, etc. NER is also an important basic of research in various natural language fields. The processing of Chinese NER has some unique difficulties, for example, there is no obvious segmentation boundary between each Chinese character in a Chinese sentence. The Chinese NER task is often combined with Chinese word segmentation, and so on. In response to these problems, we summarize the recognition methods of Chinese NER. In this review, we first introduce the sequence labeling system and evaluation metrics of NER. Then, we divide Chinese NER methods into rule-based methods, statistics-based machine learning methods and deep learning-based methods. Subsequently, we analyze in detail the model framework based on deep learning and the typical Chinese NER methods. Finally, we put forward the current challenges and future research directions of Chinese NER technology.