• Title/Summary/Keyword: Corpus Utility

Search Result 5, Processing Time 0.018 seconds

KKMA : A Tool for Utilizing Sejong Corpus based on Relational Database (꼬꼬마 : 관계형 데이터베이스를 활용한 세종 말뭉치 활용 도구)

  • Lee, Dong-Joo;Yeon, Jong-Heum;Hwang, In-Beom;Lee, Sang-Goo
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.11
    • /
    • pp.1046-1050
    • /
    • 2010
  • Corpus is widely used as a fundamental resource for various purposes in linguistic studies. There are several large corpora such as Sejong corpus in Korea. However, it is hard to find a tool utilizing such large corpora. In this paper, we propose a method of utilizing Sejong corpus based on the relational database. We designed the relational database scheme to store corpus and implemented a Web-based application so that many researchers can easily access and utilize the Sejong corpus.

Corpus of Eye Movements in L3 Spanish Reading: A Prediction Model

  • Hui-Chuan Lu;Li-Chi Kao;Zong-Han Li;Wen-Hsiang Lu;An-Chung Cheng
    • Asia Pacific Journal of Corpus Research
    • /
    • v.5 no.1
    • /
    • pp.23-36
    • /
    • 2024
  • This research centers on the Taiwan Eye-Movement Corpus of Spanish (TECS), a specially created corpus comprising eye-tracking data from Chinese-speaking learners of Spanish as a third language in Taiwan. Its primary purpose is to explore the broad utility of TECS in understanding language learning processes, particularly the initial stages of language learning. Constructing this corpus involves gathering data on eye-tracking, reading comprehension, and language proficiency to develop a machine-learning model that predicts learner behaviors, and subsequently undergoes a predictability test for validation. The focus is on examining attention in input processing and their relationship to language learning outcomes. The TECS eye-tracking data consists of indicators derived from eye movement recordings while reading Spanish sentences with temporal references. These indicators are obtained from eye movement experiments focusing on tense verbal inflections and temporal adverbs. Chinese expresses tense using aspect markers, lexical references, and contextual cues, differing significantly from inflectional languages like Spanish. Chinese-speaking learners of Spanish face particular challenges in learning verbal morphology and tenses. The data from eye movement experiments were structured into feature vectors, with learner behaviors serving as class labels. After categorizing the collected data, we used two types of machine learning methods for classification and regression: Random Forests and the k-nearest neighbors algorithm (KNN). By leveraging these algorithms, we predicted learner behaviors and conducted performance evaluations to enhance our understanding of the nexus between learner behaviors and language learning process. Future research may further enrich TECS by gathering data from subsequent eye-movement experiments, specifically targeting various Spanish tenses and temporal lexical references during text reading. These endeavors promise to broaden and refine the corpus, advancing our understanding of language processing.

Developing a Korean Standard Speech DB (한국인 표준 음성 DB 구축)

  • Shin, Jiyoung;Jang, Hyejin;Kang, Younmin;Kim, Kyung-Wha
    • Phonetics and Speech Sciences
    • /
    • v.7 no.1
    • /
    • pp.139-150
    • /
    • 2015
  • The data accumulated in this database will be used to develop a speaker identification system. This may also be applied towards, but not limited to, fields of phonetic studies, sociolinguistics, and language pathology. We plan to supplement the large-scale speech corpus next year, in terms of research methodology and content, to better answer the needs of diverse fields. The purpose of this study is to develop a speech corpus for standard Korean speech. For the samples to viably represent the state of spoken Korean, demographic factors were considered to modulate a balanced spread of age, gender, and dialects. Nine separate regional dialects were categorized, and five age groups were established from individuals in their 20s to 60s. A speech-sample collection protocol was developed for the purpose of this study where each speaker performs five tasks: two reading tasks, two semi-spontaneous speech tasks, and one spontaneous speech task. This particular configuration of sample data collection accommodates gathering of rich and well-balanced speech-samples across various speech types, and is expected to improve the utility of the speech corpus developed in this study. Samples from 639 individuals were collected using the protocol. Speech samples were collected also from other sources, for a combined total of samples from 1,012 individuals.

An Exploratory Study of Collective E-Petitions Estimation Methodology Using Anomaly Detection: Focusing on the Voice of Citizens of Changwon City (이상탐지 활용 전자집단민원 추정 방법론에 관한 탐색적 연구: 창원시 시민의 소리 사례를 중심으로)

  • Jeong, Ha-Yeong
    • Informatization Policy
    • /
    • v.26 no.4
    • /
    • pp.85-106
    • /
    • 2019
  • Recently, there have been increasing cases of collective petitions filed in the electronic petitions system. However, there is no efficient management system, raising concerns on side effects such as increased administrative workload and mass production of social conflicts. Aimed at suggesting a methodology for estimating electronic collective petitions using anomaly detection and corpus linguistics-based content analysis, this study conducted the followings: i) a theoretical review of the concept of collective petitions, ii) estimation of electronic collective petitions using anomaly detection based on nonparametric unsupervised learning, iii) a content similarity analysis on petitions using n-gram cosine angle distance, and iv) a case study on the Voice of Citizens of Changwon City, through which the utility of the proposed methodology, policy implications and future tasks were reviewed.

Classification of nasal places of articulation based on the spectra of adjacent vowels (모음 스펙트럼에 기반한 전후 비자음 조음위치 판별)

  • Jihyeon Yun;Cheoljae Seong
    • Phonetics and Speech Sciences
    • /
    • v.15 no.1
    • /
    • pp.25-34
    • /
    • 2023
  • This study examined the utility of the acoustic features of vowels as cues for the place of articulation of Korean nasal consonants. In the acoustic analysis, spectral and temporal parameters were measured at the 25%, 50%, and 75% time points in the vowels neighboring nasal consonants in samples extracted from a spontaneous Korean speech corpus. Using these measurements, linear discriminant analyses were performed and classification accuracies for the nasal place of articulation were estimated. The analyses were applied separately for vowels following and preceding a nasal consonant to compare the effects of progressive and regressive coarticulation in terms of place of articulation. The classification accuracies ranged between approximately 50% and 60%, implying that acoustic measurements of vowel intervals alone are not sufficient to predict or classify the place of articulation of adjacent nasal consonants. However, given that these results were obtained for measurements at the temporal midpoint of vowels, where they are expected to be the least influenced by coarticulation, the present results also suggest the potential of utilizing acoustic measurements of vowels to improve the recognition accuracy of nasal place. Moreover, the classification accuracy for nasal place was higher for vowels preceding the nasal sounds, suggesting the possibility of higher anticipatory coarticulation reflecting the nasal place.