• Title/Summary/Keyword: Corpus-based Study


An Intonation Study of Predicate Endings in Current Korean - From the final endings "-a/e, -tʃijo" and "-p/simnida" - (현대 서울말 평서문에 나타나는 억양 연구 - 어말어미 "-아/어, -지요" 와 "-ㅂ/습니다" 를 중심으로 -)

  • Yu, Ki-Won
    • Proceedings of the KSPS conference / 2005.04a / pp.3-7 / 2005
  • This study identifies the prototypes and characteristics of the intonation of the final endings "-a/e, -tʃijo" and "-p/simnida" in modern Korean declarative sentences, using a spoken corpus built from current radio broadcasts. The results are as follows: (1) A balanced spoken corpus and a standard for determining rhythmic boundaries are needed to build an intonation model for speech synthesis. (2) Korean intonation units have a split word tone that includes the nuclear tone and the pre-nuclear tone, and the pre-nuclear tone makes the nuclear tone more detailed. (3) Separate intonation models for male and female speakers were built using t-tests in SPSS. (4) The standard intonation model is divided into the '-ajo' type and the '-nida' type.


Vocabulary Analyzer Based on CEFR-J Wordlist for Self-Reflection (VACSR) Version 2

  • Yukiko Ohashi;Noriaki Katagiri;Takao Oshikiri
    • Asia Pacific Journal of Corpus Research / v.4 no.2 / pp.75-87 / 2023
  • This paper presents a revised version of the vocabulary analyzer for self-reflection (VACSR), called VACSR v.2.0. The initial version of the VACSR automatically analyzes the occurrences and the level of vocabulary items in the transcribed texts, indicating the frequency, the unused vocabulary items, and those not belonging to either scale. However, it overlooked words with multiple parts of speech due to their identical headword representations. It also needed to provide more explanatory result tables from different corpora. VACSR v.2.0 overcomes the limitations of its predecessor. First, unlike VACSR v.1, VACSR v.2.0 distinguishes words that are different parts of speech by syntactic parsing using Stanza, an open-source Python library. It enables the categorization of the same lexical items with multiple parts of speech. Second, VACSR v.2.0 overcomes the limited clarity of VACSR v.1 by providing precise result output tables. The updated software compares the occurrence of vocabulary items included in classroom corpora for each level of the Common European Framework of Reference-Japan (CEFR-J) wordlist. A pilot study utilizing VACSR v.2.0 showed that, after converting two English classes taught by a preservice English teacher into corpora, the headwords used mostly corresponded to CEFR-J level A1. In practice, VACSR v.2.0 will promote users' reflection on their vocabulary usage and can be applied to teacher training.
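The role of part-of-speech tags described in this abstract can be illustrated with a minimal sketch that looks up (lemma, POS) pairs instead of bare headwords. The miniature wordlist, the levels assigned in it, and the function name `level_profile` are hypothetical and chosen only for illustration; the tokens are assumed to have been lemmatized and tagged already (the abstract names Stanza for that step), and this is not the VACSR implementation.

```python
from collections import Counter

# Hypothetical miniature wordlist keyed by (lemma, POS); real CEFR-J entries differ.
WORDLIST = {
    ("book", "NOUN"): "A1",
    ("book", "VERB"): "B1",
    ("open", "VERB"): "A1",
    ("page", "NOUN"): "A1",
}

def level_profile(tagged_tokens):
    """Count tokens per wordlist level; unknown (lemma, POS) pairs go to 'unlisted'."""
    counts = Counter()
    for lemma, pos in tagged_tokens:
        counts[WORDLIST.get((lemma.lower(), pos), "unlisted")] += 1
    return counts

# Tokens assumed to be lemmatized and POS-tagged upstream (e.g. by a parser).
tokens = [("open", "VERB"), ("book", "NOUN"), ("book", "VERB"), ("hotel", "NOUN")]
profile = level_profile(tokens)
```

Keying on the pair rather than the headword is what lets the noun and verb uses of the same word fall into different levels, which is the limitation of the first version that the abstract mentions.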

Building and Analyzing Panic Disorder Social Media Corpus for Automatic Deep Learning Classification Model (딥러닝 자동 분류 모델을 위한 공황장애 소셜미디어 코퍼스 구축 및 분석)

  • Lee, Soobin;Kim, Seongdeok;Lee, Juhee;Ko, Youngsoo;Song, Min
    • Journal of the Korean Society for information Management / v.38 no.2 / pp.153-172 / 2021
  • This study builds a deep learning-based classification model, using a panic disorder corpus constructed for the study, to examine the characteristics of panic disorder and to classify documents that show a panic disorder tendency. To this end, 5,884 documents collected from social media were manually annotated according to a diagnostic manual for mental disorders and classified into panic disorder-prone and non-panic-disorder documents. TF-IDF scores were then calculated and word co-occurrence analysis was performed to analyze the lexical characteristics of the corpus. In addition, symptom frequencies were measured and the co-occurrence between annotated symptoms was calculated to analyze the characteristics of panic disorder symptoms and the relationships between them. We also evaluated the performance of deep learning-based classification models. Three pre-trained models, multilingual BERT, KoBERT, and KcBERT, were used for classification, and KcBERT performed best. By examining the characteristics of panic disorder, this study shows that such analysis can support the early diagnosis and treatment of people with related symptoms and extends mental illness research to social media.
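The two lexical analyses named in this abstract, TF-IDF scoring and word co-occurrence counting, can be sketched roughly as follows. This is an illustrative sketch only: the toy documents and the function names are made up, the weighting used is the plain tf × log(N/df) variant, and the abstract does not say which TF-IDF variant or co-occurrence window the authors used.

```python
import math
from collections import Counter
from itertools import combinations

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df).

    docs is a list of non-empty token lists.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return scores

def cooccurrence(docs):
    """Count how many documents contain each unordered pair of terms."""
    pairs = Counter()
    for doc in docs:
        pairs.update(combinations(sorted(set(doc)), 2))
    return pairs

# Toy tokenized documents (hypothetical, not taken from the corpus).
docs = [["panic", "attack", "night"], ["panic", "breath"], ["night", "walk"]]
scores = tf_idf(docs)
pairs = cooccurrence(docs)
```

Terms that occur in fewer documents receive higher scores, so in the first toy document "attack" outranks "panic".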

A Comparative Study on Oral Fluency Between Korean Native Speakers and L2 Korean Learners in Speech Discourse - With Focus on Speech Rate, Pause, and Discourse Markers (발표 담화에서의 한국어 모어 화자와 한국어 학습자의 말하기 유창성 비교 연구 -발화 속도, 휴지, 담화표지를 중심으로-)

  • Lee, Jin;Jung, Jinkyung
    • Journal of Korean language education / v.29 no.4 / pp.137-168 / 2018
  • The purpose of this study is to lay the groundwork for a more objective evaluation of oral fluency by comparing the speech patterns of Korean native speakers and L2 Korean learners. To this end, the study analyzed speech materials from the 21st Century Sejong spoken corpus and a Korean learner corpus. We compared the oral fluency of Korean native speakers and Korean learners in terms of speech rate, pauses, and discourse markers. The results show that the learners' patterns differ from those of native speakers in all three respects; even as the learners' proficiency increased, they did not reach the oral fluency of native speakers. Finally, based on these results, we offer suggestions for setting evaluation criteria for the oral fluency of Korean learners.
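Two of the measures compared in this abstract, speech rate and pauses, could be computed along the following lines. The sketch assumes time-aligned speech stretches with syllable counts; the input format, the 0.25-second pause threshold, and the function name `fluency_measures` are assumptions made for illustration and are not taken from the study.

```python
def fluency_measures(segments, min_pause=0.25):
    """Compute simple fluency measures.

    segments: list of (start_sec, end_sec, syllable_count) speech stretches, in time order.
    min_pause: assumed minimum silence (seconds) that counts as a pause.
    """
    total_time = segments[-1][1] - segments[0][0]
    syllables = sum(n for _, _, n in segments)
    # Silence between the end of one stretch and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    return {
        "speech_rate": syllables / total_time,  # syllables per second, pauses included
        "pause_count": len(pauses),
        "pause_time": sum(pauses),
    }

# Hypothetical alignment for one short utterance.
segments = [(0.0, 2.0, 10), (2.5, 4.0, 8), (4.1, 5.0, 6)]
measures = fluency_measures(segments)
```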

Identification of Profane Words in Cyberbullying Incidents within Social Networks

  • Ali, Wan Noor Hamiza Wan;Mohd, Masnizah;Fauzi, Fariza
    • Journal of Information Science Theory and Practice / v.9 no.1 / pp.24-34 / 2021
  • The popularity of social networking sites (SNS) has facilitated communication between users. SNS help users in daily life in various ways, such as sharing opinions, keeping in touch with old friends, making new friends, and getting information. However, some users misuse SNS to belittle or hurt others with profanities, which is typical of cyberbullying incidents. In this study, we therefore identify profane words in the ASKfm corpus using a lexicon and analyze their distribution across four roles involved in cyberbullying: harasser, victim, bystander who assists the bully, and bystander who defends the victim. The evaluation focuses on the occurrences of profane words for each role in the corpus. The top 10 most common words in the corpus are also identified and presented in a graph. The analysis shows that the four roles use profane words with different weights and distributions, even though the profane words themselves are mostly the same. The harasser ranks first in the use of profane words compared with the other roles. The results can be explored further as a potential feature for a cyberbullying detection model based on machine learning. They also contribute to formulating a suitable representation and, in future work, to building a cyberbullying detection model based on the distribution of profane words across cyberbullying roles in social networks.
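A lexicon-based count of profane words per role, as described in this abstract, might look like the sketch below. The placeholder lexicon, the role labels, the example posts, and the whitespace tokenization are all hypothetical; the abstract does not specify the lexicon or the preprocessing the authors used.

```python
from collections import Counter, defaultdict

# Hypothetical placeholder lexicon; a real study would load a curated profanity list.
LEXICON = {"badword", "slur"}

def profanity_by_role(posts):
    """posts: iterable of (role, text). Returns {role: Counter of lexicon hits}."""
    hits = defaultdict(Counter)
    for role, text in posts:
        for token in text.lower().split():
            token = token.strip(".,!?")  # crude punctuation stripping
            if token in LEXICON:
                hits[role][token] += 1
    return hits

# Made-up posts labeled with the role of their author.
posts = [
    ("harasser", "You badword, total badword!"),
    ("victim", "Stop it."),
    ("bystander_defender", "Calling people a slur is not ok."),
]
hits = profanity_by_role(posts)
```

Per-role counters of this kind are the raw material for comparing the distribution of profane words across roles.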

A Study on Implementation of Emotional Speech Synthesis System using Variable Prosody Model (가변 운율 모델링을 이용한 고음질 감정 음성합성기 구현에 관한 연구)

  • Min, So-Yeon;Na, Deok-Su
    • Journal of the Korea Academia-Industrial cooperation Society / v.14 no.8 / pp.3992-3998 / 2013
  • This paper describes a method for adding an emotional speech corpus to a high-quality, large-corpus-based speech synthesizer in order to generate a variety of synthesized speech. We built the emotional speech corpus in a form that can be used in a waveform-concatenation speech synthesizer, and implemented a synthesizer that generates various kinds of synthesized speech through the same unit selection process as a normal speech synthesizer. A markup language is used for emotional input text. Emotional speech is generated when the input text matches the emotional speech corpus over the length of an intonation phrase; otherwise, normal speech is generated. The break indices (BIs) of emotional speech are more irregular than those of normal speech, so it is difficult to use the BIs generated by the synthesizer as they are. To solve this problem, we applied Variable Break[3] modeling. A Japanese speech synthesizer was used for the experiment. As a result, we obtained natural emotional synthesized speech using the break prediction module for normal speech synthesis.

Constructing for Korean Traditional culture Corpus and Development of Named Entity Recognition Model using Bi-LSTM-CNN-CRFs (한국 전통문화 말뭉치구축 및 Bi-LSTM-CNN-CRF를 활용한 전통문화 개체명 인식 모델 개발)

  • Kim, GyeongMin;Kim, Kuekyeng;Jo, Jaechoon;Lim, HeuiSeok
    • Journal of the Korea Convergence Society / v.9 no.12 / pp.47-52 / 2018
  • Named entity recognition (NER) extracts entity names that can have a unique meaning, such as persons (PS), locations (LC), and organizations (OG), from a document and determines the category of each extracted name. Recently, Bi-LSTM-CRF, which combines a Bi-LSTM model that considers the forward and backward directions of the input with a CRF that uses the transition probabilities between outputs, has shown excellent performance in deep learning-based named entity recognition. Models that use CNN and LSTM to build efficient character- and word-level embedding vectors have also performed well. In this study, we describe a Bi-LSTM-CNN-CRF model that enhances the features of a Korean named entity recognition system and propose a method for constructing a traditional culture corpus. We also present the results of training the feature-augmented model on the constructed corpus for Korean named entity recognition.

Comparing Initial Magnetic Resonance Imaging Findings to Differentiate between Krabbe Disease and Metachromatic Leukodystrophy in Children

  • Koh, Seok Young;Choi, Young Hun;Lee, Seul Bi;Lee, Seunghyun;Cho, Yeon Jin;Cheon, Jung-Eun
    • Investigative Magnetic Resonance Imaging / v.25 no.2 / pp.101-108 / 2021
  • Purpose: To identify characteristic magnetic resonance imaging (MRI) features to differentiate between Krabbe disease and metachromatic leukodystrophy (MLD) in young children. Materials and Methods: We collected all confirmed cases of Krabbe disease and MLD between October 2004 and September 2020 at Seoul National University Children's Hospital. Patients with initial MRI available were included. Their initial MRIs were retrospectively reviewed for the following: 1) presence of white matter signal abnormality involving the periventricular and deep white matter, subcortical white matter, internal capsule, brainstem, and cerebellum; 2) presence of volume decrease and signal alteration in the corpus callosum and thalamus; 3) presence of the tigroid sign; 4) presence of optic nerve hypertrophy; and 5) presence of enhancement or diffusion restriction. Results: Eleven children with Krabbe disease and 12 children with MLD were included in this study. There was no significant difference in age or symptoms at onset. Periventricular and deep white matter signal alterations sparing the subcortical white matter were present in almost all patients in both groups. More patients with Krabbe disease had T2 hyperintensities in the internal capsule and brainstem than patients with MLD. In contrast, more patients with MLD had T2 hyperintensities in the splenium and genu of the corpus callosum. No patient with Krabbe disease showed T2 hyperintensity in the corpus callosal genu. A decrease in volume in the corpus callosum and thalamus was more frequently observed in patients with Krabbe disease than in those with MLD. Other MRI findings including the tigroid sign and optic nerve hypertrophy were not significantly different between the two groups. Conclusion: Signal abnormalities in the internal capsule and brainstem, decreased thalamic volume, decreased splenial volume accompanied by signal changes, and absence of signal changes in the callosal genu portion were MRI findings suggestive of Krabbe disease rather than MLD based on initial MRI. Other MRI findings such as the tigroid sign could not help differentiate between these two diseases.

Syntactic Structure of English Split Infinitives from the Perspectives of Grammaticalization and Corpus (문법화와 코퍼스의 관점에서 본 영어 분리부정사 통사구조)

  • Kim, Yangsoon
    • The Journal of the Convergence on Culture Technology / v.6 no.3 / pp.245-251 / 2020
  • From the perspectives of grammaticalization and corpus analysis, this study examines what motivated the emergence of split infinitives in American English and discusses their justification on the basis of empirical corpus data from COHA and COCA. Split infinitives of the form [to + adverb + verb], formerly considered ungrammatical, are now fully grammatical in Present Day English (PDE). The corpus data confirm the legitimacy of split infinitives on empirical grounds such as clarifying sentences (i.e., disambiguation) and giving strongly focused readings. In addition, split infinitives are a natural consequence of the grammaticalization of the infinitival particle to and, most crucially, of the loss of verb movement. When verb movement to the T position does not occur in infinitival clauses, the word order is [to + AdvP + V], which yields the split infinitive. Split infinitives are no longer a matter of dispute, and as fully grammatical forms they will continue to increase in both formal and informal contexts.

A Study on the Simple Algorithm for Discrimination of Voiced Sounds (유성음 구간 검출을 위한 간단한 알고리즘에 관한 연구)

  • 장규철;우수영;박용규;유창동
    • The Journal of the Acoustical Society of Korea / v.21 no.8 / pp.727-734 / 2002
  • This paper proposes a simple algorithm for discriminating voiced sounds in speech. In addition to low-frequency energy and zero-crossing rate (ZCR), both of which have been widely used to identify voiced sounds, the proposed algorithm incorporates pitch variation to improve the discrimination rate. Evaluation on the TIMIT corpus shows a 13% improvement in the discrimination of voiced phonemes over the traditional algorithm that uses only energy and ZCR.
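The traditional baseline mentioned in this abstract, which labels frames using only energy and ZCR, can be sketched as below. This is not the proposed algorithm: it omits pitch variation, it uses full-band rather than low-frequency energy, and the thresholds, sampling rate, and synthetic test signals are assumptions chosen only for illustration.

```python
import math

def frame_features(frame):
    """Return (mean energy, zero-crossing rate) for one frame of samples."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)

def is_voiced(frame, energy_min=0.01, zcr_max=0.1):
    """Label a frame voiced when energy is high and ZCR is low (illustrative thresholds)."""
    energy, zcr = frame_features(frame)
    return energy >= energy_min and zcr <= zcr_max

rate = 8000  # assumed sampling rate in Hz
# Synthetic stand-ins: a 100 Hz tone for a voiced frame, a weak alternating signal for noise.
tone = [0.5 * math.sin(2 * math.pi * 100 * n / rate) for n in range(400)]
hiss = [0.05 * (-1) ** n for n in range(400)]
```

Voiced speech tends to combine high energy with few zero crossings, which is why the tone passes both thresholds while the weak, rapidly alternating signal fails them.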