• Title/Abstract/Keywords: Corpus-based Study

204 search results (processing time: 0.024 s)

Quantifying L2ers' phraseological competence and text quality in L2 English writing

  • 권준혁;김재준;김유래;박명관;송상헌
    • 한국어정보학회:학술대회논문집
    • /
    • 한국어정보학회 2017년도 제29회 한글및한국어정보처리학술대회
    • /
    • pp.281-284
    • /
    • 2017
  • Building on studies of multi-word combinations, that is, the field of phraseology, this study examines the relationship between text quality and phraseological competence in L2 English writing, following Yves Bestgen et al. (2014). Bigrams from L2 writers' texts were scored against a reference corpus, GloWbE (Corpus of Global Web-based English), using two association scores, t-score and Mutual Information (MI), which measure phraseological competence in opposite ways, rewarding frequent and infrequent combinations respectively. Taking a cross-sectional approach, we propose that essay quality correlates with the mean MI score of the bigrams extracted from YELC, the Yonsei English Learner Corpus. Negatively scored bigrams are also correlated with essay quality, in that these bigrams are absent from the reference corpus and are mostly ungrammatical. This indicates that an increase in the proportion of negatively scored bigrams lowers essay quality. The conclusion discusses essay quality as scored by MI and t-score under the cross-sectional approach, and applications to teaching methods and the assessment of second language writing proficiency.
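As a rough illustration of the two association scores named in the abstract above, the sketch below computes MI and t-score for a single bigram with the textbook formulas MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the observed bigram frequency in the reference corpus and E = f(w1) * f(w2) / N is the frequency expected under independence. The function name, its parameters, and the choice to return None for bigrams absent from the reference corpus are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
import math


def association_scores(pair_freq, w1_freq, w2_freq, corpus_size):
    """Return (MI, t-score) for one bigram from reference-corpus counts.

    Hypothetical helper: returns None when the bigram never occurs in the
    reference corpus, since neither score is defined for zero observations.
    """
    if pair_freq == 0:
        return None
    # Expected co-occurrence frequency if the two words were independent.
    expected = w1_freq * w2_freq / corpus_size
    # MI compares observed and expected frequency on a log2 scale.
    mi = math.log2(pair_freq / expected)
    # The t-score scales the raw difference by the square root of the observed count.
    t_score = (pair_freq - expected) / math.sqrt(pair_freq)
    return mi, t_score


# Toy counts (not taken from GloWbE or YELC).
mi, t_score = association_scores(pair_freq=8, w1_freq=16, w2_freq=32, corpus_size=1024)
print(mi, t_score)
```

A per-essay measure such as the mean MI score mentioned in the abstract would then be the average of `mi` over the essay's scored bigrams.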

Corpus-based Analysis on Vocabulary Found in 『Donguibogam』

  • 정지훈;김동율
    • 한국의사학회지
    • /
    • Vol. 28, No. 1
    • /
    • pp.135-141
    • /
    • 2015
  • The purpose of this study is to analyze the vocabulary found in "Donguibogam", one of the medical books of mid-Chosun, through corpus-based analysis, one of the methods of text analysis. According to the analysis, "Donguibogam" contains a total of 871,000 words, and a total of 5,130 Chinese characters are used in it. Among them, 2,430 characters account for 99% of the entire text. The 20 most frequent Chinese characters are mainly function words, which shows that "Donguibogam", like other books, is written in complete sentences. Comparing the chapters of "Donguibogam", Remedies and Acupuncture showed lower frequencies of function words than Internal Medicine, External Medicine, and Miscellaneous Diseases. "Yixuerumen (Introduction to Medicine)", which greatly influenced "Donguibogam", has lower frequencies of function words among its most frequent words than "Donguibogam" does. This may be because "Yixuerumen" keeps the form of Chileonjeolgu (a quatrain with seven Chinese characters in each line) and adds footnotes below it. Corpus-based analysis helps us identify the words mainly used in a medical book by measuring their frequencies. Therefore, this study suggests that the results of the analysis can be used for teaching Chinese characters at colleges of Korean Medicine.
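The coverage figure in the abstract above (2,430 characters accounting for 99% of the text) is the kind of result a cumulative frequency count produces. The following is a minimal sketch of that calculation, assuming the text is available as a plain string; the function name and the decision to ignore whitespace are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter


def types_needed_for_coverage(text, target=0.99):
    """Return how many of the most frequent characters cover `target` of the text.

    Hypothetical helper: whitespace is ignored and every other character
    counts as one token.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = 0
    # Walk the characters from most to least frequent, accumulating coverage.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return 0  # Empty input: no characters to count.


# Toy input (not from Donguibogam): 'a' occurs 8 times, 'b' and 'c' once each.
print(types_needed_for_coverage("aaaaaaaabc", target=0.85))
```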

A Corpus-Based Study of the Use of HEART and HEAD in English

  • Oh, Sang-suk
    • 한국언어정보학회지:언어와정보
    • /
    • Vol. 18, No. 2
    • /
    • pp.81-102
    • /
    • 2014
  • The purpose of this paper is to provide corpus-based quantitative analyses of HEART and HEAD in order to examine their actual usage and to consider some cognitive linguistic aspects associated with their use. Two corpora, COCA and COHA, are used for the analysis. The analysis of COCA reveals that the total frequency of HEAD is much higher than that of HEART, and that the figurative use of HEART (60%) is about twice as frequent as its literal use (32%); by contrast, the figurative use of HEAD (41%) is only slightly more frequent than its literal use (38%). Among all four genres, both lexemes occur most frequently in fiction and then in magazines. Over the past two centuries, the use of HEART has been steadily decreasing; by contrast, the use of HEAD has been steadily increasing. It is assumed that the decreasing use of HEART is partly due to the decrease in its figurative use, and that the increasing use of HEAD is attributable to its diverse meanings, the increase in its lexical use, and the partial increase in its figurative use. The analysis of the verbs and adjectives that collocate before HEART and HEAD, as well as of the modifying and predicating forms of HEART and HEAD, also provides relevant information on the usage of the two lexemes. This paper shows that quantitative information aids understanding not only of the actual usage of the two lexemes but also of the cognitive forces working behind it.

Effects of Corpus Use on Error Identification in L2 Writing

  • Yoshiho Satake
    • 아시아태평양코퍼스연구
    • /
    • Vol. 4, No. 1
    • /
    • pp.61-71
    • /
    • 2023
  • This study examines the effects of data-driven learning (DDL), an approach employing corpora for inductive language pattern learning, on error identification in second language (L2) writing. The data consist of error identification instances from fifty-five participants, compared across different reference materials: the Corpus of Contemporary American English (COCA), dictionaries, and no reference materials. There are three significant findings. First, the use of COCA was effective for identifying collocational and form-related errors, owing to inductive inference drawn from multiple example sentences. Second, dictionaries were beneficial for identifying lexical errors, where information about meaning was helpful. Finally, the participants often employed a strategic approach, identifying many simple errors without reference materials. However, while it maximized error identification, this strategy also led to mislabeling correct expressions as errors. The author concludes that the strategic selection of reference materials can significantly enhance the effectiveness of error identification in L2 writing. The use of a corpus offers advantages such as easy access to target phrases and frequency information, features that are especially useful given that most errors were collocational and form-related. The findings suggest that teachers should guide learners to use reference materials appropriate to the error type when identifying errors.

A corpus-based study on the effects of voicing and gender on American English Fricatives

  • 윤태진
    • 말소리와 음성과학
    • /
    • Vol. 10, No. 2
    • /
    • pp.7-14
    • /
    • 2018
  • The paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of voicing in rendering fricatives in American English. The TIMIT database includes 630 talkers and 2,342 different sentences, and comprises more than five hours of speech. Acoustic analyses are conducted in the domain of spectral and temporal properties by treating gender, voicing, and place of articulation as independent factors. The results of the acoustic analyses revealed that acoustic signals interact in a complex way to signal the gender, place, and voicing of fricatives. Classification experiments using a multiclass support vector machine (SVM) revealed that 78.7% of fricatives are correctly classified. The majority of errors stem from the misclassification of /θ/ as [f] and /ʒ/ as [z]. The average accuracy of gender classification is 78.7%. Most errors result from the classification of female speakers as male speakers. The paper contributes to the understanding of the effects of voicing and gender on fricatives in a large-scale speech corpus.

Predicting CEFR Levels in L2 Oral Speech, Based on Lexical and Syntactic Complexity

  • Hu, Xiaolin
    • 아시아태평양코퍼스연구
    • /
    • Vol. 2, No. 1
    • /
    • pp.35-45
    • /
    • 2021
  • With the widespread adoption of the Common European Framework of Reference (CEFR) scales, many studies attempt to apply them in routine teaching and rater training, while more evidence regarding criterial features at different CEFR levels is still urgently needed. The current study aims to explore complexity features that distinguish and predict CEFR proficiency levels in oral performance. Using a quantitative, corpus-based approach, this research analyzed lexical and syntactic complexity features over 80 transcriptions (covering the A1, A2, and B1 CEFR levels and native speakers), based on an interview test, the Standard Speaking Test (SST). ANOVA and correlation analysis were conducted to exclude insignificant complexity indices before the discriminant analysis. As a result, distinctive differences in complexity between CEFR speaking levels were observed, and with a combination of six major complexity features as predictors, 78.8% of the oral transcriptions were classified into the appropriate CEFR proficiency levels. This further confirms the possibility of predicting the CEFR level of L2 learners from their objective linguistic features. This study can serve as an empirical reference in language pedagogy, especially for L2 learners' self-assessment and teachers' prediction of students' proficiency levels. It also offers implications for the validation of rating criteria and the improvement of rating systems.

A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus

  • 박찬준;임희석
    • 디지털융복합연구
    • /
    • Vol. 18, No. 6
    • /
    • pp.271-277
    • /
    • 2020
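  • Machine translation refers to software in which a computer translates a source language into a target language. After rule-based and statistical machine translation, research on neural machine translation has recently become very active. One of the key ingredients of neural machine translation is a high-quality parallel corpus, yet until now high-quality parallel corpora for language pairs involving Korean have been hard to obtain. Recently, AI HUB of the National Information Society Agency released a high-quality Korean-English parallel corpus of 1.6 million sentences for machine translation. This paper therefore seeks to verify the quality of each dataset by comparing the performance obtained with the data released by AI HUB against that obtained with OpenSubtitles, the most widely used Korean-English parallel data to date. For the test data, the test set released by IWSLT, an official test set for Korean-English machine translation, was used to ensure greater objectivity. The experiments showed better performance than previous Korean-English machine translation papers evaluated on the same test set, which demonstrates the importance of high-quality data.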

Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering

  • 문현석;박찬준;어수경;박정배;임희석
    • 한국융합학회논문지
    • /
    • Vol. 12, No. 5
    • /
    • pp.1-7
    • /
    • 2021
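  • In recent machine translation research, models are pretrained on large monolingual corpora and then fine-tuned on parallel corpora. Many studies tend to increase the amount of data used in the pretraining stage, but it is hard to conclude that the amount of data must be increased in order to improve machine translation performance. Through experiments based on the mBART model with parallel corpus filtering, this study shows that even a smaller amount of data can yield better machine translation performance if the data is of high quality. In the experiments, the pretrained model to which parallel corpus filtering was applied outperformed the model without it. These results show that it is important to consider the quality of data rather than its quantity, and that the process can serve as a guideline for future corpus construction.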
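The abstract above does not say which filtering rules were applied. Purely as an illustration of what parallel corpus filtering can look like, the sketch below uses three common heuristics: dropping pairs with an empty side, pairs whose token-length ratio is extreme, and exact duplicates. The function name, the `max_ratio` threshold, and the whitespace tokenization are all assumptions and should not be read as the Filter-mBART procedure.

```python
def filter_parallel_corpus(pairs, max_ratio=3.0):
    """Keep (source, target) sentence pairs that pass simple quality heuristics.

    Hypothetical filter: the rules and the default threshold are illustrative.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        src_len, tgt_len = len(src.split()), len(tgt.split())
        # Rule 1: drop pairs where either side is empty.
        if src_len == 0 or tgt_len == 0:
            continue
        # Rule 2: drop pairs whose lengths differ by more than max_ratio.
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue
        # Rule 3: drop exact duplicates of a pair that was already kept.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Toy data: one clean pair, its duplicate, an empty source, and a length mismatch.
sample = [
    ("안녕하세요", "Hello"),
    ("안녕하세요", "Hello"),
    ("", "Empty source"),
    ("네", "Yes , that is exactly what I meant to say"),
]
print(filter_parallel_corpus(sample))
```

Whitespace token counts are a crude proxy for Korean-English pairs, so a real pipeline would more likely compare subword or character lengths.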

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning

  • 육지희;송민
    • 정보관리학회지
    • /
    • Vol. 35, No. 2
    • /
    • pp.63-88
    • /
    • 2018
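  • This study selected features using an LDA topic model and Doc2Vec, a word-embedding-based technique that applies deep learning, and evaluated differences in classification performance according to the size and type of the feature set and the classification algorithm. It also identified an appropriate feature-set size and determined which feature set performs best when feature sets are composed differently according to location within the document. Finally, in the deep learning experiments, classification performance was compared according to the number of training iterations and the presence or absence of contextual inference information. For the experimental collection, biomedical scholarly articles provided by PMC were collected and categorized according to a disease classification scheme to build Disease-35083. The study identified the type and size of the best-performing feature set and, by demonstrating efficiency in training time, presented a feature set with potential for extension as features. It also compared deep learning with existing methods and proposed suitable methods for different classification settings.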

A Corpus-Based Analysis of Crosslinguistic Influence on the Acquisition of Concessive Conditionals in L2 English

  • Newbery-Payton, Laurence
    • 아시아태평양코퍼스연구
    • /
    • Vol. 3, No. 1
    • /
    • pp.35-49
    • /
    • 2022
  • This study examines crosslinguistic influence on the use of concessive conditionals by Japanese EFL learners. Contrastive analysis suggests that Japanese native speakers may overuse the concessive conditional "even if" because of its partial similarity to Japanese concessive conditionals, whose formal and semantic restrictions are fewer than those of English concessive conditionals. This hypothesis is tested using data from the written module of the International Corpus Network of Asian Learners of English (ICNALE). Comparison of Japanese native speakers with English native speakers and Chinese native speakers reveals the following trends. First, Japanese native speakers tend to overuse concessive conditionals compared to English native speakers, while similar overuse is not observed in the Chinese native speaker data. Second, non-nativelike uses of "even if" appear in contexts that allow the use of concessive conditionals in Japanese. Third, while overuse and infelicitous use of "even if" are observed at all proficiency levels, formal errors are restricted to learners at lower proficiency levels. These findings suggest that crosslinguistic influence does occur in the use of concessive conditionals, and that its particular realization is affected by L2 proficiency, with formal crosslinguistic influence mediated at an earlier stage than semantic crosslinguistic influence.