• Title/Abstract/Keywords: Corpus analysis

419 search results (processing time: 0.028 s)

A Comparison of Korean EFL Learners' Oral and Written Productions

  • Lee, Eun-Ha
    • English Language & Literature Teaching (영어어문교육)
    • /
    • Vol. 12, No. 2
    • /
    • pp.61-85
    • /
    • 2006
  • The purpose of the present study is to compare Korean EFL learners' speech corpus (i.e., oral productions) with their composition corpus (i.e., written productions). Four college students participated in the study. The composition corpus was collected through a writing assignment, and the speech corpus was gathered by audio-taping their oral presentations. The results of the data analysis indicate that (i) in terms of error frequency, young adult low-intermediate Korean EFL learners made frequent errors in determiners (mostly indefinite articles), vocabulary (mostly semantic errors), and prepositions, and the frequency order differed little between the speech corpus and the composition corpus; and (ii) the oral and written productions differed little in content, style (i.e., colloquial vs. literary), vocabulary selection, or error types and frequency. It is therefore assumed that the oral presentation proficiency of EFL learners at this learning stage depends heavily on how much and how well they are able to write. In other words, EFL learners' writing and speaking skills are closely correlated. This implies that the teacher does not need to separate teaching how to speak from teaching how to write and may use the same methods or strategies to help learners improve both skills. Furthermore, it will be more effective to teach writing before speaking, since learners have more opportunities to write than to speak in EFL contexts.

Analyzing Vocabulary Characteristics of Colloquial Style Corpus and Automatic Construction of Sentiment Lexicon

  • 강승식;원혜진;이민행
    • Smart Media Journal (스마트미디어저널)
    • /
    • Vol. 9, No. 4
    • /
    • pp.144-151
    • /
    • 2020
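  • In mobile environments, communication takes place through SMS text messages. The vocabulary used in SMS messages can be expected to differ from the vocabulary used in ordinary written-style Korean sentences. For example, ordinary written sentences begin and end properly and have well-formed constituents, whereas in an SMS corpus constituents are often omitted or replaced with abbreviated expressions. To analyze these characteristics of vocabulary use, we use previously constructed colloquial and written corpora. The experiments compare the vocabulary use of two colloquial corpora (an SMS text corpus and the Naver movie review corpus) with that of a written corpus (a raw corpus of written Korean). The comparison is restricted to words carrying the part-of-speech tag for adjectives (VA), and distinctive collexeme analysis is used to measure collostruction strength. The results confirm that emotion-expressing adjectives such as '좋-' (good), '죄송하-' (sorry), and '즐겁-' (joyful) are preferred in the SMS corpus, whereas adjectives related to evaluative expressions are preferred in the Naver movie review corpus. To build a sentiment lexicon automatically from the adjectives with high collostruction strength extracted in this process, a word embedding technique was used, and a total of 343,603 sentiment words were constructed automatically.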
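The distinctive collexeme step described above can be illustrated with a minimal sketch. The abstract does not state which association measure was used; the version below assumes the one-sided Fisher exact test that is commonly paired with this method, and the function names and all counts are hypothetical toy values, not figures from the paper.

```python
from math import comb, log10

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    hits = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return hits / comb(n, col1)

def distinctive_strength(word, freq_a, freq_b):
    """Signed -log10(p): positive if `word` prefers corpus A, negative for B."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    a, c = freq_a.get(word, 0), freq_b.get(word, 0)
    b, d = total_a - a, total_b - c
    if a * total_b >= c * total_a:  # relatively more frequent in corpus A
        p, sign = fisher_one_sided(a, b, c, d), 1
    else:
        p, sign = fisher_one_sided(c, d, a, b), -1
    return sign * -log10(max(p, 1e-300))  # guard against underflow to 0

# Toy adjective (VA) counts, invented for illustration only.
sms = {"좋-": 30, "재미있-": 10, "크-": 60}
reviews = {"좋-": 10, "재미있-": 40, "크-": 50}

for adjective in sms:
    print(adjective, round(distinctive_strength(adjective, sms, reviews), 2))
```

Adjectives with a large positive score would be candidates for the SMS-preferred list, and those with a large negative score for the review-preferred list. Exact binomial coefficients become slow on very large corpora, where a log-space computation would be a more practical choice.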

Study on Difference of Wordvectors Analysis Induced by Text Preprocessing for Deep Learning

  • 고광호
    • The Journal of the Convergence on Culture Technology (문화기술의 융합)
    • /
    • Vol. 8, No. 5
    • /
    • pp.489-495
    • /
    • 2022
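  • For LSTM, a deep learning technique for building a language model, the results vary with the way the training corpus is preprocessed. In this study, an LSTM model was trained on a well-known literary work (the poetry collection of 기형도) used as the corpus. Different sets of word vectors are obtained depending on whether the original text is used as-is or particles and endings are removed. For each preprocessing method, we compared the results of similarity and analogy operations, the positions of the word vectors on a plane, and the text generated by the language model. When a literary work is used as the corpus, the words returned by these operations differ by preprocessing method, but the similarity among the words is high and the analogy relations are strongly correlated. The positions of the words on the plane also change but do not depart from the original context, and the generated text could be appreciated as an unusual work that still resembles the original in mood. This analysis suggests that deep-learning language models can serve as a means of enjoying literary works objectively and in varied ways.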
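The similarity and analogy operations mentioned above can be sketched as follows. This is only an illustration of the arithmetic: the two-dimensional vectors and the English words are hypothetical toy values, not vectors learned from the poetry corpus, and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, query, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine(vectors[w], query))

def analogy(vectors, a, b, c):
    """Solve a : b :: c : ? by searching near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(vectors, target, exclude={a, b, c})

# Toy vectors, invented for illustration only.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

print(analogy(toy_vectors, "man", "woman", "king"))
```

Running the same queries against two vector sets, one trained on the raw text and one trained after removing particles and endings, is one way to compare preprocessing methods in the manner the abstract describes.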

Corpus-based Analysis on Vocabulary Found in 『Donguibogam』

  • 정지훈;김동율
    • The Journal of Korean Medical History (한국의사학회지)
    • /
    • Vol. 28, No. 1
    • /
    • pp.135-141
    • /
    • 2015
  • The purpose of this study is to analyze the vocabulary of "Donguibogam", a medical book from mid-Chosun, using corpus-based analysis, a method of text analysis. The analysis shows that "Donguibogam" contains a total of 871,000 words and uses a total of 5,130 Chinese characters, of which 2,430 account for 99% of the entire text. The 20 most frequent Chinese characters are mainly function words, which shows that "Donguibogam", like other books, is written in complete sentences. A comparison of the chapters of "Donguibogam" shows that Remedies and Acupuncture have lower frequencies of function words than Internal Medicine, External Medicine, and Miscellaneous Diseases. "Yixuerumen (Introduction to Medicine)", which strongly influenced "Donguibogam", has lower frequencies of function words among its most frequent words than "Donguibogam" does. This may be because "Yixuerumen" keeps the form of Chileonjeolgu (a quatrain with seven Chinese characters in each line) and adds footnotes below it. Corpus-based analysis helps identify the words mainly used in a medical book by measuring their frequencies. The authors therefore suggest that the results of this analysis can be used for teaching Chinese characters at colleges of Korean Medicine.
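The coverage figure reported above (2,430 characters accounting for 99% of the text) is the kind of number a cumulative frequency count produces. A minimal sketch follows, assuming the text is available as a plain string; the function name and the sample string are hypothetical, and the sample is not taken from "Donguibogam".

```python
from collections import Counter

def types_needed_for_coverage(text, target=0.99):
    """Number of most-frequent characters needed to cover `target` of the text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)

# Toy sample: one character repeated eight times plus two rarer characters.
sample = "之之之之之之之之也者"
print(types_needed_for_coverage(sample, target=0.85))
```

The same ranked list also gives the most frequent characters directly through `counts.most_common(20)`, which is how a top-20 list like the one in the abstract could be inspected for function words.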

A Novel Theory of Support in Social Media Discourse

  • Solomon, Bazil Stanley
    • Asia Pacific Journal of Corpus Research (아시아태평양코퍼스연구)
    • /
    • Vol. 1, No. 1
    • /
    • pp.95-125
    • /
    • 2020
  • This paper aims to inform people how to support each other on social media. It alludes to an architecture for social media discourse and proposes a novel theory of support in such discourse. Its methodological contribution is to combine predominantly artificial intelligence methods with corpus linguistic analysis, applied to a large-scale dataset of anonymised diabetes-related users' posts from the Facebook platform. Log-likelihood and precision measures help with validation, and a multi-method approach with Discourse Analysis helps in understanding any potential patterns. People living with diabetes are found to employ sophisticated high-frequency patterns of device-enabled categories of purpose and content, for example linguistic forms of Advice with stance-taking and targets such as Diabetes, amongst other interactional ways. Uncertainty and variation of effect can be displayed when sharing information for support. The implications of the new theory are aimed at healthcare communicators and corpus linguists and, as preliminary work, at AI support-bots. These bots may be programmed to use the language patterns to automatically support people who need them.

A Corpus-based Analysis of EFL Learners' Use of Discourse Markers in Cross-cultural Communication

  • Min, Sujung
    • English Language & Literature Teaching (영어어문교육)
    • /
    • Vol. 17, No. 3
    • /
    • pp.177-194
    • /
    • 2011
  • This study examines the use of discourse markers in cross-cultural communication between EFL learners in an e-learning environment. It analyzes a corpus drawn from an interactive website with a bulletin board system, through which college students of English at Japanese and Korean universities discussed local and global issues with each other. It compares the use of discourse markers in the learners' corpus with that in a corpus of native English speakers. The results indicate that discourse markers are useful interactional devices for structuring and organizing discourse. EFL learners are found to use referentially and cognitively functional discourse markers more frequently and to use other markers relatively rarely. Native speakers are found to use a wider variety of discourse markers for different functions. Suggestions are made for using computer corpora in understanding EFL learners' language difficulties and helping them become more interactionally competent speakers.
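Comparing marker use across a learner corpus and a native-speaker corpus of different sizes usually calls for normalized frequencies. The sketch below assumes a rate per 1,000 tokens and handles single-word markers only; the marker list, the function name, and the two sample texts are hypothetical and are not taken from the study's data.

```python
import re
from collections import Counter

def marker_rates(text, markers, per=1000):
    """Occurrences of each single-word marker per `per` tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {m: 0.0 for m in markers}
    counts = Counter(t for t in tokens if t in markers)
    return {m: counts[m] * per / len(tokens) for m in markers}

# Hypothetical marker list and toy texts, invented for illustration only.
MARKERS = {"and", "but", "so", "because", "well"}
learner_text = "I like it because it is good and I think so because it is fun"
native_text = "Well, I like it, but it is long, so I skipped the end"

print(marker_rates(learner_text, MARKERS))
print(marker_rates(native_text, MARKERS))
```

Multi-word markers such as "you know" would need phrase matching rather than token lookup, and a functional classification of each marker still requires manual annotation.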

Using Corpora for Studying English Grammar

  • Kwon, Heok-Seung
    • Korean Journal of English Language and Linguistics (한국영어학회지:영어학)
    • /
    • Vol. 4, No. 1
    • /
    • pp.61-81
    • /
    • 2004
  • This paper will look at some grammatical phenomena which will illustrate some of the questions that can be addressed with a corpus-based approach. We will use this approach to investigate the following subjects in English grammar: number ambiguity, subject-verb concord, concord with measure expressions, and (reflexive) pronoun choice in coordinated noun phrases. We will emphasize the distinctive features of the corpus-based approach, particularly its strengths in investigating language use, as opposed to traditional descriptions or prescriptions of structure in English grammar. This paper will show that a corpus-based approach has made it possible to conduct new kinds of investigations into grammar in use and to expand the scope of earlier investigations. Native speakers rarely have accurate information about frequency of use. A large representative corpus (i.e., The British National Corpus) is one of the most reliable sources of frequency information. It is important to base an analysis of language on real data rather than intuition. Any description of grammar is more complete and accurate if it is based on a body of real data.

Statistical Analysis Between Size and Balance of Text Corpus by Evaluation of the Effect of Interview Sentence in Language Modeling

  • 정의정;이영직
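    • Proceedings of the Acoustical Society of Korea Conference (한국음향학회:학술대회논문집)
    • /
    • Proceedings of the Acoustical Society of Korea 2002 Summer Conference, Vol. 21, No. 1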
    • /
    • pp.87-90
    • /
    • 2002
  • This paper statistically analyzes the relationship between the size and the balance of a text corpus by evaluating the effect of interview sentences on the language model of a Korean broadcast news transcription system. The ultimate purpose of the system is to recognize not interview speech but the anchor's and reporters' speech in broadcast news shows. However, interview sentences make up approximately 15% of the text corpus gathered for constructing the language model. Interview sentences differ in character from the anchor's and reporters' sentences in various ways and therefore disturb anchor- and reporter-oriented language modeling. We evaluate the effect of interview sentences on the language model and analyze the relationship between corpus size and balance by repeating the same experimental procedure while varying the size of the corpus.
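The effect described above, where out-of-domain interview text disturbs an anchor-oriented language model, can be illustrated with a toy perplexity comparison. The abstract does not say which model type or evaluation metric was used; this sketch assumes an add-one smoothed unigram model and perplexity, and every sentence in it is invented.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count whitespace-separated tokens across all training sentences."""
    return Counter(word for s in sentences for word in s.split())

def perplexity(counts, vocab_size, sentences):
    """Perplexity of an add-one smoothed unigram model on `sentences`."""
    total = sum(counts.values())
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        for word in s.split():
            log_prob += math.log((counts[word] + 1) / (total + vocab_size))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Toy data, invented for illustration only.
anchor = ["the government announced a new policy",
          "the minister announced the budget"]
interview = ["well i think um it is hard",
             "um i do not know"]
test = ["the government announced the budget"]

# Shared vocabulary so that both models are scored over the same event space.
vocab_size = len({w for s in anchor + interview + test for w in s.split()})

pp_anchor = perplexity(train_unigram(anchor), vocab_size, test)
pp_mixed = perplexity(train_unigram(anchor + interview), vocab_size, test)
print(round(pp_anchor, 2), round(pp_mixed, 2))
```

In this toy case the model trained on anchor sentences alone scores lower perplexity on anchor-style test text than the mixed model. Repeating the comparison at several corpus sizes would follow the varying-size procedure the abstract describes.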

Identifying Key Grammatical Errors of Japanese English as a Foreign Language Learners in a Learner Corpus: Toward Focused Grammar Instruction with Data-Driven Learning

  • Atsushi Mizumoto;Yoichi Watari
    • Asia Pacific Journal of Corpus Research (아시아태평양코퍼스연구)
    • /
    • Vol. 4, No. 1
    • /
    • pp.25-42
    • /
    • 2023
  • The number of studies on data-driven learning (DDL) has increased in recent years, and DDL's overall effectiveness as an L2 (second language) teaching methodology has been reported to be high. However, the degree of its effectiveness in grammar instruction, particularly for the goal of correcting errors in L2 writing, is still unclear. To provide guidelines for focused grammar instruction with DDL in the Japanese classroom setting, we aimed to identify the typical grammatical errors made by Japanese learners in the Cambridge Learner Corpus First Certificate in English (CLC FCE) dataset. The results revealed that three error types (nouns, articles, and prepositions) should be addressed in DDL grammar instruction for Japanese English as a foreign language (EFL) learners. In light of the findings, pedagogical implications and suggestions for future DDL research and practice are discussed.

An Attempt to Measure the Familiarity of Specialized Japanese in the Nursing Care Field

  • Haihong Huang;Hiroyuki Muto;Toshiyuki Kanamaru
    • Asia Pacific Journal of Corpus Research (아시아태평양코퍼스연구)
    • /
    • Vol. 4, No. 2
    • /
    • pp.57-74
    • /
    • 2023
  • Having a firm grasp of technical terms is essential for learners of Japanese for Specific Purposes (JSP). This research aims to analyze Japanese nursing care vocabulary based on objective corpus-based frequency and subjectively rated word familiarity. For this purpose, we constructed a text corpus centered on the National Examination for Certified Care Workers to extract nursing care keywords. The Log-Likelihood Ratio (LLR) was used as the statistical criterion for keyword identification, giving a list of 300 keywords as target words for a further word recognition survey. The survey involved 115 participants of whom 51 were certified care workers (CW group) and 64 were individuals from the general public (GP group). These participants rated the familiarity of the target keywords through crowdsourcing. Given the limited sample size, Bayesian linear mixed models were utilized to determine word familiarity rates. Our study conducted a comparative analysis of word familiarity between the CW group and the GP group, revealing key terms that are crucial for professionals but potentially unfamiliar to the general public. By focusing on these terms, instructors can bridge the knowledge gap more efficiently.
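Keyword identification with the Log-Likelihood Ratio, as mentioned above, can be sketched as follows. This uses the two-corpus log-likelihood formula that is common in corpus linguistics; whether the study used exactly this formulation is an assumption, and the function names, word lists, and counts are hypothetical toy values.

```python
import math

def log_likelihood(a, b, c, d):
    """LLR for a word with frequency a in a target corpus of c tokens
    and frequency b in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target, reference, top_n=300):
    """Words overused in `target` relative to `reference`, ranked by LLR."""
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words relatively more frequent in target
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]

# Toy frequency lists, invented for illustration only.
nursing_corpus = {"care": 20, "the": 50, "dementia": 10}
general_corpus = {"care": 2, "the": 600, "weather": 198}

print(keywords(nursing_corpus, general_corpus, top_n=2))
```

The default `top_n=300` mirrors the size of the keyword list reported in the abstract. The familiarity ratings and Bayesian mixed models are a separate step and are not covered by this sketch.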