• Title/Abstract/Keywords: corpora

Search results: 249 items (processing time: 0.022 seconds)

Vocabulary Analyzer Based on CEFR-J Wordlist for Self-Reflection (VACSR) Version 2

  • Yukiko Ohashi;Noriaki Katagiri;Takao Oshikiri
    • 아시아태평양코퍼스연구
    • /
    • Vol. 4, No. 2
    • /
    • pp.75-87
    • /
    • 2023
  • This paper presents a revised version of the vocabulary analyzer for self-reflection (VACSR), called VACSR v.2.0. The initial version of the VACSR automatically analyzes the occurrences and the level of vocabulary items in the transcribed texts, indicating the frequency, the unused vocabulary items, and those not belonging to either scale. However, it overlooked words with multiple parts of speech due to their identical headword representations. It also needed to provide more explanatory result tables from different corpora. VACSR v.2.0 overcomes the limitations of its predecessor. First, unlike VACSR v.1, VACSR v.2.0 distinguishes words that are different parts of speech by syntactic parsing using Stanza, an open-source Python library. It enables the categorization of the same lexical items with multiple parts of speech. Second, VACSR v.2.0 overcomes the limited clarity of VACSR v.1 by providing precise result output tables. The updated software compares the occurrence of vocabulary items included in classroom corpora for each level of the Common European Framework of Reference-Japan (CEFR-J) wordlist. A pilot study utilizing VACSR v.2.0 showed that, after converting two English classes taught by a preservice English teacher into corpora, the headwords used mostly corresponded to CEFR-J level A1. In practice, VACSR v.2.0 will promote users' reflection on their vocabulary usage and can be applied to teacher training.
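The part-of-speech distinction described in this abstract can be illustrated with a minimal sketch. It is not the VACSR implementation: it assumes the text has already been lemmatized and tagged (for example by Stanza), and the `WORDLIST` entries, their levels, and the `profile` function are hypothetical names invented for illustration.

```python
from collections import Counter

# Hypothetical miniature wordlist keyed by (lemma, part of speech).
# Entries and levels are illustrative, not taken from the CEFR-J wordlist.
WORDLIST = {
    ("book", "NOUN"): "A1",
    ("book", "VERB"): "B1",
    ("open", "VERB"): "A1",
    ("page", "NOUN"): "A1",
}

def profile(tagged_tokens, wordlist=WORDLIST):
    """Summarize (lemma, POS) pairs against a levelled wordlist.

    Returns the token count per level, the wordlist items that never
    occur, and the counts of items missing from the wordlist.
    """
    counts = Counter((lemma.lower(), pos) for lemma, pos in tagged_tokens)
    by_level = Counter()
    unlisted = Counter()
    for key, n in counts.items():
        if key in wordlist:
            by_level[wordlist[key]] += n
        else:
            unlisted[key] += n
    unused = sorted(key for key in wordlist if key not in counts)
    return by_level, unused, unlisted

# Tokens as (lemma, POS) pairs, assumed to come from an upstream tagger.
tokens = [
    ("open", "VERB"), ("book", "NOUN"), ("book", "VERB"),
    ("seat", "NOUN"), ("book", "NOUN"),
]
by_level, unused, unlisted = profile(tokens)
```

Keying the wordlist by the pair rather than by the headword alone is what lets "book" as a noun and "book" as a verb fall into different levels.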

Studies on the Relation between Allatectomy (picking out of corpora allata) and abnormal colouring pupa, Bombyx mori L.

  • 윤종관
    • 한국잠사곤충학회지
    • /
    • Vol. 20, No. 2
    • /
    • pp.15-19
    • /
    • 1978
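  • This study aimed to clarify the relationship between the hormones that govern pigmentation of the insect integument in general, and the characteristic coloration of the pupal body in particular, and the corpora allata hormone and prothoracic gland hormone involved in metamorphosis. In the 4th instar, treatment groups were set up at five time points at roughly 12-hour intervals, beginning 48 hours after feeding. In the 5th instar, treatment groups were set up at six time points at roughly 12-hour intervals, beginning 72 hours after the molt. For both the 4th and 5th instars, treatment groups covered the spring and autumn rearing seasons. From mounting until just before pupation, the corpora allata were removed at five time points at 7-hour intervals. The following results on pupal color change were obtained. 1) Removal early in the 4th instar produced mostly trimolters (three-molt silkworms), whereas removal late in the instar produced mostly tetramolters (four-molt silkworms). 2) Both trimolters and tetramolters showed abnormal pupal coloration relative to the control group, and the abnormality was more severe in tetramolters than in trimolters. 3) The control group was mostly normally colored, with some individuals showing mild abnormal coloration. 4) As shown in Table 1 and Table 2, mortality in the treatment groups was higher in the autumn rearing season. 5) Removal during the 5th instar, and from mounting until just before pupation, also had a recognizable effect on the coloration of allatectomized pupae. 6) In conclusion, the corpora allata hormone is closely related to pupal color change.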

Automatic Acquisition of Paraphrases Using Bilingual Dependency Relations

  • Hwang, Young-Sook;Kim, Young-Kil
    • ETRI Journal
    • /
    • Vol. 30, No. 1
    • /
    • pp.155-157
    • /
    • 2008
  • This letter introduces a new method to automatically acquire paraphrases using bilingual corpora. It utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language's sentence based on statistical alignment techniques. Since the proposed paraphrasing method can clearly disambiguate the sense of the original phrases using the bilingual context of dependency relations, it would be possible to obtain interchangeable paraphrases under a given context. Through experiments with parallel corpora of Korean and English language pairs, we demonstrate that our method effectively extracts paraphrases with high precision, achieving success rates of 94.3% and 84.6%, respectively, for Korean and English.

Robust Syntactic Annotation of Corpora and Memory-Based Parsing

  • Hinrichs, Erhard W.
    • 한국언어정보학회:학술대회논문집
    • /
    • 한국언어정보학회 2002: Language, Information, and Computation, Proceedings of the 16th Pacific Asia Conference
    • /
    • pp.1-1
    • /
    • 2002
  • This talk provides an overview of current work in my research group on the syntactic annotation of the Tübingen corpus of spoken German and of the German Reference Corpus (Deutsches Referenzkorpus: DEREKO) of written texts. Morpho-syntactic and syntactic annotation, as well as annotation of function-argument structure, is performed automatically for these corpora by a hybrid architecture that combines robust symbolic parsing with finite-state methods ("chunk parsing" in the sense of Abney) and memory-based parsing (in the sense of Daelemans). The resulting robust annotations can be used by theoretical linguists, who are interested in large-scale, empirical data, and by computational linguists, who are in need of training material for a wide range of language technology applications. To aid retrieval of annotated trees from the treebank, a query tool, VIQTORYA, with a graphical user interface and a logic-based query language has been developed. VIQTORYA allows users to query the treebanks for linguistic structures at the word level, at the level of individual phrases, and at the clausal level.

A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus

  • 박정열;차정원
    • 대한음성학회지:말소리
    • /
    • Vol. 68
    • /
    • pp.95-114
    • /
    • 2008
  • The growing popularity of statistical methods in machine translation requires much larger parallel corpora. Korean-English parallel corpora, however, are not yet sufficiently available, and little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. To build a Korean-English parallel corpus, we use bilingual newswire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We present and evaluate the experimental results. Alignment results from using a full translation model are very encouraging, especially when applied to an SMT system: improvements of 0.66% in BLEU score and 9.94% in NIST score over the previous method.
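The idea of combining sentence-length evidence with word-correspondence evidence can be sketched as a single weighted score. This is only an illustration under stated assumptions, not the authors' method: the function names, the equal weighting, the character-length ratio of 2.0 between Korean and English, and the two-entry lexicon are all hypothetical.

```python
def length_score(src, tgt, ratio=2.0):
    """Closeness of character lengths in [0, 1].

    `ratio` is an assumed expected target/source length ratio.
    """
    a, b = len(src) * ratio, len(tgt)
    longest = max(a, b)
    return min(a, b) / longest if longest > 0 else 1.0

def lexical_score(src, tgt, lexicon):
    """Fraction of lexicon words found in src whose translation occurs in tgt."""
    tgt_words = set(tgt.lower().replace(".", " ").split())
    # Substring match, because Korean particles attach to the noun.
    found = [word for word in lexicon if word in src]
    if not found:
        return 0.0
    hits = sum(1 for word in found if lexicon[word] in tgt_words)
    return hits / len(found)

def hybrid_score(src, tgt, lexicon, weight=0.5):
    """Weighted mix of length evidence and word-correspondence evidence."""
    return (weight * length_score(src, tgt)
            + (1 - weight) * lexical_score(src, tgt, lexicon))

# Hypothetical toy data for illustration only.
lexicon = {"고양이": "cat", "개": "dog"}
korean = ["고양이가 잔다.", "개가 짖는다."]
english = ["The cat sleeps.", "The dog barks."]

# For each Korean sentence, pick the index of the best-scoring English sentence.
best = [
    max(range(len(english)), key=lambda j: hybrid_score(k, english[j], lexicon))
    for k in korean
]
```

A full aligner would typically search over whole documents with dynamic programming and allow one-to-many matches; the sketch only shows how the two kinds of evidence can feed one score.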

Classifying Articles in Chinese Wikipedia with Fine-Grained Named Entity Types

  • Zhou, Jie;Li, Bicheng;Tang, Yongwang
    • Journal of Computing Science and Engineering
    • /
    • Vol. 8, No. 3
    • /
    • pp.137-148
    • /
    • 2014
  • Named entity classification of Wikipedia articles is a fundamental research area that can be used to automatically build large-scale corpora of named entity recognition or to support other entity processing, such as entity linking, as auxiliary tasks. This paper describes a method of classifying named entities in Chinese Wikipedia with fine-grained types. We considered multi-faceted information in Chinese Wikipedia to construct four feature sets, designed different feature selection methods for each feature, and fused different features with a vector space using different strategies. Experimental results show that the explored feature sets and their combination can effectively improve the performance of named entity classification.

Using Corpora for Studying English Grammar

  • Kwon, Heok-Seung
    • 한국영어학회지:영어학
    • /
    • Vol. 4, No. 1
    • /
    • pp.61-81
    • /
    • 2004
  • This paper will look at some grammatical phenomena which will illustrate some of the questions that can be addressed with a corpus-based approach. We will use this approach to investigate the following subjects in English grammar: number ambiguity, subject-verb concord, concord with measure expressions, and (reflexive) pronoun choice in coordinated noun phrases. We will emphasize the distinctive features of the corpus-based approach, particularly its strengths in investigating language use, as opposed to traditional descriptions or prescriptions of structure in English grammar. This paper will show that a corpus-based approach has made it possible to conduct new kinds of investigations into grammar in use and to expand the scope of earlier investigations. Native speakers rarely have accurate information about frequency of use. A large representative corpus (i.e., The British National Corpus) is one of the most reliable sources of frequency information. It is important to base an analysis of language on real data rather than intuition. Any description of grammar is more complete and accurate if it is based on a body of real data.
