• Title/Summary/Keyword: corpora

Usage analysis of vocabulary in Korean high school English textbooks using multiple corpora (코퍼스를 통한 고등학교 영어교과서의 어휘 분석)

  • Kim, Young-Mi;Suh, Jin-Hee
    • English Language & Literature Teaching / v.12 no.4 / pp.139-157 / 2006
  • As the Communicative Approach has become the norm in foreign language teaching, the objectives of teaching English in Korean schools have changed radically. The focus of high school English textbooks has shifted from mere mastery of structures to communicative proficiency. This paper studies five polysemous words that appear in twelve high school English textbooks used in Korea. The twelve textbooks are combined into a single corpus and analyzed to classify the usage of the selected words. The usage of each word is then compared with that in three other corpus-based sources: the BNC (British National Corpus) Sampler, ICE Singapore (the International Corpus of English for Singapore), and the Collins COBUILD learner's dictionary, which is based on the corpus "The Bank of English". The comparisons demonstrate that Korean textbooks do not always supply the full range of meanings of polysemous words.

A Study on the Markup Scheme for Building the Corpora of Korean Culinary Manuscripts (한글 필사본 음식조리서 말뭉치 구축을 위한 마크업 방안 연구)

  • An, Ui-Jeong;Park, Jin-Yang;Nam, Gil-Im
    • Language and Information / v.12 no.2 / pp.95-114 / 2008
  • This study aims to establish a markup system for 17th- to 19th-century culinary manuscripts. To this end, Section 2 examines various theoretical considerations in encoding large-scale historical corpora. Section 3 identifies and analyzes the characteristics of the textual theme and structure of our source text. Section 4 proposes a markup scheme based on the XML standard for bibliographical and structural markup of the corpus as well as for grammatical annotations. We show that an XML-based markup system is highly desirable because it is powerful and flexible in its expressiveness and is scalable. The markup scheme we suggest is a modified and extended version of TEI-P5 that accommodates the textual and linguistic characteristics of premodern Korean culinary manuscripts.

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo;Liu, Juan;Zhu, Huili
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.285-292 / 2007
  • The parallel corpus is an important resource in data-driven natural language processing, but only a few parallel corpora are publicly available, mostly because of the heavy labor needed to construct this kind of resource. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system we develop first downloads the web pages from certain hosts. Candidate parallel page pairs are then prepared from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are first extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show that the system performs satisfactorily.

Semi-Automatic Annotation Tool to Build Large Dependency Tree-Tagged Corpus

  • Park, Eun-Jin;Kim, Jae-Hoon;Kim, Chang-Hyun;Kim, Young-Kill
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.385-393 / 2007
  • Corpora annotated with rich linguistic information are required to develop robust, statistical natural language processing systems. Building such corpora, however, is expensive, labor-intensive, and time-consuming. To support this work, we design and implement an annotation tool for building a Korean dependency tree-tagged corpus. Compared with other annotation tools, our tool is characterized by the following features: independence of applications, localization of errors, powerful error checking, instant sharing of annotated information, and user-friendliness. Using our tool, we have annotated 100,904 Korean sentences with dependency structures. There were 33 annotators, the average annotation time was about 4 minutes per sentence, and the annotation took 5 months in total. We are confident that the tool yields accurate and consistent annotations while reducing labor and time.

An analysis of terminological definitions (전문용어의 정의문 분석)

  • Lee Hae-Yun
    • Koreanishche Zeitschrift fur Deutsche Sprachwissenschaft / v.7 / pp.145-163 / 2003
  • In this paper, we examine various definitions of the terminological definition, with a view to extracting terminological information from corpora. After reviewing research in lexicography and terminology, we introduce the qualia structure of the Generative Lexicon (Pustejovsky 1995) for analyzing terminological definitions. Using the qualia structure, we analyze the definitions presented in terminological dictionaries. As a result, we confirm that terminological definitions can be decomposed into 4 subtypes of qualia structure. Based on this examination, we analyze terminological definitions in newspaper articles and show the usefulness of the qualia structure for extracting terminological definitions from corpora.

A Study on Korean Intonation Using Momel (Momel을 이용한 한국어의 억양 연구)

  • Kim, Sun-Hee;Yoo, Hyun-Ji;Hong, Hye-Jin;Lee, Ho-Young
    • MALSORI / no.63 / pp.85-100 / 2007
  • This paper proposes a way to extract intonation patterns using Momel, a pitch stylization algorithm, and presents the results of analyzing speech corpora in comparison with those of earlier research. Two speech corpora are used: one is the set of sound files obtained from the K-ToBI web site, and the other consists of 80 passages pronounced by 4 speakers (2 male and 2 female). The results show that Momel provides significant pitch targets that can be labeled as H and L tones within prosodic units such as the Accentual Phrase (AP) and the Intonation Phrase (IP). The resulting AP patterns and IP boundary tone patterns correspond to those in earlier research. Thus, this study will contribute to the study of intonation as well as to the development of automatic intonation labeling systems.

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems / v.8 no.2 / pp.301-314 / 2012
  • This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on the Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the MSVM and the HMM, and another set of data was used for testing. The MSVM and HMM outperform the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.

Creation of Speech Corpora for STiLL at SiTEC (SiTEC의 STiLL관련 음성 코퍼스의 구축 현황)

  • Kim, Young-Il;Kim, Bong-Wan;Choi, Dae-Lim;Lee, Kwang-Hyun;Jeong, Eun-Soon;Lee, Yong-Ju
    • Proceedings of the KSPS conference / 2005.11a / pp.13-16 / 2005
  • As language learning that utilizes speech and information processing technology is becoming popular, the Speech Information Technology & Promotion Center (SiTEC) has created and is distributing speech corpora for STiLL in order to support basic research and product development. We introduce the Korean corpus and the English corpora that we have created and are distributing.

The Statistical Relationship between Linguistic Items and Corpus Size (코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로)

  • 양경숙;박병선
    • Language and Information / v.7 no.2 / pp.103-115 / 2003
  • In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large a corpus should be, especially for Korean corpora. In this study, we propose a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. Finally, we illustrate that the ARIMA model presented is valid, accurate, and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.

A Rule-Based Analysis from Raw Korean Text to Morphologically Annotated Corpora

  • Lee, Ki-Yong;Markus Schulze
    • Language and Information / v.6 no.2 / pp.105-128 / 2002
  • Morphologically annotated corpora are the basis for many tasks in computational linguistics. Most current approaches use statistically driven methods of morphological analysis, which provide only POS tags. While this is sufficient for some applications, many others need a rule-based full morphological analysis that also yields lemmatization and segmentation. This work thus aims at (1) introducing a rule-based Korean morphological analyzer called Kormoran, based on the principle of linearity, which prohibits any combination of left-to-right and right-to-left analysis, or backtracking, and (2) showing how it can be used as a POS tagger by adopting an ordinary preprocessing technique and by filtering out irrelevant morpho-syntactic information in the analyzed feature structures. It is shown that, besides providing a basis for subsequent syntactic or semantic processing, full morphological analyzers like Kormoran have greater power to resolve ambiguities than simple POS taggers. The focus of our present analysis is on Korean text.
