• 제목/요약/키워드: Corpus statistics

검색결과 39건 처리시간 0.033초

초등학교 교과서의 어휘 통계 분석 연구 : 한국어 세종 코퍼스와의 비교를 중심으로 (The Study Of Lexical Statistics Analysis For Elementary School Textbook : Focusing On Comparing The SEJONG Corpus In Korean)

  • 유원희;임희석
    • 컴퓨터교육학회논문지
    • /
    • 제18권1호
    • /
    • pp.99-108
    • /
    • 2015
  • 본 논문에서는 초등학교 교과서 말뭉치를 구축하고, 초등교과서에서 나타나는 어휘들에 대하여 통계분석을 실시하였다. 또한 초등 교과서가 일반생활에서 사용하는 어휘와 얼마나 유사한지를 살펴보기 위하여 스피어만 상관관계 분석을 실시하였다. 연구결과로 초등교과서의 말뭉치 구축 모습과 실제 예시를 보였고, 상관관계 분석을 통하여 초등교과서와 일반 말뭉치와의 상관관계를 수치적으로 보였다.

A Study on the Diachronic Evolution of Ancient Chinese Vocabulary Based on a Large-Scale Rough Annotated Corpus

  • Yuan, Yiguo;Li, Bin
    • 아시아태평양코퍼스연구
    • /
    • 제2권2호
    • /
    • pp.31-41
    • /
    • 2021
  • This paper makes a quantitative analysis of the diachronic evolution of ancient Chinese vocabulary by constructing and counting a large-scale rough annotated corpus. The texts from Si Ku Quan Shu (a collection of Chinese ancient books) are automatically segmented to obtain ancient Chinese vocabulary with time information, which is used to the statistics on word frequency, standardized type/token ratio and proportion of monosyllabic words and dissyllabic words. Through data analysis, this study has the following four findings. Firstly, the high-frequency words in ancient Chinese are stable to a certain extent. Secondly, there is no obvious dissyllabic trend in ancient Chinese vocabulary. Moreover, the Northern and Southern Dynasties (420-589 AD) and Yuan Dynasty (1271-1368 AD) are probably the two periods with the most abundant vocabulary in ancient Chinese. Finally, the unique words with high frequency in each dynasty are mainly official titles with real power. These findings break away from qualitative methods used in traditional researches on Chinese language history and instead uses quantitative methods to draw macroscopic conclusions from large-scale corpus.

Enhancement of a language model using two separate corpora of distinct characteristics

  • 조세형;정태선
    • 한국지능시스템학회논문지
    • /
    • 제14권3호
    • /
    • pp.357-362
    • /
    • 2004
  • 언어 모델은 음성 인식이나 필기체 문자 인식 등에서 다음 단어를 예측함으로써 인식률을 높이게 된다. 그러나 언어 모델은 그 도메인에 따라 모두 다르며 충분한 분량의 말뭉치를 수집하는 것이 거의 불가능하다. 본 논문에서는 N그램 방식의 언어모델을 구축함에 있어서 크기가 제한적인 말뭉치의 한계를 극복하기 위하여 두개의 말뭉치, 즉 소규모의 구어체 말뭉치와 대규모의 문어체 말뭉치의 통계를 이용하는 방법을 제시한다. 이 이론을 검증하기 위하여 수십만 단어 규모의 방송용 말뭉치에 수백만 이상의 신문 말뭉치를 결합하여 방송 스크립트에 대한 퍼플렉시티를 30% 향상시킨 결과를 획득하였다.

웹 번역문서 판별과 병렬 말뭉치 구축 (Judging Translated Web Document & Constructing Bilingual Corpus)

  • Jee-hyung, Kim;Yill-byung, Lee
    • 한국정보과학회:학술대회논문집
    • /
    • 한국정보과학회 2004년도 가을 학술발표논문집 Vol.31 No.2 (1)
    • /
    • pp.787-789
    • /
    • 2004
  • People frequently feel the need of a general searching tool that frees from language barrier when they find information through the internet. Therefore, it is necessary to have a multilingual parallel corpus to search with a word that includes a search keyword and has a corresponding word in another language, Multilingual parallel corpus can be built and reused effectively through the several processes which are judgment of the web documents, sentence alignment and word alignment. To build a multilingual parallel corpus, multi-lingual dictionary should be constructed in each language and HTML should be simplified. And by understanding the meaning and the statistics of document structure, judgment on translated web documents will be made and the searched web pages will be aligned in sentence unit.

  • PDF

발음사전 표제어중의 음소의 통계적 성질-음성 DB용 단어선정을 위하여- (On the statistics of Korean Phonetic Dictionary - Basic Survey to make corpus of Korean Speech DB -)

  • 이용주;김경태;조철우;이태원
    • 대한전기학회:학술대회논문집
    • /
    • 대한전기학회 1987년도 전기.전자공학 학술대회 논문집(II)
    • /
    • pp.1606-1609
    • /
    • 1987
  • Statistical information about spoken Korean was obtained. The data are the results of analyzing the Korean phonetic dictionary. This is one of the basic survey to make phoneme ballanced corpus of Korean Speech Data Base (KSDB).

  • PDF

품사태킹을 위한 어휘문맥 의존규칙의 말뭉치기반 중의성주도 학습 (Corpus-Based Ambiguity-Driven Learning of Context- Dependent Lexical Rules for Part-of-Speech Tagging)

  • 이상주;류원호;김진동;임해창
    • 한국정보과학회논문지:소프트웨어및응용
    • /
    • 제26권1호
    • /
    • pp.178-178
    • /
    • 1999
  • Most stochastic taggers can not resolve some morphological ambiguities that can be resolved only by referring to lexical contexts because they use only contextual probabilities based ontag n-grams and lexical probabilities. Existing lexical rules are effective for resolving such ambiguitiesbecause they can refer to lexical contexts. However, they have two limitations. One is that humanexperts tend to make erroneous rules because they are deterministic rules. Another is that it is hardand time-consuming to acquire rules because they should be manually acquired. In this paper, wepropose context-dependent lexical rules, which are lexical rules based on the statistics of a taggedcorpus, and an ambiguity-driven teaming method, which is the method of automatically acquiring theproposed rules from a tagged corpus. By using the proposed rules, the proposed tagger can partiallyannotate an unseen corpus with high accuracy because it is a kind of memorizing tagger that canannotate a training corpus with 100% accuracy. So, the proposed tagger is useful to improve theaccuracy of a stochastic tagger. And also, it is effectively used for detecting and correcting taggingerrors in a manually tagged corpus. Moreover, the experimental results show that the proposed methodis also effective for English part-of-speech tagging.

The fundamental frequency (f0) distribution of American speakers in a spontaneous speech corpus

  • Byunggon Yang
    • 말소리와 음성과학
    • /
    • 제16권1호
    • /
    • pp.11-16
    • /
    • 2024
  • The fundamental frequency (f0), representing an acoustic measure of vocal fold vibration, serves as an indicator of the speaker's emotional state and language-specific pattern in daily conversations. This study aimed to examine the f0 distribution in an English corpus of spontaneous speech, establishing normative data for American speakers. The corpus involved 40 participants engaging in free discussions on daily activities and personal viewpoints. Using Praat, f0 values were collected filtering outliers after removing nonspeech sounds and interviewer voices. Statistical analyses were performed with R. Results indicated a median f0 value of 145 Hz for all the speakers. The f0 values for all speakers exhibited a right-skewed, pointy distribution within a frequency range of 216 Hz from 75 Hz to 339 Hz. The female f0 range was wider than that of males, with a median of 113 Hz for males and 181 Hz for females. This spontaneous speech corpus provides valuable insights for linguists into f0 variation among individuals or groups in a language. Further research is encouraged to develop analytical and statistical measures for establishing reliable f0 standards for the general population.

Data Mining Research on Maehwado Painting Poetry in the Early Joseon Dynasty

  • Haeyoung Park;Younghoon An
    • Journal of Information Processing Systems
    • /
    • 제19권4호
    • /
    • pp.474-482
    • /
    • 2023
  • Data mining is a technique for extracting valuable information from vast amounts of data by analyzing statistical and mathematical operations, rules, and relationships. In this study, we employed data mining technology to analyze the data concerning the painting poetry of Maehwado (plum blossom paintings) from the early Joseon Dynasty. The data was extracted from the Hanguk Munjip Chonggan (Korean Literary Collections in Classical Chinese) in the Hanguk Gojeon Jonghap database (Korea Classics DB). Using computer information processing techniques, we carried out web scraping and classification of the painting poetry from the Hanguk Munjip Chonggan. Subsequently, we narrowed down our focus to the painting poetry specifically related to Maehwado in the early Joseon Dynasty. Based on this, refined dataset, we conducted an in-depth analysis and interpretation of the text data at the syllable corpus level. As a result, we found a direct correlation between the corpus statistics for each syllable in Maehwado painting poetry and the symbolic meaning of plum blossoms.

Ambiguity Resolution in Chinese Word Segmentation

  • Maosong, Sun;T'sou, Benjamin-K.
    • 한국언어정보학회:학술대회논문집
    • /
    • 한국언어정보학회 1995년도 Language, Information and Computation = Proceedings of the 10th Pacific Asia Conference, Hong Kong
    • /
    • pp.121-126
    • /
    • 1995
  • A new method for Chinese word segmentation named Conditional F'||'&'||'BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (ie., mutual infonllation and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper The key characteristics of this model are the use of: (i) statistics which can be automatically derived from any raw corpus, (ii) a rule base for disambiguation with consistency and controlled size to be built up in a systematic way.

  • PDF

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • 말소리와 음성과학
    • /
    • 제13권3호
    • /
    • pp.31-37
    • /
    • 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a pointy distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize acoustic properties of f0 variability in a language by an individual or groups. Further studies would be desirable of the use of statistical measures to secure reliable f0 values of individual speakers.