The Statistical Relationship between Linguistic Items and Corpus Size

Language and Information (한국언어정보학회지:언어와정보)

Volume 7, Issue 2 / Pages 103-115 / 2003 / pISSN 1226-7430

Korean Society for Language and Information (한국언어정보학회)

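Korean title (translated): A Study of an Appropriate Statistical Model for Using Corpus Frequency Information: Focusing on the Functional Relationship between Types and Tokens According to Corpus Size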

양경숙 (Korea University);
박병선 (Korea University)

Published : 2003.12.01

Abstract

In recent years, many organizations have been constructing their own large corpora in pursuit of corpus representativeness. However, there is no reliable guideline for how large a corpus should be, especially for Korean corpora. In this study, we propose a new statistical model based on ARIMA (Autoregressive Integrated Moving Average) for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. We then show that the proposed ARIMA model is valid, accurate, and reliable. We believe this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.
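
To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' model or data. It counts the number of types observed as the number of tokens grows, then takes first differences of that series, which is the kind of differencing the "Integrated" part of ARIMA refers to. The function names `type_token_curve` and `first_difference`, the toy sentence, and the sampling step of 4 tokens are all assumptions made for illustration.

```python
from typing import List, Tuple


def type_token_curve(tokens: List[str], step: int) -> List[Tuple[int, int]]:
    """Return (token_count, type_count) pairs sampled every `step` tokens.

    Hypothetical helper: a type is a distinct token string, and the final
    token position is always sampled so the curve covers the whole corpus.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def first_difference(values: List[int]) -> List[int]:
    """Differences between consecutive values (differencing with d = 1)."""
    return [b - a for a, b in zip(values, values[1:])]


# Toy corpus of 16 tokens, assumed purely for illustration.
toy_tokens = "the cat sat on the mat and the dog sat on the mat and the cat".split()

curve = type_token_curve(toy_tokens, step=4)
print(curve)  # [(4, 4), (8, 6), (12, 7), (16, 7)]

# New types gained in each successive block of 4 tokens.
new_types_per_step = first_difference([types for _, types in curve])
print(new_types_per_step)  # [2, 1, 0]
```

Even in this toy case the number of new types per block shrinks as the corpus grows. A time-series model such as ARIMA would be fitted to a much longer series of this kind, sampled from a real corpus, to forecast type counts at larger corpus sizes.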
