• Title/Summary/Keyword: corpus size


The Statistical Relationship between Linguistic Items and Corpus Size (코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로)

  • 양경숙;박병선
    • Language and Information
    • /
    • v.7 no.2
    • /
    • pp.103-115
    • /
    • 2003
  • In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large corpus resources should be, especially for Korean corpora. In this study, we propose a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. Finally, we show that the ARIMA model presented is valid, accurate, and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness, and linguistic comprehensiveness.
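
    The type/token relationship described in this abstract can be illustrated with a minimal sketch. It is not the authors' ARIMA model; it only builds the kind of series (number of types observed as the number of tokens grows) that such a model would be fitted to. The function name `type_token_curve`, the `step` parameter, and the sample sentence are hypothetical.

    ```python
    def type_token_curve(tokens, step):
        """Return (token_count, type_count) pairs sampled every `step` tokens.

        `tokens` is a list of word tokens; a "type" is a distinct token.
        The final point is always included, even if it is not on a step boundary.
        """
        seen = set()
        curve = []
        for i, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if i % step == 0 or i == len(tokens):
                curve.append((i, len(seen)))
        return curve


    # Hypothetical toy corpus, used only to show the shape of the output.
    sample = "the cat sat on the mat and the dog sat on the log".split()
    curve = type_token_curve(sample, step=4)
    print(curve)  # [(4, 4), (8, 6), (12, 7), (13, 8)]
    ```

    On a real corpus the number of types grows more slowly than the number of tokens, which is why a predictive model of the curve is needed to estimate how large a corpus must be.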


Statistical Analysis Between Size and Balance of Text Corpus by Evaluation of the effect of Interview Sentence in Language Modeling (언어모델 인터뷰 영향 평가를 통한 텍스트 균형 및 사이즈간의 통계 분석)

  • Jung Eui-Jung;Lee Youngjik
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • spring
    • /
    • pp.87-90
    • /
    • 2002
  • This paper statistically analyzes the relationship between the size and balance of a text corpus by evaluating the effect of interview sentences on the language model of a Korean broadcast news transcription system. The ultimate purpose of our system is to recognize not interview speech but the speech of anchors and reporters in broadcast news shows. However, interview sentences make up approximately 15% of the text corpus gathered for constructing the language model. The characteristics of interview sentences differ from those of anchor and reporter speech in several respects, so they disturb anchor- and reporter-oriented language modeling. In this paper, we evaluate the effect of interview sentences on the language model and statistically analyze the relationship between corpus size and balance by repeating the same experimental procedure while varying the size of the corpus.


An Algorithm for Predicting the Relationship between Lemmas and Corpus Size

  • Yang, Dan-Hee;Gomez, Pascual Cantos;Song, Man-Suk
    • ETRI Journal
    • /
    • v.22 no.2
    • /
    • pp.20-31
    • /
    • 2000
  • Much research on natural language processing (NLP), computational linguistics and lexicography has relied on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline as to how large machine-readable corpus resources should be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed some new light on this issue, we reveal the flaws of several previous studies aiming to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we show experimentally that the algorithm presented is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness and linguistic comprehensiveness.


Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Part-of-Speech Tagged Corpus (품사 부착 말뭉치를 이용한 임베디드용 연속음성인식의 어휘 적용률 개선)

  • Lim, Min-Kyu;Kim, Kwang-Ho;Kim, Ji-Hwan
    • MALSORI
    • /
    • no.67
    • /
    • pp.181-193
    • /
    • 2008
  • In this paper, we propose a vocabulary coverage improvement method for embedded continuous speech recognition (CSR) using a part-of-speech (POS) tagged corpus. We investigate the 152 POS tags defined in the Lancaster-Oslo-Bergen (LOB) corpus and word-POS tag pairs, and derive a new vocabulary through word addition. Words paired with some POS tags have to be included in vocabularies of any size, whereas the inclusion of words paired with other POS tags depends on the target vocabulary size. The 152 POS tags are categorized according to whether word addition depends on the size of the vocabulary. Using expert knowledge, we first classify the POS tags and then apply different methods of word addition based on the POS tags paired with the words. The performance of the proposed method is measured in terms of coverage and compared with that of vocabularies of the same size (5,000 words) derived from frequency lists. The proposed method achieves a coverage of 95.18% on the test short message service (SMS) text corpus, while the conventional vocabularies cover only 93.19% and 91.82% of the words appearing in the same SMS text corpus.
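
    As a rough illustration of the coverage measure reported above, the sketch below assumes that coverage is the percentage of running words (tokens) in a test corpus that appear in the vocabulary. The abstract does not spell out the exact definition, so this is an assumption, and the function name `coverage` and the toy data are hypothetical.

    ```python
    def coverage(vocabulary, test_tokens):
        """Percentage of tokens in `test_tokens` that are found in `vocabulary`.

        Assumes token-level coverage; returns 0.0 for an empty test corpus.
        """
        vocab = set(vocabulary)
        if not test_tokens:
            return 0.0
        covered = sum(1 for tok in test_tokens if tok in vocab)
        return 100.0 * covered / len(test_tokens)


    # Hypothetical toy data: three of the four test tokens are in the vocabulary.
    vocab = ["see", "you", "at"]
    tokens = "see you at lunch".split()
    print(coverage(vocab, tokens))  # 75.0
    ```

    Under this definition, two vocabularies of the same size can be compared directly by computing `coverage` for each against the same test corpus.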


Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young;Chang, Juno;Kihl, Taesuk
    • Journal of information and communication convergence engineering
    • /
    • v.11 no.4
    • /
    • pp.268-273
    • /
    • 2013
  • In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.

The significance of corpus callosal size in the estimation of neurologically abnormal infants (신경학적인 결함이 있었던 영아의 예후 판단에서 뇌량 크기의 중요성)

  • Yu, Seung Taek;Lee, Chang Woo
    • Clinical and Experimental Pediatrics
    • /
    • v.51 no.11
    • /
    • pp.1205-1210
    • /
    • 2008
  • Purpose: The development of the corpus callosum occupies the entire period of cerebral formation. The myelination pattern on magnetic resonance imaging (MRI) is very useful for evaluating neurologic development and predicting neurologic outcome in high-risk infants. The thickness of the corpus callosum is believed to depend on the myelination process. Because the length and thickness of the corpus callosum can be calculated on MRI, its development can be evaluated quantitatively. We investigated the clinical significance of measuring various portions of the corpus callosum in neonates with neurologic disorders such as hypoxic brain damage and seizure disorder. Methods: Forty-two neonates were evaluated by brain MRI. We measured the size of the genu, body, transitional zone, and splenium, and the length of the corpus callosum. Each measurement was divided by the total length of the corpus callosum to obtain its corrected size. The ratio of corpus callosal length to the anteroposterior diameter of the brain was also measured. Results: There was no statistical significance in the sample size of each part of the corpus callosum. However, the corrected size or the ratio of the body of the corpus callosum correlated with periventricular leukomalacia and hypoxic ischemic encephalopathy. Conclusion: Abnormal size of the corpus callosum showed a good correlation with periventricular leukomalacia and hypoxic ischemic encephalopathy in neonates. Clinical neurological problems can be predicted by assessing the corpus callosum in the neonatal period.

Morphological Observations of Ovaries in Relation to Infertility in Slaughtered Cows in Kyungnam Province 1. Appearance of follicles and corpus luteums in cow ovaries (경남지방의 도태우에 불임과 관련된 난소의 형태학적 관찰 1. 난포와 황체의 출현에 대하여)

  • 양재훈;표병민;서득록;고필옥;강정부;김종섭;곽수동
    • Journal of Veterinary Clinics
    • /
    • v.19 no.2
    • /
    • pp.147-152
    • /
    • 2002
  • Ovaries from a total of 192 slaughtered cows (154 Korean native cows and 38 dairy cows) were collected during the slaughtering process at the Kimhae, Changyoung and Yangsan abattoirs in Kyungnam province from January 2001 to January 2002. Pregnancy rates and ovarian findings were investigated. Of the 192 slaughtered cows, 12.5% (24 cows) were pregnant; by breed, 11.0% (17 cows) of the 154 Korean native cows and 18.4% (7 cows) of the 38 dairy cows were pregnant. Fetuses in pregnant Korean native cows were mostly less than 4 months old, and fetuses in dairy cows were mostly about 7-8 months old. Cows in which the follicles and corpora lutea in the same cow were each more than 5-6 mm in diameter accounted for 69.8% (134 cows) of the 192 slaughtered cows and, by breed, 64.7% (11 cows) of 17 Korean native cows and 57.1% (4 cows) of 7 dairy cows. The mean diameters of follicles and corpora lutea in Korean native cows were $13.7 \pm 5.6 \times 11.2 \pm 4.6$ mm and $17.5 \pm 4.6 \times 14.6 \pm 4.0$ mm in non-pregnant cows, and $11.0 \pm 4.8 \times 9.1 \pm 2.6$ mm and $21.2 \pm 2.9 \times 18.3 \pm 2.7$ mm in pregnant cows, respectively. The mean diameters of follicles and corpora lutea in dairy cows were $15.8 \pm 7.1 \times 14.3 \pm 6.0$ mm and $20.3 \pm 5.9 \times 16.9 \pm 5.8$ mm in non-pregnant cows, and $10.1 \pm 3.0 \times 9.2 \pm 2.3$ mm and $23.0 \pm 1.7 \times 20.1 \pm 1.3$ mm in pregnant cows, respectively. These findings indicate that the co-appearance rate of follicles and corpora lutea in the same cow is high in both pregnant and non-pregnant cows. Comparing the ovaries of pregnant and non-pregnant cows, the mean follicle size was smaller and the corpus luteum size was larger in pregnant cows than in non-pregnant cows. Follicle size (Y) and corpus luteum size (X) in the same cow were inversely related. The regression equations were $Y = -0.2022X + 17.175$ in Korean native cows and $Y = -0.5754X + 24.153$ in dairy cows.
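
    The two regression equations at the end of this abstract can be evaluated with a short sketch. The coefficients are taken from the abstract; the function name `predicted_follicle_size`, the breed keys, and the example input of 20 are hypothetical, and the unit is assumed to be millimetres because the abstract reports diameters in mm.

    ```python
    # (slope, intercept) pairs from the abstract: Y = slope * X + intercept,
    # where X is corpus luteum size and Y is follicle size (assumed to be in mm).
    REGRESSION = {
        "korean_native": (-0.2022, 17.175),
        "dairy": (-0.5754, 24.153),
    }


    def predicted_follicle_size(breed, corpus_luteum_size):
        """Evaluate the reported regression line for the given breed."""
        slope, intercept = REGRESSION[breed]
        return slope * corpus_luteum_size + intercept


    # Hypothetical input: a corpus luteum size of 20.
    for breed in REGRESSION:
        print(breed, round(predicted_follicle_size(breed, 20.0), 3))
    ```

    Both slopes are negative, which matches the inverse relationship stated in the abstract: a larger corpus luteum goes with a smaller predicted follicle.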

Morphologic Assessment of Corpus Callosum in the Patient of Alzheimer Disease using Magnetic Resonance Imaging

  • Seoung, Youl-Hun;Choe, Bo-Young
    • Journal of the Korean Magnetic Resonance Society
    • /
    • v.13 no.2
    • /
    • pp.84-95
    • /
    • 2009
  • The purpose of this study was to evaluate the usefulness of measuring corpus callosum (CC) size in Alzheimer patients using magnetic resonance (MR) midsagittal images. We performed MR scanning in three groups of 20 subjects each: a normal elderly group, a mild cognitive impairment (MCI) group, and an Alzheimer disease (AD) group. The following parameters were employed in the AD group: TR/TE/FA 6650 ms/66 ms/$90^{\circ}$, NEX 2, Thickness/Gap 2/0, FOV 220 mm. The magnetic field strength was 3.0 Tesla. We selected the midsagittal image of the brain using the view forum program and measured CC size, namely the anteroposterior length and the diameters of the genu, body, narrowing portion, and splenium. The present study demonstrates that CC size in Alzheimer disease can be useful for clinical assessment with respect to the diameters of the genu, body, and splenium.

Relationship between corpus luteum size as determined by ultrasonography and milk progesterone concentration during the estrous cycle in dairy cows (젖소에서 발정주기중 초음파 진단장치로 측정된 황체의 크기와 progesterone 농도와의 관계)

  • Son, Chang-ho;Kang, Byong-kyu;Choi, Han-sun
    • Korean Journal of Veterinary Research
    • /
    • v.35 no.4
    • /
    • pp.833-841
    • /
    • 1995
  • Ultrasonography was used to measure the corpus luteum area in order to determine the relationship between corpus luteum size and milk progesterone concentration during the estrous cycle in 16 dairy cows. Cows were classified retrospectively into those that had a corpus luteum with (n=9) and without (n=7) a cavity. Ultrasound examination and collection of milk samples for progesterone assay were performed at 2-day intervals from Day 0 to Day 12, and then daily from Day 14 to the day of the next ovulation. The means for corpus luteum area and for milk progesterone concentration during the estrous cycle were not significantly different between cows that had a corpus luteum with and without a cavity. The correlation coefficients between corpus luteum area and milk progesterone concentration during luteal development (Days 2 to 8) were 0.71 (p<0.0001) and 0.74 (p<0.0001) for corpora lutea with and without a cavity, respectively, and during luteal regression (Days -6 to 0 relative to the next ovulation) were 0.73 (p<0.0001) and 0.76 (p<0.0001), respectively. The correlation coefficient combined for both stages of the estrous cycle and both luteal statuses was 0.70 (p<0.0001). These results indicate that corpus luteum area is significantly correlated with milk progesterone concentration, and that ultrasonographic assessment of the corpus luteum is a reliable method for estimating peripheral progesterone concentrations during the estrous cycle in cows.


Development of Differential Diagnosis and Treatment Method of Reproductive Disorders Using Ultrasonography in Cows III. Differential Diagnosis between Developing and Regressing Corpus Luteum (초음파검사에 의한 소의 번식장애 감별진단 및 치료법 개발 III. 발육황체와 퇴행황체의 감별)

  • 손창호;강병규;최한선;임원호;강현구;오기석;신종봉;서국현
    • Journal of Veterinary Clinics
    • /
    • v.16 no.1
    • /
    • pp.118-127
    • /
    • 1999
  • The aim of this study was to establish a method of differential diagnosis between developing and regressing corpora lutea in cows. Plasma progesterone (P$_4$) concentrations were determined by radioimmunoassay in slaughtered, cycling and pregnant cows. Ultrasonography was used to measure corpus luteum size and histogram values in order to determine the correlations between corpus luteum area or histogram values and plasma P$_4$ concentrations. The corpora lutea of 196 slaughtered cows were monitored in vitro (water-bath scanning) by ultrasonography with a 7.5 MHz linear-array transducer. The correlation coefficient between corpus luteum area and plasma P$_4$ concentrations was 0.46 (p<0.01), and that between histogram values and plasma P$_4$ concentrations was -0.44 (p<0.01). The corpora lutea of 188 cycling and 30 pregnant cows were monitored by ultrasonography with a 5.0 MHz linear-array transrectal transducer. The corpus luteum areas and plasma P$_4$ concentrations were significantly different between regressing and other corpora lutea (p<0.01), and histogram values were also significantly different between regressing and developing corpora lutea (p<0.01). The correlation coefficients between corpus luteum areas and plasma P$_4$ concentrations were 0.76 (p<0.01), 0.71 (p<0.01), 0.65 (p<0.05) and 0.68 (p<0.05), and those between histogram values and plasma P$_4$ concentrations were 0.74 (p<0.05), 0.71 (p<0.01), -0.52 (p<0.05) and 0.65 (p<0.05) in developing, functional, regressing and pregnant corpora lutea, respectively. These results indicate that corpus luteum areas and plasma P$_4$ concentrations were highly correlated in all stages of the corpus luteum. The histogram values and plasma P$_4$ concentrations were positively correlated in developing, functional and pregnant corpora lutea, but negatively correlated in regressing corpora lutea. Therefore, the measurement of corpus luteum area and histogram value by ultrasonography is a reliable method for the assessment of luteal function, especially for developing and regressing corpora lutea.
