• Title/Abstract/Keywords: corpus size

117 search results (processing time: 0.026 s)

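The Statistical Relationship between Linguistic Items and Corpus Size (Korean title: A Study of an Appropriate Statistical Model for Using Corpus Frequency Information, Centered on the Functional Relationship between Types and Tokens by Corpus Size)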

  • 양경숙;박병선
    • Journal of the Korean Society for Language and Information: Language and Information (한국언어정보학회지:언어와정보) / Vol. 7, No. 2 / pp. 103-115 / 2003
  • In recent years, many organizations have been building large corpora of their own to achieve corpus representativeness. However, there is no reliable guideline on how large a corpus should be, especially for Korean corpora. In this study, we have devised a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous studies on this issue. Finally, we show that the ARIMA model presented is valid, accurate and very reliable. We are confident that this study can help solve some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness and linguistic comprehensiveness.

Statistical Analysis Between Size and Balance of Text Corpus by Evaluation of the Effect of Interview Sentence in Language Modeling

  • 정의정;이영직
    • Acoustical Society of Korea: Conference Proceedings (한국음향학회:학술대회논문집) / Proceedings of the Acoustical Society of Korea 2002 Summer Conference, Vol. 21, No. 1 / pp. 87-90 / 2002
  • This paper statistically analyzes the relationship between the size and balance of a text corpus by evaluating the effect of interview sentences on the language model of a Korean broadcast news transcription system. The ultimate purpose of our system is to recognize not interview speech but the anchor's and reporters' speech in broadcast news shows. However, interview sentences make up approximately 15% of the text corpus gathered for constructing the language model. Interview sentences differ in character from the anchor's and reporters' sentences, and therefore disturb anchor- and reporter-oriented language modeling. In this paper, we evaluate the effect of interview sentences on the language model and statistically analyze the relationship between corpus size and balance by repeating the same experimental procedure while varying the size of the corpus.

An Algorithm for Predicting the Relationship between Lemmas and Corpus Size

  • Yang, Dan-Hee;Gomez, Pascual Cantos;Song, Man-Suk
    • ETRI Journal / Vol. 22, No. 2 / pp. 20-31 / 2000
  • Much research on natural language processing (NLP), computational linguistics and lexicography has relied on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline on how large machine-readable corpus resources should be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed some new light on this issue, we reveal the flaws of several previous studies that aimed to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we show experimentally that the algorithm presented is valid, accurate and very reliable. We are confident that this study can help solve some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness and linguistic comprehensiveness.

Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Part-of-Speech Tagged Corpus

  • 임민규;김광호;김지환
    • Journal of the Korean Society of Phonetic Sciences: Malsori (대한음성학회지:말소리) / No. 67 / pp. 181-193 / 2008
  • In this paper, we propose a vocabulary coverage improvement method for embedded continuous speech recognition (CSR) using a part-of-speech (POS) tagged corpus. We investigate the 152 POS tags defined in the Lancaster-Oslo-Bergen (LOB) corpus and the word-POS tag pairs. We derive a new vocabulary through word addition. Words paired with some POS tags have to be included in vocabularies of any size, whereas the inclusion of words paired with other POS tags varies with the target size of the vocabulary. The 152 POS tags are categorized according to whether word addition depends on the size of the vocabulary. Using expert knowledge, we first classify the POS tags and then apply different ways of word addition based on the POS tags paired with the words. The performance of the proposed method is measured in terms of coverage and is compared with that of vocabularies of the same size (5,000 words) derived from frequency lists. The vocabulary of the proposed method covers 95.18% of the words in the test short message service (SMS) text corpus, while the conventional vocabularies cover only 93.19% and 91.82% of the words appearing in the same SMS text corpus.

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young;Chang, Juno;Kihl, Taesuk
    • Journal of information and communication convergence engineering / Vol. 11, No. 4 / pp. 268-273 / 2013
  • In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
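The query-generation step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not define the matching score, so the sketch uses the share of a word's occurrences that fall in a category, and the names `matching_scores` and `build_query` and the toy documents are hypothetical. Sending the query to a search API is omitted.

```python
from collections import Counter


def matching_scores(docs_by_category):
    """Score each (category, word) pair.

    Assumption: the score is the share of a word's total occurrences that
    fall in the category. The abstract does not specify the real score.
    """
    per_category = {
        cat: Counter(word for doc in docs for word in doc.lower().split())
        for cat, docs in docs_by_category.items()
    }
    totals = Counter()
    for counts in per_category.values():
        totals.update(counts)
    return {
        cat: {word: count / totals[word] for word, count in counts.items()}
        for cat, counts in per_category.items()
    }


def build_query(category, scores, top_k=2):
    """Combine the category title with its top_k highest-scoring words."""
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores[category].items(), key=lambda kv: (-kv[1], kv[0]))
    top_words = [word for word, _ in ranked[:top_k]]
    return " ".join([category] + top_words)


# Toy training corpus (hypothetical data).
docs = {
    "sports": ["goal match team", "team wins match"],
    "finance": ["stock market falls", "market team merger"],
}
scores = matching_scores(docs)
query = build_query("sports", scores)
print(query)
```

In the approach the abstract describes, the snippets returned for such a query would then be added to the training data of the under-represented category.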

The significance of corpus callosal size in the estimation of neurologically abnormal infants

  • 유승택;이창우
    • Clinical and Experimental Pediatrics / Vol. 51, No. 11 / pp. 1205-1210 / 2008
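  • Purpose: We compared and quantitatively analyzed corpus callosum size in neonates with normal brain magnetic resonance imaging (MRI) findings, preterm infants with periventricular leukomalacia, and full-term infants with findings of hypoxic brain injury, in order to evaluate whether corpus callosum size can be a useful indicator of neurological course in the neonatal period. Methods: Among neonates born at Wonkwang University Hospital from September 2002 to February 2005 who underwent brain MRI because of suspected neurological abnormalities such as seizures, perinatal asphyxia, microcephaly, or floppy infant syndrome, we compared corpus callosum size in a total of 42 neonates: 15 with normal findings and 27 with abnormal findings. Cases with congenital brain malformations, chromosomal abnormalities, metabolic disorders, or nervous system infections were excluded. For each group, the maximum anteroposterior length of the corpus callosum, the maximum horizontal thickness of the genu, the maximum vertical thickness of the body, and the maximum horizontal thickness of the splenium were recorded on the midsagittal plane of the brain MRI using the automatic measurement function of the Wonkwang University Hospital image analysis system. Each thickness was then divided by the maximum anteroposterior length to obtain the ratios of genu, body, and splenium thickness to anteroposterior length. These ratios in the neonates with normal findings were compared with those in 19 full-term infants with findings of hypoxic-ischemic encephalopathy and 8 preterm infants with periventricular leukomalacia. Results: The ratios of genu and splenium thickness to anteroposterior length showed no statistically significant difference from the normal control group in either the full-term infants with hypoxic-ischemic encephalopathy or the preterm infants with periventricular leukomalacia. However, the ratio of body thickness to anteroposterior length differed significantly from the normal control group, with P values of 0.042 in the full-term infants with hypoxic-ischemic encephalopathy and 0.017 in the preterm infants with periventricular leukomalacia. Conclusion: Because the size and shape of the corpus callosum are good indicators of cerebral white matter volume and the degree of white matter myelination, quantitative measurement of corpus callosum size on brain MRI is considered useful not only for assessing brain development and the degree and extent of perinatal brain injury, but also for estimating later neurological prognosis, such as cerebral palsy or delayed mental development.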

Morphological Observations of Ovaries in Relation to Infertility in Slaughtered Cows in Kyungnam Province 1. Appearance of follicles and corpus luteums in cow ovaries

  • 양재훈;표병민;서득록;고필옥;강정부;김종섭;곽수동
    • Journal of the Korean Society of Veterinary Clinics (한국임상수의학회지) / Vol. 19, No. 2 / pp. 147-152 / 2002
  • Ovaries were collected from a total of 192 slaughtered cows (154 Korean native cows and 38 dairy cows) during slaughter at the Kimhae, Changyoung and Yangsan abattoirs in Kyungnam province from January 2001 to January 2002, and pregnancy rates and ovarian findings were investigated. Of the 192 slaughtered cows, 12.5% (24 cows) were pregnant; by breed, 11.0% (17 cows) of the 154 Korean native cows and 18.4% (7 cows) of the 38 dairy cows were pregnant. Fetuses in pregnant Korean native cows were mostly less than 4 months old, and fetuses in dairy cows were mostly about 7-8 months old. Cows in which the follicles and the corpus luteum of the same cow were each more than 5-6 mm in diameter accounted for 69.8% (134 cows) of the 192 slaughtered cows and, by breed, 64.7% (11 cows) of 17 Korean native cows and 57.1% (4 cows) of 7 dairy cows. In Korean native cows, the mean diameters of follicles and corpora lutea were $13.7 \pm 5.6 \times 11.2 \pm 4.6$ mm and $17.5 \pm 4.6 \times 14.6 \pm 4.0$ mm in non-pregnant cows, and $11.0 \pm 4.8 \times 9.1 \pm 2.6$ mm and $21.2 \pm 2.9 \times 18.3 \pm 2.7$ mm in pregnant cows, respectively. In dairy cows, the mean diameters of follicles and corpora lutea were $15.8 \pm 7.1 \times 14.3 \pm 6.0$ mm and $20.3 \pm 5.9 \times 16.9 \pm 5.8$ mm in non-pregnant cows, and $10.1 \pm 3.0 \times 9.2 \pm 2.3$ mm and $23.0 \pm 1.7 \times 20.1 \pm 1.3$ mm in pregnant cows, respectively. These findings indicate that the rate at which follicles and corpora lutea appear together in the same cow is high in both pregnant and non-pregnant cows. Comparing pregnant and non-pregnant ovaries, the mean follicle size was smaller and the corpus luteum size was larger in pregnant cows than in non-pregnant cows. Follicle size (Y) and corpus luteum size (X) in the same cow were inversely related. The regression formulas were Y = -0.2022X + 17.175 in Korean native cows and Y = -0.5754X + 24.153 in dairy cows.
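As a worked illustration of the two regression lines quoted in this abstract, the short sketch below evaluates them for a given corpus luteum size. The function name, the breed keys, and the assumption that both X and Y are in millimetres are illustrative and not stated in the abstract.

```python
# Slope and intercept of Y = slope * X + intercept, as quoted in the abstract.
# Y: follicle size, X: corpus luteum size (units assumed to be mm).
REGRESSION_COEFFICIENTS = {
    "korean_native": (-0.2022, 17.175),
    "dairy": (-0.5754, 24.153),
}


def predicted_follicle_size(corpus_luteum_size, breed):
    """Return the follicle size predicted by the quoted regression line."""
    slope, intercept = REGRESSION_COEFFICIENTS[breed]
    return slope * corpus_luteum_size + intercept


# Example: a corpus luteum of 20 (assumed mm).
native = predicted_follicle_size(20.0, "korean_native")  # -0.2022 * 20 + 17.175
dairy = predicted_follicle_size(20.0, "dairy")           # -0.5754 * 20 + 24.153
print(round(native, 3), round(dairy, 3))
```

Both slopes are negative, which matches the inverse relation the abstract reports: a larger corpus luteum goes with a smaller predicted follicle size.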

Morphologic Assessment of Corpus Callosum in the Patient of Alzheimer Disease using Magnetic Resonance Imaging

  • Seoung, Youl-Hun;Choe, Bo-Young
    • Journal of the Korean Magnetic Resonance Society (한국자기공명학회논문지) / Vol. 13, No. 2 / pp. 84-95 / 2009
  • The purpose of this study was to evaluate the usefulness of measuring corpus callosum (CC) size in Alzheimer patients using magnetic resonance (MR) midsagittal images. We performed MR scanning in three groups of 20 subjects each: a normal elderly group, a mild cognitive impairment (MCI) group, and an Alzheimer disease (AD) group. The following parameters were employed in the AD group: TR/TE/FA 6650 ms/66 ms/$90^{\circ}$, NEX 2, thickness/gap 2/0, FOV 220 mm. The magnetic field strength was 3.0 Tesla. We selected the midsagittal image of the brain using the view forum program and measured CC size, namely the anteroposterior length and the diameters of the genu, body, narrowing portion, and splenium. The present study demonstrates that CC size in Alzheimer disease, specifically the diameters of the genu, body, and splenium, can be useful for clinical assessment.

Relationship between corpus luteum size as determined by ultrasonography and milk progesterone concentration during the estrous cycle in dairy cows

  • 손창호;강병규;최한선
    • Korean Journal of Veterinary Research (대한수의학회지) / Vol. 35, No. 4 / pp. 833-841 / 1995
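  • To examine the correlation between corpus luteum size and progesterone concentration during the estrous cycle, corpus luteum size was measured by ultrasonography in 16 dairy cows. Of the 16 cows examined, 9 had a corpus luteum with a cavity and the remaining 7 had a normal corpus luteum (corpus luteum without a cavity). Milk sampling for progesterone measurement and ultrasonographic examination were performed every 2 days from the day of ovulation to day 12 after ovulation, and daily from day 14 until the next ovulation. During the estrous cycle, corpus luteum size and progesterone concentration did not differ significantly between cows having a corpus luteum with a cavity and cows having one without a cavity. During the luteal growth phase (Days 2~8), the correlation between corpus luteum size and progesterone concentration was 0.71 (p<0.0001) in cows having a corpus luteum with a cavity and 0.74 (p<0.0001) in cows having a normal corpus luteum; during the luteal regression phase (Days -6~0), it was 0.73 (p<0.0001) and 0.76 (p<0.0001), respectively; and overall it was 0.70 (p<0.0001). Because corpus luteum size and progesterone concentration are thus closely correlated during the estrous cycle, measurement of corpus luteum size by ultrasonography is considered usable as a means of estimating progesterone concentration.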

Development of Differential Diagnosis and Treatment Method of Reproductive Disorders Using Ultrasonography in Cows III. Differential Diagnosis between Developing and Regressing Corpus Luteum

  • 손창호;강병규;최한선;임원호;강현구;오기석;신종봉;서국현
    • Journal of the Korean Society of Veterinary Clinics (한국임상수의학회지) / Vol. 16, No. 1 / pp. 118-127 / 1999
  • The aim of this study was to establish a method for the differential diagnosis of developing and regressing corpora lutea in cows. Plasma progesterone (P$_4$) concentrations were determined by radioimmunoassay in slaughtered, cycling and pregnant cows. Ultrasonography was used to measure corpus luteum size and histogram values in order to determine the correlations between corpus luteum area or histogram values and plasma P$_4$ concentrations. The corpora lutea of 196 slaughtered cows were monitored in vitro (water-bath scanning) by ultrasonography with a 7.5 MHz linear-array transducer. The correlation coefficient between corpus luteum area and plasma P$_4$ concentrations was 0.46 (p<0.01), and that between histogram values and plasma P$_4$ concentrations was -0.44 (p<0.01). The corpora lutea of 188 cycling and 30 pregnant cows were monitored by ultrasonography with a 5.0 MHz linear-array transrectal transducer. Corpus luteum areas and plasma P$_4$ concentrations differed significantly between regressing and other corpora lutea (p<0.01), and histogram values differed significantly between regressing and developing corpora lutea (p<0.01). The correlation coefficients between corpus luteum areas and plasma P$_4$ concentrations were 0.76 (p<0.01), 0.71 (p<0.01), 0.65 (p<0.05) and 0.68 (p<0.05), and those between histogram values and plasma P$_4$ concentrations were 0.74 (p<0.05), 0.71 (p<0.01), -0.52 (p<0.05) and 0.65 (p<0.05) in developing, functional, regressing and pregnant corpora lutea, respectively. These results indicate that corpus luteum areas and plasma P$_4$ concentrations were highly correlated at all stages of the corpus luteum. Histogram values and plasma P$_4$ concentrations were positively correlated in developing, functional and pregnant corpora lutea, but negatively correlated in regressing corpora lutea. Therefore, measurement of corpus luteum area and histogram value by ultrasonography is a reliable method for assessing luteal function, especially in developing and regressing corpora lutea.
