• Title/Summary/Keyword: Corpus analysis

An Effective Estimation method for Lexical Probabilities in Korean Lexical Disambiguation (한국어 어휘 중의성 해소에서 어휘 확률에 대한 효과적인 평가 방법)

  • Lee, Ha-Gyu
    • The Transactions of the Korea Information Processing Society / v.3 no.6 / pp.1588-1597 / 1996
  • This paper describes an estimation method for lexical probabilities in Korean lexical disambiguation. In the stochastic approach to lexical disambiguation, lexical probabilities and contextual probabilities are generally estimated from statistical data extracted from corpora. For Korean, it is desirable to apply lexical probabilities at the level of word phrases, because sentences are spaced in units of word phrases. However, Korean word phrases take so many forms that, even with fairly large corpora, lexical probabilities sometimes cannot be estimated directly for a given word phrase. To overcome this problem, this research defines a similarity between word phrases from the viewpoint of lexical analysis and proposes an estimation method for Korean lexical probabilities based on that similarity. In this method, when a lexical probability for a word phrase cannot be estimated directly, it is estimated indirectly through a word phrase similar to the given one. Experimental results show that the proposed approach is effective for Korean lexical disambiguation.
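The fallback idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how similarity is defined, so the `same_suffix` function below is a hypothetical stand-in, and the romanized word phrases and tag patterns are toy data with no linguistic claim attached.

```python
from collections import Counter, defaultdict


def build_counts(tagged_corpus):
    """Count how often each word phrase occurs with each analysis (tag pattern)."""
    counts = defaultdict(Counter)
    for phrase, analysis in tagged_corpus:
        counts[phrase][analysis] += 1
    return counts


def same_suffix(phrase, counts):
    """Hypothetical similarity: the most frequent seen phrase sharing the suffix
    after the last hyphen. The paper's real similarity measure is not given."""
    suffix = phrase.rsplit("-", 1)[-1]
    candidates = [p for p in counts if p.rsplit("-", 1)[-1] == suffix]
    if not candidates:
        return None
    return max(candidates, key=lambda p: sum(counts[p].values()))


def lexical_probs(phrase, counts, find_similar=same_suffix):
    """Estimate P(analysis | phrase); fall back to a similar phrase if unseen."""
    source = phrase if phrase in counts else find_similar(phrase, counts)
    if source is None:
        return {}  # no direct or indirect estimate is available
    total = sum(counts[source].values())
    return {analysis: n / total for analysis, n in counts[source].items()}


# Toy corpus of (word phrase, tag pattern) pairs.
corpus = [
    ("na-neun", "pronoun+particle"),
    ("na-neun", "verb+ending"),
    ("na-neun", "pronoun+particle"),
    ("ga-neun", "verb+ending"),
]
counts = build_counts(corpus)
direct = lexical_probs("ga-neun", counts)    # seen phrase: direct estimate
indirect = lexical_probs("ja-neun", counts)  # unseen phrase: borrowed estimate
```

Here the unseen phrase borrows its distribution from `na-neun`, the most frequent seen phrase with the same suffix. A real system would swap in a similarity grounded in lexical analysis, as the abstract describes.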

Clinical Analysis and Investigation for the Infertile Women with Hyperprolactinemia (불임환자의 고 Prolactin 혈증에 관한 연구)

  • Kang, S.B.;Kang, B.M.;Kim, J.G.;Lee, J.Y.;Chang, Y.S.
    • Clinical and Experimental Reproductive Medicine / v.13 no.1 / pp.21-28 / 1986
  • It is now apparent that many cases of amenorrhea, oligomenorrhea, corpus luteum deficiency, galactorrhea, and infertility are due to hyperprolactinemia. We investigated the relationships between serum prolactin values and factors such as menstrual pattern and frequency of galactorrhea in 135 hyperprolactinemic patients at the Seoul National University Hospital over a period of 6 years, from January 1979 to December 1984. The results were as follows: 1. Menstrual pattern changed according to the serum prolactin level. The frequency of amenorrhea was 1.7 percent in patients with serum prolactin levels ranging from 25 to 40 ng/ml, whereas it was 72.4 percent in patients with serum prolactin levels above 100 ng/ml. 2. The incidence of galactorrhea in hyperprolactinemic patients was 3.1 percent, and the frequency of galactorrhea was directly related to the serum prolactin level and/or the frequency of abnormal menstrual pattern. 3. The incidence of pituitary tumor in hyperprolactinemic patients was 104 percent, and sixty percent of patients with serum prolactin levels above 100 ng/ml had a pituitary tumor. 4. There was an inverse correlation between serum prolactin and progesterone values. 5. The frequency of anovulatory menstrual cycles, as evidenced by basal body temperature, was 23.9 percent in patients with serum prolactin levels ranging from 20 to 40 ng/ml, whereas it was 76.9 percent in patients with serum prolactin levels above 100 ng/ml.

Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning (다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석)

  • Kim, GyeongMin;Han, Seunggnyu;Oh, Dongsuk;Lim, HeuiSeok
    • Journal of Digital Convergence / v.17 no.12 / pp.243-248 / 2019
  • Multi-Task Learning (MTL) is a training method in which a single neural network is trained on multiple tasks that influence each other. In this paper, we compare the performance of an MTL named entity recognition (NER) model trained on a Korean traditional culture corpus with that of other NER models. In the training process, the Bi-LSTM layers for part-of-speech tagging (POS tagging) and NER are each propagated from a Bi-LSTM layer to obtain the joint loss. As a result, the MTL-based Bi-LSTM model shows a 1.1%~4.6% performance improvement over single Bi-LSTM models.

The effect of word frequency on the reduction of English CVCC syllables in spontaneous speech

  • Kim, Jungsun
    • Phonetics and Speech Sciences / v.7 no.3 / pp.45-53 / 2015
  • The current study investigated CVCC syllables in spontaneous American English speech to find out whether such syllables are produced as phonological units with a string of segments, showing a hierarchical structure. Transcribed data from the Buckeye Speech Corpus were used for the analysis. The results showed that the constituents within a CVCC syllable as a phonological unit may have phonetic variations (namely, the final coda may undergo deletion). First, voiceless alveolar stops were the most frequently deleted when they occurred as the second final coda consonant of a CVCC syllable; this deletion may be an intermediate process on the way from the abstract form CVCC (with the rime VCC) to the actual pronunciation CVC (with the rime VC), a production strategy employed by some individual speakers. Second, in the internal structure of the rime, the proportion of deletion of the final coda consonant depended on the frequency of the word rather than on the position of postvocalic consonants on the sonority hierarchy. Finally, the segment following the consonant cluster proved to have an effect on the reduction of that cluster; more precisely, a contrast was observed between obstruents and non-obstruents, reflecting the effect of sonority: when the segment following the consonant cluster was an obstruent, the proportion of deletion of the final coda consonant increased. Among these results, the effect of word frequency played a critical role in promoting the deletion of the second coda consonant for clusters in CVCC syllables in spontaneous speech. The current study implies that the structure of syllables as phonological units can vary depending on individual speakers' lexical representation.

Pronunciation Variation Patterns of Loanwords Produced by Korean and Grapheme-to-Phoneme Conversion Using Syllable-based Segmentation and Phonological Knowledge (한국인 화자의 외래어 발음 변이 양상과 음절 기반 외래어 자소-음소 변환)

  • Ryu, Hyuksu;Na, Minsu;Chung, Minhwa
    • Phonetics and Speech Sciences / v.7 no.3 / pp.139-149 / 2015
  • This paper aims to analyze pronunciation variations of loanwords produced by Korean speakers and to improve the performance of pronunciation modeling of loanwords in Korean by using syllable-based segmentation and phonological knowledge. The loanword text corpus used for our experiment consists of 14.5k words extracted from frequently used words in the set-top box, music, and point-of-interest (POI) domains. First, pronunciations of loanwords in Korean are obtained by manual transcription and used as target pronunciations. The target pronunciations are compared with the standard pronunciation using confusion matrices to analyze pronunciation variation patterns of loanwords. Based on the confusion matrices, three salient pronunciation variations of loanwords are identified, such as tensification of the fricative [s] and derounding of the rounded vowels [ɥi] and [wɛ]. In addition, a syllable-based segmentation method that takes phonological knowledge into account is proposed for loanword pronunciation modeling. Performance of the baseline and the proposed method is measured using phone error rate (PER)/word error rate (WER) and F-score at various context spans. Experimental results show that the proposed method outperforms the baseline. We also observe that performance degrades when training and test sets come from different domains, which implies that loanword pronunciations are influenced by data domains. It is noteworthy that pronunciation modeling for loanwords is enhanced by reflecting phonological knowledge. The loanword pronunciation modeling in Korean proposed in this paper can be used for automatic speech recognition in application interfaces such as navigation systems and set-top boxes and for computer-assisted pronunciation training for Korean learners of English.
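The abstract reports phone error rate (PER) without giving a formula. A minimal sketch is shown below, assuming the common definition: the total edit distance (substitutions, deletions, insertions) between predicted and target phone sequences, divided by the total number of target phones. The function names and the toy phone sequences are illustrative, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a hypothesis phone
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(pairs):
    """PER over (reference, hypothesis) phone-sequence pairs."""
    total = sum(len(ref) for ref, _ in pairs)
    if total == 0:
        raise ValueError("reference sequences contain no phones")
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return errors / total


# Toy example: one tensified [s] out of five target phones.
pairs = [
    (["s'", "a", "i"], ["s", "a", "i"]),
    (["b", "a"], ["b", "a"]),
]
per = phone_error_rate(pairs)
```

Word error rate can be computed the same way by treating each sequence element as a word rather than a phone.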

Korean Syntactic Rules using Composite Labels (복합 레이블을 적용한 한국어 구문 규칙)

  • 김성용;이공주;최기선
    • Journal of KIISE:Software and Applications / v.31 no.2 / pp.235-244 / 2004
  • We propose a format for a binary phrase structure grammar with composite labels. The grammar adopts binary rules so that the dependency between two sub-trees can be represented in the label of the tree. The label of a tree is composed of two attributes, one extracted from each sub-tree, so that it can represent the compositional information of the tree. The composite label is generated from part-of-speech tags using an automatic labeling algorithm. Since the proposed rule description scheme is binary and uses only part-of-speech information, it can readily be used in dependency grammar and applied to other languages as well. In the best-1 context-free cross validation on a 31,080 tree-tagged corpus, the labeled precision is 79.30%, which outperforms phrase structure grammar and dependency grammar by 5% and 4%, respectively. This shows that the proposed rule description scheme is effective for parsing Korean.
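The abstract does not define the labeled precision metric it reports. The sketch below assumes the usual PARSEVAL-style definition: the fraction of constituents in the parser output whose label and span both match a constituent in the gold tree. The `(label, start, end)` representation and the toy trees are illustrative assumptions.

```python
def labeled_precision(gold, predicted):
    """Share of predicted (label, start, end) constituents found in the gold set."""
    if not predicted:
        raise ValueError("parser output contains no constituents")
    correct = len(set(predicted) & set(gold))
    return correct / len(set(predicted))


# Toy trees over a three-token sentence; spans are [start, end) token offsets.
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
predicted = {("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3), ("NP", 1, 2)}
precision = labeled_precision(gold, predicted)
```

Two of the four predicted constituents match the gold tree here, so the sketch yields a precision of 0.5. Using sets ignores duplicate constituents, which is a simplification.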

Building Specialized Language Model for National R&D through Knowledge Transfer Based on Further Pre-training (추가 사전학습 기반 지식 전이를 통한 국가 R&D 전문 언어모델 구축)

  • Yu, Eunji;Seo, Sumin;Kim, Namgyu
    • Knowledge Management Research / v.22 no.3 / pp.91-106 / 2021
  • With the recent rapid development of deep learning technology, the demand for analyzing huge text documents in the national R&D field from various perspectives is rapidly increasing. In particular, interest is growing in applying the BERT (Bidirectional Encoder Representations from Transformers) language model, which is pre-trained on a large corpus. However, the terminology used frequently in highly specialized fields such as national R&D is often not sufficiently learned by basic BERT. This has been pointed out as a limitation in understanding documents in specialized fields through BERT. Therefore, this study proposes a method to build an R&D KoBERT language model that transfers national R&D field knowledge to basic BERT using further pre-training. In addition, to evaluate the performance of the proposed model, we performed classification analysis on about 116,000 R&D reports in the health care and information and communication fields. Experimental results showed that the proposed model achieved higher accuracy than the pure KoBERT model.

Online blind source separation and dereverberation of speech based on a joint diagonalizability constraint (공동 행렬대각화 조건 기반 온라인 음원 신호 분리 및 잔향제거)

  • Yu, Ho-Gun;Kim, Do-Hui;Song, Min-Hwan;Park, Hyung-Min
    • The Journal of the Acoustical Society of Korea / v.40 no.5 / pp.503-514 / 2021
  • Reverberation in speech signals tends to significantly degrade the performance of the Blind Source Separation (BSS) system. Especially in online systems, the performance degradation becomes severe. Methods based on joint diagonalizability constraints have been recently developed to tackle the problem. To improve the quality of separated speech, in this paper, we add the proposed de-reverberation method to the online BSS algorithm based on the constraints in reverberant environments. Through experiments on the WSJCAM0 corpus, the proposed method was compared with the existing online BSS algorithm. The performance evaluation by the Signal-to-Distortion Ratio and the Perceptual Evaluation of Speech Quality demonstrated that SDR improved from 1.23 dB to 3.76 dB and PESQ improved from 1.15 to 2.12 on average.
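For readers unfamiliar with the SDR figures quoted above, here is a deliberately simplified sketch of a signal-to-distortion ratio in decibels. It assumes the plain energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²), and is not the evaluation procedure used in the paper, which the abstract does not specify. The sample values are invented for illustration.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR: reference energy over error energy, in dB."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0 or error_energy == 0:
        raise ValueError("SDR is undefined for zero signal or zero error energy")
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signals: the estimate is the reference scaled by 0.9.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
value = sdr_db(reference, estimate)  # error energy is 1% of signal energy
```

Because the metric is logarithmic, each 10 dB increase corresponds to a tenfold reduction in error energy relative to the signal.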

The Stream of Uncertainty in Scientific Knowledge using Topic Modeling (토픽 모델링 기반 과학적 지식의 불확실성의 흐름에 관한 연구)

  • Heo, Go Eun
    • Journal of the Korean Society for Information Management / v.36 no.1 / pp.191-213 / 2019
  • The process of obtaining scientific knowledge is conducted through research. Researchers deal with the uncertainty of science and establish the certainty of scientific knowledge. In other words, working through uncertainty is an essential step in obtaining scientific knowledge. Existing studies have predominantly examined hedging from a linguistic perspective or, in computational linguistics, manually constructed corpora annotated with uncertainty words. They have only been able to identify characteristics of uncertainty in a particular research field based on simple frequency. Therefore, in this study, we examine patterns of scientific knowledge over time based on uncertainty words in biomedical literature, where biomedical claims in sentences play an important role. For this purpose, biomedical propositions are analyzed based on semantic predications provided by UMLS, and DMR topic modeling, a useful method for identifying patterns in disciplines, is applied to understand the trends of entity-based topics with uncertainty. The results confirm that, as research develops over time, uncertainty in scientific knowledge tends to decrease.

Pronunciation of the Korean diphthong /jo/: Phonetic realizations and acoustic properties (한국어 /ㅛ/의 발음 양상 연구: 발음형 빈도와 음향적 특징을 중심으로)

  • Hyangwon Lee
    • Phonetics and Speech Sciences / v.15 no.1 / pp.9-17 / 2023
  • The purpose of this study is to determine how the Korean diphthong /jo/ shows phonetic variation in various linguistic environments. The pronunciation of /jo/ is discussed, focusing on the relationship between phonetic variation and the distribution range of vowels. The location in a word (monosyllable, word-initial, word-medial, word-final) and word class (content word, function word) were analyzed using the speech of 10 female speakers from the Seoul Corpus. An analysis of the frequency of /jo/ in each environment showed that the pronunciation type and word class were affected by the location in a word. In the acoustic analysis, frequent phonetic reduction was observed for /jo/ in function words. The word class did not change the average phonetic values of /jo/ but changed the distribution of individual tokens. These results indicate that the linguistic environment affects the phonetic distribution of vowels.