• Title/Summary/Keyword: Corpus analysis

A Corpus-based study on the Effects of Gender on Voiceless Fricatives in American English

  • Yoon, Tae-Jin
    • Phonetics and Speech Sciences / v.7 no.1 / pp.117-124 / 2015
  • This paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of gender in the production of fricatives in American English. The TIMIT database includes 630 talkers and 2342 different sentences, comprising over five hours of speech. Spectral and temporal properties are analyzed with gender treated as an independent factor. The acoustic analyses revealed that most acoustic properties of voiceless sibilants differed between male and female speakers, whereas those of voiceless non-sibilants did not. A classification experiment using linear discriminant analysis (LDA) revealed that 85.73% of voiceless fricatives are correctly classified. The sibilants are 88.61% correctly classified, whereas the non-sibilants are only 57.91% correctly classified. The majority of the errors come from the misclassification of /θ/ as [f]. The average accuracy of gender classification is 77.67%. Most of the errors come from the classification of female speakers in non-sibilants. The results are explained in terms of biological differences as well as macro-social factors. The paper contributes to the understanding of the role of gender in a large-scale speech corpus.

Expression of PAPP-A and $20{\alpha}$-HSD in the Bovine Corpus Luteum during Early Pregnancy (소의 초기 임신 황체에서 PAPP-A와 $20{\alpha}$-HSD의 발현 양상)

  • Kim, Dae-Seung;Kim, Sang-Hwan;Yoon, Jong-Taek
    • Journal of Embryo Transfer / v.26 no.1 / pp.57-63 / 2011
  • This study examined the expression of pregnancy-associated plasma protein-A (PAPP-A) and 20alpha-hydroxysteroid dehydrogenase ($20{\alpha}$-HSD) in the bovine corpus luteum during early pregnancy. To determine the function of the PAPP-A gene during early pregnancy, we collected bovine corpus luteum samples on days 30, 60 and 90 of pregnancy. The mRNA expression of the PAPP-A, $20{\alpha}$-HSD, progesterone receptor (PR) and insulin-like growth factor binding protein 4 (IGFBP4) genes was measured by real-time PCR. In parallel with the mRNA levels, the protein expression of PAPP-A and $20{\alpha}$-HSD was detected by immunological analysis. The mRNA expression of $20{\alpha}$-HSD and PAPP-A increased significantly on day 90 in the corpus luteum during pregnancy. The mRNA expression of PR and IGFBP4 in the corpus luteum increased progressively from day 30 to day 60 but decreased on day 90 of pregnancy. The expression patterns of PAPP-A and $20{\alpha}$-HSD were similar in these tissues. In conclusion, PAPP-A and $20{\alpha}$-HSD activity in the corpus luteum may play a role in the manifestation of early pregnancy.

A Corpus-based Analysis of EFL Learners' Use of Hedges in Cross-cultural Communication

  • Min, Su-Jung
    • English Language & Literature Teaching / v.16 no.4 / pp.91-106 / 2010
  • This study examines the use of hedges in cross-cultural communication between EFL learners in an e-learning environment. The study analyzes the use of hedges in a corpus drawn from an interactive website with a bulletin board system, through which college students of English at Japanese and Korean universities discussed local and global issues with each other. It compares the use of hedges in the students' corpus to that of a native English speakers' corpus. The results show that EFL learners tend to use a relatively smaller number of hedges than the native speakers in terms of total token frequency. They further reveal that the learners' overuse of a single versatile high-frequency hedging item, "I think", results in relative underuse of other hedging devices. This indicates that, because of their small repertoire of hedges, EFL learners' overuse of a limited number of hedging items may make their speech or writing appear less competent. Based on the results and on interviews with the learners, the study also argues that hedging should be understood in its social context and not merely as a lack of conviction or a mark of low proficiency. Suggestions are made for using computer corpora to understand EFL learners' language difficulties and to help them develop communicative and pragmatic competence.

Phoneme distribution and phonological processes of orthographic and pronounced phrasal words in light of syllable structure in the Seoul Corpus (음절구조로 본 서울코퍼스의 글 어절과 말 어절의 음소분포와 음운변동)

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.8 no.3 / pp.1-9 / 2016
  • This paper investigated the phoneme distribution and phonological processes of orthographic and pronounced phrasal words in light of syllable structure in the Seoul Corpus in order to provide linguists and phoneticians with a clearer understanding of the Korean language system. To this end, the phrasal words were extracted from the transcribed label scripts of the Seoul Corpus using Praat. The onsets, peaks, codas and syllable types of the phrasal words were then analyzed using an R script. Results revealed that k0 was the most frequently used onset in both orthographic and pronounced phrasal words. Also, aa was the most favored vowel in the Korean syllable peak, with fewer phonological processes in its pronounced form. The total proportion of all diphthongs according to the frequency of the peaks in the orthographic phrasal words was 8.8%, which was almost double that found in the pronounced phrasal words. For the codas, nn accounted for 34.4% of the total pronounced phrasal words and was the varied form. In the syllable type classification of the corpus, CV was the most frequent type in the orthographic forms, followed by CVC, V, and VC. Overall, onsets were more prevalent than codas in the pronounced forms. From the results, this paper concluded that an analysis of phoneme distribution and phonological processes in light of syllable structure can contribute greatly to the understanding of the phonology of spoken Korean.
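
As a rough illustration of the syllable type classification mentioned in the abstract above, the sketch below labels each syllable as CV, CVC, V, or VC from its onset, peak, and coda, and then tallies the types. The paper used an R script; this Python version, the function name `syllable_type`, and the five example syllables are assumptions made purely for illustration and are not taken from the Seoul Corpus.

```python
from collections import Counter

def syllable_type(onset, peak, coda):
    """Return the CV skeleton of one syllable ('V', 'CV', 'VC' or 'CVC')."""
    if not peak:
        raise ValueError("a syllable needs a peak (vowel)")
    # An empty string means the onset or coda slot is unfilled.
    return ("C" if onset else "") + "V" + ("C" if coda else "")

# Hypothetical (onset, peak, coda) triples using the labels k0, aa, nn
# named in the abstract; these are illustrative, not corpus data.
syllables = [
    ("k0", "aa", ""),
    ("", "aa", "nn"),
    ("k0", "aa", "nn"),
    ("", "aa", ""),
    ("k0", "aa", ""),
]

type_counts = Counter(syllable_type(*s) for s in syllables)
for label, count in type_counts.most_common():
    print(label, count)
```

Treating an empty string as an unfilled slot keeps the classifier independent of any particular phoneme inventory, so the same function could be applied to orthographic and pronounced forms alike.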

Syllable-based Probabilistic Models for Korean Morphological Analysis (한국어 형태소 분석을 위한 음절 단위 확률 모델)

  • Shim, Kwangseob
    • Journal of KIISE / v.41 no.9 / pp.642-651 / 2014
  • This paper proposes three probabilistic models for syllable-based Korean morphological analysis and reports their performance. Probabilities for the models are acquired from a POS-tagged corpus. The results of 10-fold cross-validation experiments show that a 98.3% answer inclusion rate is achieved when the models are trained with the Sejong POS-tagged corpus of 10 million eojeols. In our models, POS tags are assigned to each syllable before spelling recovery and morpheme generation, which enables more efficient morphological analysis than the previous probabilistic models, in which spelling recovery is performed at the first stage. This efficiency speeds up morphological analysis. Experiments show that morphological analysis is performed at a rate of 147K eojeols per second, which is almost 174 times faster than the previous probabilistic models for Korean morphology.

Extracting Multiword Sentiment Expressions by Using a Domain-Specific Corpus and a Seed Lexicon

  • Lee, Kong-Joo;Kim, Jee-Eun;Yun, Bo-Hyun
    • ETRI Journal / v.35 no.5 / pp.838-848 / 2013
  • This paper presents a novel approach to automatically generate Korean multiword sentiment expressions by using a seed sentiment lexicon and a large-scale domain-specific corpus. A multiword sentiment expression consists of a seed sentiment word and its contextual words occurring adjacent to the seed word. The multiword sentiment expressions that are the focus of our study have a different polarity from that of the seed sentiment word. The automatically extracted multiword sentiment expressions show that 1) the contextual words should be defined as a part of a multiword sentiment expression in addition to their corresponding seed sentiment word, 2) the identified multiword sentiment expressions contain various indicators for polarity shift that have rarely been recognized before, and 3) the newly recognized shifters contribute to assigning a more accurate polarity value. The empirical result shows that the proposed approach achieves improved performance of the sentiment analysis system that uses an automatically generated lexicon.

Generative probabilistic model with Dirichlet prior distribution for similarity analysis of research topic

  • Milyahilu, John;Kim, Jong Nam
    • Journal of Korea Multimedia Society / v.23 no.4 / pp.595-602 / 2020
  • We propose a generative probabilistic model with a Dirichlet prior distribution for topic modeling and text similarity analysis. It assigns a topic and calculates text correlation between documents within a corpus. It also provides posterior probabilities that are assigned to each topic of a document based on the prior distribution in the corpus. We then present a Gibbs sampling algorithm for inference about the posterior distribution and compute text correlation among 50 abstracts from papers published by IEEE. We also conduct supervised learning to set a benchmark that justifies the performance of LDA (Latent Dirichlet Allocation). The experiments show that the accuracy of topic assignment to a given document is 76% for LDA. The results for supervised learning show an accuracy of 61%, a precision of 93% and an F1-score of 96%. The discussion of the experimental results justifies the topic assignments in terms of probabilities, distributions, evaluation metrics and correlation coefficients.

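A Contrastive Study of Restrictive Function Words in Korean and Chinese: Focusing on Korean '만, 밖에, 뿐' and Chinese '只, 光, 僅' (한·중 한정 기능어 대조 연구 -한국어 '만, 밖에, 뿐'과 중국어 '지(只), 광(光), 근(僅)'을 중심으로-)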

  • Jeong, Bi
    • 중국학논총 / no.62 / pp.49-69 / 2019
  • This study applies the methodology of usage-pattern research to the Korean auxiliary particles '만, 밖에, 뿐' and the Chinese range adverbs '只, 光, 僅'. Using corpora constructed from the actual language data of native Korean speakers and native Chinese speakers, we examined the usage patterns of the auxiliary particles '만, 밖에, 뿐' and the range adverbs '只, 光, 僅' respectively. From the Korean and Chinese corpora, 300 sentences were collected for each of the Korean auxiliary particles '만, 밖에, 뿐' and the Chinese range adverbs '只, 光, 僅', giving a total of 1,800 sentences as the analytical corpus. Through the analysis of the examples, features and differences such as the rate of occurrence and the environments of occurrence in Korean and Chinese are revealed. The analysis results for Korean and Chinese are then compared to find commonalities and differences.

A Corpus Analysis of British-American Children's Adventure Novels: Treasure Island (영미 아동 모험 소설에 관한 코퍼스 분석 연구: 『보물섬』을 중심으로)

  • Choi, Eunsaem;Jung, Chae Kwan
    • The Journal of the Korea Contents Association / v.21 no.1 / pp.333-342 / 2021
  • In this study, we analyzed the vocabulary, lemmas, keywords, and n-grams in 『Treasure Island』 to identify certain linguistic features of this British-American children's adventure novel. The study found that, contrary to the popular claim that frequently used words are important and essential to a story, the frequently used words in 『Treasure Island』 were mostly function words and proper nouns that were not directly related to the plot. We also ascertained that a list of keywords produced statistically by a corpus program was not sufficient to infer the story of 『Treasure Island』. However, we managed to extract 30 keywords through a first, quantitative keyword analysis followed by a second, qualitative keyword analysis. We also carried out a series of n-gram analyses and were able to discover lexical bundles that were preferred and frequently used by the author of 『Treasure Island』. We hope that the results of this study will help disseminate knowledge about British-American children's literature and further advance corpus stylistic theory.
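
The n-gram analysis mentioned in the abstract above can be sketched in a few lines of Python. This is only a minimal illustration of counting contiguous word sequences, not the corpus program or procedure used in the study; the helper names `tokenize` and `ngram_counts` and the sample sentence are assumptions invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every contiguous n-token sequence in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up sample sentence for illustration; it is not a quotation from the novel.
sample = "the old sea dog sat by the fire and the old sea dog sang"
tokens = tokenize(sample)

bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(2))
```

Recurring sequences such as "the old sea dog" are the kind of candidates a lexical-bundle analysis would then filter by frequency and distribution thresholds, which this sketch leaves out.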

An Analysis on Korean Intonation Patterns Using Momel (Momel을 이용한 한국어의 억양 패턴 분석)

  • Kim, Sun-Hee;Yoo, Hyun-Ji
    • Proceedings of the KSPS conference / 2007.05a / pp.243-246 / 2007
  • This paper proposes an intonation labeling method using Momel and presents the results of applying it to a speech corpus consisting of 80 passages pronounced by 4 speakers (2 male and 2 female). The results show that Momel works well enough to derive meaningful pitch targets, which could be labeled with H and L tones. In addition, the results of the analysis of the Korean speech corpus are consistent with earlier work.
