• Title/Summary/Keyword: Corpus-based Analysis

A Corpus-Based Longitudinal Study of Diction in Chinese and British News Reports on Chang'e Project

  • Lu, Rong;Xie, Xue;Qi, Jiashuang;Ali, Afida Mohamad;Zhao, Jie
    • Asia Pacific Journal of Corpus Research
    • /
    • v.3 no.1
    • /
    • pp.1-20
    • /
    • 2022
  • As a milestone in China's space exploration history, the Chang'e Project has attracted considerable media attention since its first launch. This study examines and compares how the Chinese and British media use nouns, verbs, and adjectives in reporting on the Chang'e Project. After categorising the documents by project phase, we created two diachronic corpora to explore linguistic shifts, similarities, and differences in the diction that the Chinese and British media employ regarding the Chang'e Project and its ideology. This longitudinal study was performed with LancsBox and the CLAWS web tagger, with critical discourse analysis as the theoretical framework. The findings showed that Chang'e Project coverage in both media increased annually, especially after 2019. In contrast to the objectivity and positivity of the Chinese media, the British media seemed more subjective, using more appraisal adjectives in their news reports. Nonetheless, the media of both countries tried to be objective and formal in choosing nouns and verbs. Ideology-wise, the Chinese news reports portrayed domestic circumstances more positively, while their British counterparts were typically more critical. Notably, the study outcomes could catalyse future research on the Chang'e Project and facilitate diplomatic policies.

Predicting CEFR Levels in L2 Oral Speech, Based on Lexical and Syntactic Complexity

  • Hu, Xiaolin
    • Asia Pacific Journal of Corpus Research
    • /
    • v.2 no.1
    • /
    • pp.35-45
    • /
    • 2021
  • With the widespread adoption of the Common European Framework of Reference (CEFR) scales, many studies attempt to apply them in routine teaching and rater training, while more evidence regarding criterial features at different CEFR levels is still urgently needed. The current study aims to explore complexity features that distinguish and predict CEFR proficiency levels in oral performance. Using a quantitative, corpus-based approach, this research analyzed lexical and syntactic complexity features in over 80 transcriptions (covering the A1, A2, and B1 CEFR levels, and native speakers) from an interview test, the Standard Speaking Test (SST). ANOVA and correlation analysis were conducted to exclude insignificant complexity indices before the discriminant analysis. As a result, distinctive differences in complexity between CEFR speaking levels were observed, and with a combination of six major complexity features as predictors, 78.8% of the oral transcriptions were classified into the appropriate CEFR proficiency levels. This further confirms the possibility of predicting the CEFR level of L2 learners based on their objective linguistic features. This study can serve as an empirical reference in language pedagogy, especially for L2 learners' self-assessment and teachers' prediction of students' proficiency levels. It also offers implications for the validation of rating criteria and the improvement of rating systems.

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.1-13
    • /
    • 2015
  • As opinion mining in big data applications has gained attention, much research has been conducted on unstructured data. Social media on the Internet generate unstructured or semi-structured data every second, often in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which yields incorrect results that are far from users' intentions. Even though much progress has been made in recent years in enhancing search engine performance to provide users with appropriate results, there is still much room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries, avoiding expensive sense-tagging processes. It tests the effectiveness of the method with the Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. For the experiment, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both in combination and as separate entities using cross validation. Only nouns, the target of word sense disambiguation, were selected. In addition, 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were combined in the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because it was tagged based on the sense indices defined by that dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added to the named entity dictionary of a Korean morphological analyzer. Using the extended named entity dictionary, terms were extracted from the input sentences and term vectors for the sentences were created. Given the extracted term vector and the sense vector model built during the pre-processing stage, the sense-tagged terms were determined by vector space model based word sense disambiguation. In addition, this study shows the effectiveness of the corpus merged from the examples in the Korean standard unabridged dictionary and the Sejong Corpus. The experiment shows that better precision and recall are obtained with the merged corpus. This study suggests that the method can practically enhance the performance of internet search engines and help capture the meaning of a sentence more accurately in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Even though this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
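
The supervised step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English examples, the sense labels, the function names `train` and `predict`, and the add-one (Laplace) smoothing are all assumptions made for illustration. Each training example plays the role of a dictionary example sentence already tagged with a sense.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count senses and context words from (sense, tokens) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, tokens in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(tokens)
        vocab.update(tokens)
    return sense_counts, word_counts, vocab

def predict(model, tokens):
    """Return the sense with the highest Naive Bayes log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        # Add-one smoothing: denominator is sense token count plus vocabulary size.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][tok] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-tagged example sentences (stand-ins for dictionary examples).
EXAMPLES = [
    ("bank/finance", ["deposit", "money", "account"]),
    ("bank/finance", ["loan", "interest", "money"]),
    ("bank/river", ["river", "water", "shore"]),
    ("bank/river", ["fishing", "river", "mud"]),
]
model = train(EXAMPLES)
print(predict(model, ["money", "loan"]))
print(predict(model, ["river", "water"]))
```

Working in log space avoids numeric underflow when many context words are multiplied together, which matters once sentences are longer than these toy examples.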

A corpus-based study on the effects of voicing and gender on American English Fricatives (성대진동 및 성별이 미국영어 마찰음에 미치는 효과에 관한 코퍼스 기반 연구)

  • Yoon, Tae-Jin
    • Phonetics and Speech Sciences
    • /
    • v.10 no.2
    • /
    • pp.7-14
    • /
    • 2018
  • The paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of voicing in rendering fricatives in American English. The TIMIT database includes 630 talkers and 2,342 different sentences, and comprises more than five hours of speech. Acoustic analyses are conducted in the domain of spectral and temporal properties by treating gender, voicing, and place of articulation as independent factors. The results of the acoustic analyses revealed that acoustic signals interact in a complex way to signal the gender, place, and voicing of fricatives. Classification experiments using a multiclass support vector machine (SVM) revealed that 78.7% of fricatives are correctly classified. The majority of errors stem from the misclassification of /θ/ as [f] and /ʒ/ as [z]. The average accuracy of gender classification is 78.7%. Most errors result from the classification of female speakers as male speakers. The paper contributes to the understanding of the effects of voicing and gender on fricatives in a large-scale speech corpus.

An Analysis on the Vocabulary in the English-Translation Version of Donguibogam Using the Corpus-based Analysis (코퍼스 분석방법을 이용한 『동의보감(東醫寶鑑)』 영역본의 어휘 분석)

  • Jung, Ji-Hun;Kim, Dong-Ryul;Kim, Do-Hoon
    • The Journal of Korean Medical History
    • /
    • v.28 no.2
    • /
    • pp.37-45
    • /
    • 2015
  • Objectives: To conduct a quantitative analysis of the vocabulary in the English translation of Donguibogam. Methods: This study quantitatively analyzed the English-translated texts of Donguibogam using corpus-based analysis and compared the results with a quantitative analysis of the original Donguibogam texts. Results: The corpus analysis of the English translation of Donguibogam found that the total number of words (tokens) was about 1,207,376, the number of word types was about 20,495, and the TTR (type/token ratio) was 1.69. The cumulative coverage of the 1,000 most frequent words was 83.54%, and that of the 2,000 most frequent words was 90.82%. Among the most frequent words, function words such as 'the', 'and', 'of', and 'is' predominated, while content words such as 'radix', 'qi', 'rhizoma', and 'water' also appeared with high frequency. A comparison with the corpus analysis of the original Donguibogam found that the TTR was higher in the English translation than in the original. The composition of high-frequency function words and content words was similar in the English translation and the original. The two versions were also similar in that the 'Remedies' and 'Acupuncture' parts showed a higher proportion of content words than of function words. Conclusions: The vocabulary of the English translation of Donguibogam showed that the book keeps complete sentence forms while also being a Korean medical book. Meanwhile, the English translation had some problems, such as inconsistent vocabulary caused by having several translators and the incomplete transfer of word meanings from the Chinese-character cultural sphere to the English-language cultural sphere, and these problems should be considered when translating old Korean medical books into English.
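
The two statistics reported in this abstract, the type/token ratio and the cumulative coverage of the most frequent words, can be computed as in the following minimal sketch. The function names and the toy token list are assumptions for illustration, and the TTR is expressed as a percentage to match the reported scale. Reading the reported type count as 20,495, the ratio 20,495 / 1,207,376 comes to roughly 1.7%, which is close to the reported 1.69 given that both counts are approximate.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Distinct word types divided by total tokens, as a percentage."""
    return 100 * len(set(tokens)) / len(tokens)

def cumulative_coverage(tokens, top_n):
    """Percentage of all tokens accounted for by the top_n most frequent types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return 100 * covered / len(tokens)

# Hypothetical toy text; a real analysis would tokenize the full translation.
tokens = "the qi of the water and the qi of the radix".split()

print(type_token_ratio(tokens))        # 6 types over 11 tokens
print(cumulative_coverage(tokens, 1))  # share covered by the single most frequent type

# Ratio implied by the counts reported in the abstract.
reported_ttr = 100 * 20495 / 1207376
print(round(reported_ttr, 2))
```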

A Comparative Study on Korean Connective Morpheme '-myenseo' to the Chinese expression - based on Korean-Chinese parallel corpus (한국어 연결어미 '-면서'와 중국어 대응표현의 대조연구 -한·중 병렬 말뭉치를 기반으로)

  • YI, CHAO
    • Cross-Cultural Studies
    • /
    • v.37
    • /
    • pp.309-334
    • /
    • 2014
  • This study uses a Korean-Chinese parallel corpus to contrast the Korean connective morpheme '-myenseo' with its corresponding Chinese expressions. Learners of Korean often struggle with Korean connective morphemes, especially when there is a lexical gap between Korean and their mother tongue. '-myenseo' is one of the most frequently used Korean connective morphemes, and it usually corresponds to a Chinese coordinating conjunction. According to the corpus, however, the Chinese expressions corresponding to '-myenseo' go beyond coordinating conjunctions. Because the parallel corpus reveals a variety of Chinese expressions related to '-myenseo', this study can help Chinese learners of Korean learn '-myenseo' more easily. The study first discusses the semantic features and syntactic characteristics of '-myenseo'. The significant semantic features of '-myenseo' are 'simultaneous' and 'conflict', and the chapter uses usage examples to analyse the specific uses of '-myenseo'. The study then analyses the syntactic characteristics of '-myenseo' in terms of subject constraints, predicate constraints, temporal constraints, mood constraints, and negation constraints, and summarizes them in a table. The most important part of this study is Chapter 4, which contrasts the Korean connective morpheme '-myenseo' with the corresponding Chinese expressions by analysing the Korean-Chinese parallel corpus. As a result of the analysis, the frequency of the Chinese expressions corresponding to '-myenseo' is summarized in a table. The table shows that the most common Chinese counterpart of '-myenseo' is the non-marker pattern. That is, Korean can connect clauses with a connective morpheme, which is an explicit linguistic marker, whereas Chinese often connects clauses through their intrinsic logical relationships. The chapter therefore concludes that '-myenseo' can correspond to Chinese conjunctions, other expressions, non-marker patterns, and free-translation patterns, a wider range than the Chinese conjunctions identified in earlier work. The last chapter, as the conclusion of the study, summarizes the findings and suggests limitations and directions for future research.

A Corpus-based study on the Effects of Gender on Voiceless Fricatives in American English

  • Yoon, Tae-Jin
    • Phonetics and Speech Sciences
    • /
    • v.7 no.1
    • /
    • pp.117-124
    • /
    • 2015
  • This paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of gender in rendering fricatives in American English. The TIMIT database includes 630 talkers and 2,342 different sentences, comprising over five hours of speech. Acoustic analyses are conducted in the domain of spectral and temporal properties by treating gender as an independent factor. The results of the acoustic analyses revealed that most acoustic properties of voiceless sibilants differed between male and female speakers, whereas those of voiceless non-sibilants did not. A classification experiment using linear discriminant analysis (LDA) revealed that 85.73% of voiceless fricatives are correctly classified. The sibilants are 88.61% correctly classified, whereas the non-sibilants are only 57.91% correctly classified. The majority of the errors are from the misclassification of /θ/ as [f]. The average accuracy of gender classification is 77.67%. Most of the inaccuracy results from the classification of female speakers in non-sibilants. The results are accounted for by resorting to biological differences as well as macro-social factors. The paper contributes to the understanding of the role of gender in a large-scale speech corpus.

The pattern of use by gender and age of the discourse markers 'a', 'eo', and 'eum' (담화표지 '아', '어', '음'의 성별과 연령별 사용 양상)

  • Song, Youngsook;Shim, Jisu;Oh, Jeahyuk
    • Phonetics and Speech Sciences
    • /
    • v.12 no.4
    • /
    • pp.37-45
    • /
    • 2020
  • This paper quantitatively calculated the speech frequency and speech duration of the discourse markers 'a', 'eo', and 'eum' using the Seoul Corpus, a spontaneous speech corpus. The sound durations were confirmed with Praat, the Seoul Corpus was analyzed with EmEditor, and the results were presented through statistical analysis with R. Based on the corpus analysis, the study investigated whether a particular factor is preferred by speakers of particular categories. The most prominent finding is that the sound durations of female speakers were longer than those of male speakers when using the discourse marker 'eum' in final position. Among age-related variables, teenagers uttered 'a' more than 'eo' in initial position when compared to people in their 40s. This study is significant because it quantitatively analyzed the discourse markers 'a', 'eo', and 'eum' by gender and age. To continue the discussion, more precise research that considers context should be conducted. In addition, similarities can be found with 'e' and 'ma' in Japanese (Watanabe & Ishi, 2000) and 'uh' and 'um' in English (Gries, 2013). Going forward, a study identifying commonalities and differences through cross-linguistic analysis of discourse can be anticipated.

A Corpus-based English Syntax Academic Word List Building and its Lexical Profile Analysis (코퍼스 기반 영어 통사론 학술 어휘목록 구축 및 어휘 분포 분석)

  • Lee, Hye-Jin;Lee, Je-Young
    • The Journal of the Korea Contents Association
    • /
    • v.21 no.12
    • /
    • pp.132-139
    • /
    • 2021
  • This corpus-driven study described the compilation of the most frequently occurring academic words in the domain of syntax and compared the extracted word list with the Academic Word List (AWL) of Coxhead (2000) and the General Service List (GSL) of West (1953) to examine their distribution and coverage within the syntax corpus. A specialized 546,074-token corpus, composed of widely used must-read syntax textbooks for English education majors, was loaded into and analyzed with AntWordProfiler 1.4.1. Under the parameter of lexical frequency, the analysis identified 288 (50.5%) AWL word forms that appeared 16 times or more, as well as 218 (38.2%) AWL items that occurred no more than 15 times. The analysis also indicated that the AWL and the GSL covered 9.19% and 78.92% of all tokens respectively, so that the GSL and AWL combined accounted for 88.11%. Given that the AWL can be instrumental in serving broad disciplinary needs, this study highlighted the need to compile a domain-specific AWL as a lexical repertoire to promote academic literacy and competence.
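
The coverage figures in this abstract are shares of corpus tokens that belong to a given word list. A minimal sketch of that calculation follows. It is not AntWordProfiler's implementation: the function name `coverage`, the two tiny word lists, and the sample tokens are hypothetical, and the lists are assumed to be disjoint so that their coverages add up, as the reported 9.19% + 78.92% = 88.11% suggests.

```python
def coverage(tokens, word_list):
    """Percentage of tokens whose lowercase form appears in word_list."""
    hits = sum(1 for tok in tokens if tok.lower() in word_list)
    return 100 * hits / len(tokens)

# Hypothetical stand-ins for the GSL and AWL (the real lists are far larger).
GSL = {"the", "of"}
AWL = {"analysis", "structure"}

tokens = "The analysis of the clause structure".split()

gsl_cov = coverage(tokens, GSL)
awl_cov = coverage(tokens, AWL)
combined = coverage(tokens, GSL | AWL)
print(gsl_cov, awl_cov, combined)

# With disjoint lists, the combined coverage is the sum of the two,
# matching the arithmetic reported in the abstract.
reported_combined = 9.19 + 78.92
print(round(reported_combined, 2))
```

Tokens covered by neither list (here, "clause") are the candidates a domain-specific word list would aim to capture.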

Examining Suicide Tendency Social Media Texts by Deep Learning and Topic Modeling Techniques (딥러닝 및 토픽모델링 기법을 활용한 소셜 미디어의 자살 경향 문헌 판별 및 분석)

  • Ko, Young Soo;Lee, Ju Hee;Song, Min
    • Journal of the Korean BIBLIA Society for library and Information Science
    • /
    • v.32 no.3
    • /
    • pp.247-264
    • /
    • 2021
  • This study aims to create a deep learning-based classification model that classifies suicide tendency using a suicide corpus constructed for the present study. In addition, to analyze suicide factors, the study divided the suicide tendency corpus into detailed topics using topic modeling, an analysis technique that automatically extracts topics. For this purpose, 2,011 documents of the suicide-related corpus collected from the social media service Naver Knowledge iN were manually annotated as suicide-tendency or non-suicide-tendency documents based on the suicide prevention education manual issued by the Central Suicide Prevention Center, and the performance of deep learning classification models (LSTM, BERT, ELECTRA) was evaluated using the annotated corpus. In addition, LDA, a topic modeling technique, was used to identify suicide factors by grouping documents by theme, and co-word analysis and visualization were conducted to analyze the factors in depth.
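
The abstract does not specify how the co-word analysis was carried out. One common reading is counting how often two terms occur in the same document, which the minimal sketch below illustrates under that assumption. The function name `coword_counts` and the placeholder terms are hypothetical, and the counts would normally feed a network visualization.

```python
from collections import Counter
from itertools import combinations

def coword_counts(documents):
    """Count, for each unordered term pair, the documents containing both terms."""
    pairs = Counter()
    for tokens in documents:
        # Deduplicate and sort so each pair is counted once per document
        # and always stored in the same order.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical tokenized documents with placeholder terms.
docs = [
    ["alpha", "beta", "gamma"],
    ["beta", "gamma"],
    ["alpha", "beta", "beta"],
]

pairs = coword_counts(docs)
for pair, count in pairs.most_common():
    print(pair, count)
```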