• Title/Summary/Keyword: Corpus-based Study

Data Mining Research on Maehwado Painting Poetry in the Early Joseon Dynasty

  • Haeyoung Park;Younghoon An
    • Journal of Information Processing Systems / v.19 no.4 / pp.474-482 / 2023
  • Data mining is a technique for extracting valuable information from large volumes of data by analyzing statistical and mathematical operations, rules, and relationships. In this study, we applied data mining to the painting poetry of Maehwado (plum blossom paintings) from the early Joseon Dynasty. The data were extracted from the Hanguk Munjip Chonggan (Korean Literary Collections in Classical Chinese) in the Hanguk Gojeon Jonghap database (Korea Classics DB). Using computer information processing techniques, we scraped the painting poetry from the Hanguk Munjip Chonggan and classified it. We then narrowed our focus to the painting poetry specifically related to Maehwado in the early Joseon Dynasty. Based on this refined dataset, we conducted an in-depth analysis and interpretation of the text data at the syllable corpus level. As a result, we found a direct correlation between the corpus statistics for each syllable in Maehwado painting poetry and the symbolic meaning of plum blossoms.
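
As a rough illustration of what syllable-level corpus statistics could look like, the sketch below counts how often each character occurs across a set of poems. It assumes that one Classical Chinese character corresponds to one syllable; the function name `syllable_counts` and the sample lines are invented for illustration and are not taken from the study or from the Hanguk Munjip Chonggan.

```python
from collections import Counter

def syllable_counts(poems):
    """Count each Chinese character (assumed: one character = one syllable)."""
    counts = Counter()
    for poem in poems:
        # Keep only CJK unified ideographs; punctuation and whitespace are skipped.
        counts.update(ch for ch in poem if "\u4e00" <= ch <= "\u9fff")
    return counts

# Invented sample lines for illustration only; not real corpus data.
sample_poems = ["梅花雪中開。", "寒梅一枝春", "雪月照梅花"]

counts = syllable_counts(sample_poems)
top = counts.most_common(3)  # the three most frequent syllables with their counts
print(top)
```

A frequency table like this is only a starting point; relating the counts to the symbolism of plum blossoms is an interpretive step that the code does not attempt.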

A Corpus-based study on the Effects of Gender on Voiceless Fricatives in American English

  • Yoon, Tae-Jin
    • Phonetics and Speech Sciences / v.7 no.1 / pp.117-124 / 2015
  • This paper investigates the acoustic characteristics of English fricatives in the TIMIT corpus, with a special focus on the role of gender in rendering fricatives in American English. The TIMIT database includes 630 talkers and 2342 different sentences, comprising over five hours of speech. Acoustic analyses are conducted on spectral and temporal properties, treating gender as an independent factor. The analyses revealed that most acoustic properties of voiceless sibilants differed between male and female speakers, whereas those of voiceless non-sibilants did not. A classification experiment using linear discriminant analysis (LDA) revealed that 85.73% of voiceless fricatives are correctly classified. The sibilants are 88.61% correctly classified, whereas the non-sibilants are only 57.91% correctly classified. The majority of the errors come from the misclassification of /θ/ as [f]. The average accuracy of gender classification is 77.67%. Most of the errors come from the classification of female speakers in non-sibilants. The results are explained in terms of biological differences as well as macro-social factors. The paper contributes to the understanding of the role of gender in a large-scale speech corpus.

Translating English By-Phrase Passives into Korean: A Parallel Corpus Analysis (영한 병렬 코퍼스에 나타난 영어 수동문의 한국어 번역)

  • Lee, Seung-Ah
    • Journal of English Language & Literature / v.56 no.5 / pp.871-905 / 2010
  • This paper is motivated by Watanabe's (2001) observation that English by-phrase passives are sometimes translated into Japanese object topicalization constructions. That is, the original English sentence in the passive may be translated into the active voice with the logical object topicalized. A number of scholars, including Chomsky (1981) and Baker (1992), have remarked that languages have various ways to avoid focusing on the logical subject. The aim of the present study is to examine the translation equivalents of English by-phrase passives in an English-Korean parallel corpus compiled by the author. A small sample of articles from Newsweek magazine and its published Korean translation reveals that there are indeed many ways to translate English by-phrase passives, including object topicalization (12.5%). Among the 64 translated sentences analyzed and classified, 12 examples (18.8%) were problematic in terms of agent defocusing, which is the primary function of passives. Of these 12 instances, five were identified where an alternative translation would be more suitable. The results suggest that the functional characteristics of English by-phrase passives should be highlighted in translator training as well as language teaching.

KONG-DB: Korean Novel Geo-name DB & Search and Visualization System Using Dictionary from the Web (KONG-DB: 웹 상의 어휘 사전을 활용한 한국 소설 지명 DB, 검색 및 시각화 시스템)

  • Park, Sung Hee
    • Journal of the Korean Society for Information Management / v.33 no.3 / pp.321-343 / 2016
  • This study aimed to design a semi-automatic, web-based pilot system 1) to build a Korean novel geo-name database, 2) to keep the database scalable by updating it through automatic geo-name extraction, and 3) to retrieve and visualize the usage of old geo-names on a map. In particular, the problem of extracting novel geo-names that are now obsolete is difficult to solve because obtaining a corpus for use as a training dataset is burdensome. To build a corpus for training data, an admin tool, an HTML crawler and parser written in Python, crawled geo-names and their usages from a vocabulary dictionary for the Korean New Novel, collecting enough to train a named entity tagger that can extract even novel geo-names that do not appear in the training corpus. By means of the training corpus and the automatic extraction tool, the geo-name database was made scalable. In addition, the system can visualize geo-names on a map. The study also designed and implemented the prototype and empirically verified the validity of the pilot system. Lastly, items for improvement are addressed.
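
The abstract mentions an HTML parser in Python that collects geo-names and their usages from a web dictionary. A minimal sketch of that idea, using only the standard-library `html.parser` module, might look like the following. The `<dt>`/`<dd>` page layout, the class name `EntryParser`, and the sample entries are assumptions made for illustration; the actual dictionary markup and the authors' tool are not described in the abstract.

```python
from html.parser import HTMLParser

class EntryParser(HTMLParser):
    """Collect (headword, usage) pairs from <dt>/<dd> pairs (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.entries = []      # collected (geo-name, usage sentence) pairs
        self._tag = None       # tag currently being read, if it is dt or dd
        self._headword = None  # most recent <dt> text awaiting its <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "dt":
            self._headword = text
        elif self._tag == "dd" and self._headword is not None:
            self.entries.append((self._headword, text))
            self._headword = None

# Hypothetical page fragment; a real crawler would fetch this over HTTP.
page = (
    "<dl><dt>Hanseong</dt><dd>He left for Hanseong at dawn.</dd>"
    "<dt>Songdo</dt><dd>A merchant from Songdo arrived.</dd></dl>"
)

parser = EntryParser()
parser.feed(page)
print(parser.entries)
```

Pairs collected this way could then be turned into labeled sentences for training a named entity tagger, which is a separate step not shown here.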

A Study on the Markup Scheme for Building the Corpora of Korean Culinary Manuscripts (한글 필사본 음식조리서 말뭉치 구축을 위한 마크업 방안 연구)

  • An, Ui-Jeong;Park, Jin-Yang;Nam, Gil-Im
    • Language and Information / v.12 no.2 / pp.95-114 / 2008
  • This study aims to establish a markup system for 17th- to 19th-century culinary manuscripts. To this end, Section 2 looks into various theoretical considerations regarding the encoding of large-scale historical corpora. Section 3 identifies and analyzes the thematic and structural characteristics of our source texts. Section 4 proposes a markup scheme based on the XML standard, covering bibliographical and structural markup for the corpus as well as grammatical annotation. We show that an XML-based markup system is highly desirable because it is scalable and extremely powerful and flexible in its expressiveness. The markup scheme we suggest is a modified and extended version of TEI-P5 that accommodates the textual and linguistic characteristics of premodern Korean culinary manuscripts.

An Establishment of Entrepreneurship Ontology through Analysis of Intellectual Structure in Entrepreneurship Research (창업학 지식구조 분석결과를 활용한 창업 온톨로지 구축)

  • Shimi, Jaehu;Choi, Myeonggil
    • Journal of Information Technology Applications and Management / v.20 no.2 / pp.161-176 / 2013
  • The outcomes of entrepreneurship studies are intended to help entrepreneurs in the start-up stage, but they are not fully utilized to guide the activities of entrepreneurs in start-up businesses. To utilize the outcomes of entrepreneurship research to help entrepreneurs effectively, an entrepreneurship ontology, a systemized specification of the knowledge in entrepreneurship research, has to be established. Based on the entrepreneurship ontology, the knowledge of entrepreneurial processes can be illustrated, and a diagnosis and coaching system for entrepreneurs can be built effectively. To establish an entrepreneurship ontology, this study investigates the intellectual structure of entrepreneurship studies by analyzing the contents of top journals in the entrepreneurship field, and identifies the relationships among the key concepts through bibliometric analyses based on an entrepreneurship corpus. This study suggests a method of establishing an entrepreneurship ontology and of utilizing it. Through utilization of the entrepreneurship ontology, entrepreneurial processes are expected to be explained effectively and the rate of business success improved.

A Study to Rethink the Components of Teaching Korean Genitive Particle '의': Based on the Errors in Korean Learners' Corpus (한국어 학습자 대상 관형격 조사 '의'의 교육 내용 재고: 학습자 말뭉치에 나타난 오류를 바탕으로)

  • Soo-Hyun Lee;Ji-Young Sim
    • Journal of the Korean Society of Industry Convergence / v.26 no.3 / pp.443-454 / 2023
  • The purpose of this study is to reveal Korean learners' usage patterns of the genitive particle '의' according to semantic classification, so that the findings can be referred to in determining the contents and methods of related education. The study adopts a quantitative analysis using the learner corpus established by the National Institute of Korean Language. The analysis shows that as proficiency increases, both the overall frequency of '의' and the number of senses used increase. However, the frequency of errors increases as well. As for the usage pattern of each sense, the meaning of 'ownership, belonging' is the most frequent, followed by 'acting entity', 'kinship, social relations', and 'relationship(area)'. In conclusion, the meanings of 'acting subjects' and 'relationships(area)' need to be supplemented with explicit education. The other meanings need further discussion, and decisions should be made in consideration of learning purpose and proficiency.

A study on performance improvement considering the balance between corpus in Neural Machine Translation (인공신경망 기계번역에서 말뭉치 간의 균형성을 고려한 성능 향상 연구)

  • Park, Chanjun;Park, Kinam;Moon, Hyeonseok;Eo, Sugyeong;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.12 no.5 / pp.23-29 / 2021
  • Recent deep learning-based natural language processing studies seek to improve performance by training on large amounts of data from various sources together. However, combining data from various sources into a single training set may hinder performance improvement. In machine translation, data deviation arises from differences in translation (liberal, literal), style (colloquial, written, formal, etc.), domain, and so on. Combining these corpora into one for training can adversely affect performance. In this paper, we propose a new Corpus Weight Balance (CWB) method that considers the balance between parallel corpora in machine translation. In our experiment, the model trained with the balanced corpus showed better performance than the existing model. In addition, we propose an additional corpus construction process that enables coexistence with the human translation market and can build a high-quality parallel corpus even from a monolingual corpus.

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.13 no.3 / pp.31-37 / 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a pointy distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or the mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize the acoustic properties of f0 variability in a language by an individual or groups. Further studies on the use of statistical measures to secure reliable f0 values for individual speakers would be desirable.
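
The descriptive measures named in this abstract (range, median, mode, and the direction of skew) can be sketched in a few lines. The study itself used Praat and R; the Python version below is a substitute for illustration only. The function name `f0_summary` and the sample values are invented, and the skewness is the simple moment-based estimate, which may differ from whatever estimator the authors used.

```python
import statistics

def f0_summary(f0_hz):
    """Summarize f0 samples in Hz: range, median, mode, and moment-based skewness."""
    n = len(f0_hz)
    mean = statistics.fmean(f0_hz)
    sd = statistics.pstdev(f0_hz)  # population standard deviation
    # Third standardized moment: positive values indicate a right-skewed distribution.
    skewness = sum((x - mean) ** 3 for x in f0_hz) / (n * sd ** 3)
    return {
        "min": min(f0_hz),
        "max": max(f0_hz),
        "range": max(f0_hz) - min(f0_hz),
        "median": statistics.median(f0_hz),
        "mode": statistics.mode(f0_hz),
        "skewness": skewness,
    }

# Invented f0 values (Hz) for illustration only; not data from the corpus.
sample_f0 = [98, 104, 110, 110, 110, 114, 121, 135, 160, 220]

summary = f0_summary(sample_f0)
print(summary)
```

With a long tail of high values, as in this invented sample, the skewness comes out positive and the median sits slightly above the mode, which is the general shape the abstract describes.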