• Title/Summary/Keyword: Corpus-based Study


Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling (언어 자원과 토픽 모델의 순차 매칭을 이용한 유사 문장 계산 기반의 위키피디아 한국어-영어 병렬 말뭉치 구축)

  • Cheon, JuRyong;Ko, YoungJoong
    • Journal of KIISE / v.42 no.7 / pp.901-909 / 2015
  • This paper proposes a method for building a Korean-English parallel corpus from Wikipedia by finding similar sentences using language resources and topic modeling. We first applied language resources (a Wiki-dictionary, numbers, and the Daum online dictionary) to match words sequentially. We constructed the Wiki-dictionary from Wikipedia titles and, to take advantage of Wikipedia, used translation probabilities in the Wiki-dictionary for word matching. In addition, we improved the accuracy of the sentence similarity measure by using word distributions from topic modeling. In the experiments, a previous study achieved an F1-score of 48.4% using only language resources combined linearly, and 51.6% when topic modeling over the entire word distribution was added. In contrast, our sequential matching method, which adds translation probability to the language resources, achieved 58.3%, 9.9 percentage points higher than the previous study. When sequential matching of language resources was combined with topic modeling that considers important word distributions, the proposed system achieved 59.1%, 7.5 percentage points higher than the previous study.
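
The sequential matching idea in the abstract above can be illustrated with a minimal Python sketch. It assumes each language resource is a lookup function returning candidate English words, and that resources are tried in a fixed order with the first hit winning. The function names, the toy dictionary entries, and the similarity score (fraction of matched source words) are all hypothetical; the paper's translation probabilities and topic-model features are not modeled here.

```python
def match_source(word, target_words, resources):
    """Return the name of the first resource linking `word` to the target sentence."""
    for name, lookup in resources:
        if lookup(word) & target_words:
            return name
    return None


def sentence_similarity(source_tokens, target_tokens, resources):
    """Fraction of source tokens matched to the target sentence by any resource."""
    if not source_tokens:
        return 0.0
    target_words = set(target_tokens)
    matched = sum(
        1 for w in source_tokens if match_source(w, target_words, resources)
    )
    return matched / len(source_tokens)


# Toy resources (hypothetical entries), tried in this order.
wiki_dict = {"서울": {"seoul"}}
online_dict = {"도시": {"city"}}
resources = [
    ("wiki_dictionary", lambda w: wiki_dict.get(w, set())),
    ("numbers", lambda w: {w} if w.isdigit() else set()),
    ("online_dictionary", lambda w: online_dict.get(w, set())),
]

score = sentence_similarity(
    ["서울", "2015", "도시", "역사"], ["seoul", "city", "2015"], resources
)
print(score)
```

Trying resources in order means a word found in the first resource is never looked up in the later ones, which is one plausible reading of "sequential" in the abstract.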

Gender-Based Differences in Expository Language Use: A Corpus Study of Japanese

  • Heffernan, Kevin;Nishino, Keiko
    • Asia Pacific Journal of Corpus Research / v.1 no.2 / pp.1-14 / 2020
  • Previous work has shown that men both explain and value the act of explaining more than women, as explaining conveys expertise. However, previous studies are limited to English. We conducted an exploratory study to see if similar patterns are seen amongst Japanese speakers. We examined three registers of Japanese: conversational interviews, simulated speeches, and academic presentations. For each text, we calculated two measures: lexical density and the percentage of the text written in kanji. Both are indicators of expository language. Men produced significantly higher scores for the interviews and speeches. However, the results for the presentations depend on age and academic field. In fields in which women are the minority, women produce higher scores. In the field in which men are the minority, younger men produced higher scores but older men produced lower scores than women of the same age. Our results show that in academic contexts, the explainers are not necessarily men but rather the gender minority. We argue that such speakers are under social pressure to present themselves as experts. These results show that the generalization that men tend to explain more than women does not always hold true, and we urge more academic work on expository language.
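
The two measures named in the abstract above, lexical density and the percentage of text written in kanji, can be sketched in Python as follows. This is an illustration under stated assumptions: the input is already tokenized and part-of-speech tagged, the set of content-word tags (`CONTENT_POS`) is hypothetical, and "kanji" is approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The study's actual tag set and counting rules are not given in the abstract.

```python
# Hypothetical content-word tags; the study's own tag set is not specified.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Share of tokens whose part of speech is a content-word class."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, pos in tagged_tokens if pos in CONTENT_POS)
    return content / len(tagged_tokens)


def kanji_percentage(text):
    """Percentage of non-whitespace characters in the basic CJK ideograph block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kanji = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return 100.0 * kanji / len(chars)


tagged = [
    ("私", "PRON"), ("は", "ADP"), ("研究", "NOUN"),
    ("を", "ADP"), ("説明", "NOUN"), ("する", "VERB"),
]
print(lexical_density(tagged))                 # 3 content words out of 6 tokens
print(kanji_percentage("私は研究を説明する"))    # 5 kanji out of 9 characters
```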

A Study on Genetic Analysis and Extract Cytotoxicity of Scolopendra subspinipes mutilans L. Koch (노랑머리왕지네의 유전학적(遺傳學的) 분석(分析) 및 약침액(藥鍼液)의 세포독성(細胞毒性)에 관한 연구(硏究))

  • Kim, Sung-Nam;Lim, Jeong-A;Lee, Sung-Yong;Hwang, Woo-Jun;Lee, Geon-Mok;Cho, Nam-Geun;Seo, Jung-Chul;Moon, Hyung-Cheol;Kim, Sung-Chul
    • Journal of Pharmacopuncture / v.9 no.2 / pp.49-65 / 2006
  • Objective : The purpose of this study is to investigate the nucleotide sequence and extract cytotoxicity of Scolopendrae corpus. The nature and taste of Scolopendrae corpus are hot, warm, and toxic; it dispels wind, relieves spasms, and detoxifies, so it has been used for C.V.A, facial palsy, sensory disorders of the extremities, wounds, and arthritis. Methods : Specimens of Scolopendrae corpus were collected from the market by locality and classified morphologically. Their nucleotide sequences were determined and compared. In addition, the cytotoxicity of their water-alcohol extract was studied with an MTT-based cytotoxicity assay. Results : Specimens from the different localities were genetically almost identical and were identified as Scolopendra subspinipes mutilans L. Koch. The nucleotide sequence of Scolopendra subspinipes mutilans L. Koch obtained in this study will help to distinguish it from other species of Scolopendrae corpus. The water-alcohol extract of Scolopendra subspinipes mutilans L. Koch did not induce cytotoxicity in Hep G2 cells, L929 cells, or peritoneal macrophages, nor did it influence nitrite production by peritoneal macrophages. These results can be used as basic data for genetic discrimination from other species of Scolopendrae corpus.

A Corpus-Based Study on Korean EFL Learners' Use of English Logical Connectors

  • Ha, Myung-Jeong
    • International Journal of Contents / v.10 no.4 / pp.48-52 / 2014
  • The purpose of this study was to examine 30 logical connectors in the essay writing of Korean university students for comparison with the use in similar types of native English writing. The main questions addressed were as follows: Do Korean EFL students tend to over- or underuse logical connectors? What types of connectors differentiate Korean learners from native use? To answer these questions, EFL learner data were compared with data from native speakers using computerized corpora and linguistic software tools to speed up the initial stage of the linguistic analysis. The analysis revealed that Korean EFL learners tend to overuse logical connectors in the initial position of the sentence, and that they tend to overuse additive connectors such as 'moreover', 'besides', and 'furthermore', whereas they underuse contrastive connectors such as 'yet' and 'instead'. On the basis of the results of this study, some pedagogical implications are made concerning the need for teaching of the semantic, stylistic, and syntactic behavior of logical connectors.
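
Over- and underuse of connectors is typically judged by comparing frequencies normalized for corpus size. The Python sketch below shows one way to do that, assuming single-word connectors, a simple regex tokenizer, and normalization per 10,000 tokens. These choices and the function names are illustrative and are not taken from the study, which does not describe its software in the abstract.

```python
import re
from collections import Counter


def per_10k(text, connectors):
    """Frequency of each connector per 10,000 tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in connectors)
    scale = 10_000 / len(tokens) if tokens else 0.0
    return {c: counts[c] * scale for c in connectors}


def usage_difference(learner_text, native_text, connectors):
    """Learner minus native rate; positive suggests overuse, negative underuse."""
    learner = per_10k(learner_text, connectors)
    native = per_10k(native_text, connectors)
    return {c: learner[c] - native[c] for c in connectors}


# Toy texts, invented for illustration only.
diff = usage_difference(
    "Moreover it rains. Moreover we stay.",
    "Yet it rains and we stay.",
    ["moreover", "yet"],
)
print(diff)
```

A real comparison would also need a significance test (for example log-likelihood) before calling a difference overuse or underuse.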

An Analysis of the Intellectual Structure of Venture-Creation Studies to build an Entrepreneurship Ontology (창업 온톨로지 구축을 위한 벤처창업 연구의 지식구조 분석)

  • Sim, Jae-Hu;Choi, Myeonggil
    • Knowledge Management Research / v.14 no.4 / pp.75-86 / 2013
  • Interest in and research on entrepreneurship, which is considered a potential remedy for the continuing economic recession of the 21st century, have grown. However, the processes and methodologies of this research have not been systematically organized, and its results have rarely been applied to raising the success rate of new businesses. This study adopts a corpus methodology to analyze the knowledge structure of entrepreneurship research and to derive the essential concepts and constituent domains of venture research. Based on the results of the analysis, this study constructs the knowledge structure of venture research in the form of a knowledge ontology. The results could serve as a foundation for entrepreneurship research and inform the construction of an entrepreneurship knowledge ontology.

A Corpus-based Study of the Truth-related Words in Korean Used as Discourse Markers (한국어에 나타나는 '진실' 표현 어휘의 담화표지 기능 연구)

  • Kim, Taeho;Jeong, Seon-yeong
    • Cross-Cultural Studies / v.29 / pp.453-477 / 2012
  • This study investigates how truth-related words in Korean, which were originally nouns or adverbs with 'truth'-related meanings, can be used as discourse markers with functions such as 'emphatic marker', 'attention getter', or 'hesitation marker', and it argues that these functions are the result of a grammaticalization process. That is, the truth-related words have acquired new functions as discourse markers from their corresponding lexical uses as nouns or adverbs through grammaticalization. In this study, we demonstrate that the truth-related words tend to appear sentence-initially or sentence-medially when they are used as discourse markers. We also show that they are most likely to be used as emphatic markers because of their lexical meaning. Finally, we show that truth-related words differ from one another in where they appear and in what function they serve.

A Study of the Automatic Extraction of Hypernyms and Hyponyms from the Corpus (코퍼스를 이용한 상하위어 추출 연구)

  • Pang, Chan-Seong;Lee, Hae-Yun
    • Korean Journal of Cognitive Science / v.19 no.2 / pp.143-161 / 2008
  • The goal of this paper is to extract hyponymy relations between words in a corpus. Adopting the basic algorithm of Hearst (1992), I propose a method for pattern-based extraction of semantic relations from the corpus. To this end, I compiled a list of hypernym-hyponym pairs from the Sejong Electronic Dictionary and supplemented it with the superordinate-subordinate terms of CoreNet. Then, I extracted all sentences from the corpus that include hypernym-hyponym pairs from the list. From these, I collected the sentences containing meaningful constructions that occur systematically in the corpus, which yielded 21 generalized patterns. Using a Perl program, I collected sentences matching each of the 21 patterns; 57% of these sentences turned out to express a hyponymy relation. The proposed method is simpler and more advanced than that of Cederberg and Widdows (2003), in that using a wordnet or an electronic dictionary is generally considered efficient for information retrieval. The extracted patterns are helpful when looking for appropriate documents during information retrieval, and they can be used to expand concept networks such as ontologies or thesauri. However, because Korean word order is relatively free, it is difficult to capture the various expressions of a fixed pattern. In the future, we should investigate semantic relations beyond hyponymy so that we can extract a wider variety of patterns from the corpus.
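
Pattern-based extraction in the style of Hearst (1992) can be illustrated with a short Python sketch. The abstract does not list the 21 Korean patterns, so the example below uses the well-known English pattern "X such as Y1, Y2, and Y3" as a stand-in; the regular expression, the single-word restriction on hypernyms and hyponyms, and the function name are assumptions made for illustration.

```python
import re

# One English Hearst-style pattern: "<hypernym> such as <list of hyponyms>".
# Each list item is a single word; items are joined by ", ", " and ", or " or ".
SUCH_AS = re.compile(
    r"(\w+),? such as ((?:\w+(?:, (?:and |or )?| and | or ))*\w+)"
)


def extract_pairs(sentence):
    """Return (hypernym, hyponym) pairs matched by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(
            r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", match.group(2)
        )
        pairs.extend((hypernym, h) for h in hyponyms if h)
    return pairs


print(extract_pairs("He studies insects such as ants, bees, and wasps."))
```

A surface pattern like this also matches sentences with no real hyponymy relation, which is consistent with the abstract's report that only part of the collected sentences expressed one.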

An Analysis on the Vocabulary in the English-Translation Version of Donguibogam Using the Corpus-based Analysis (코퍼스 분석방법을 이용한 『동의보감(東醫寶鑑)』 영역본의 어휘 분석)

  • Jung, Ji-Hun;Kim, Dong-Ryul;Kim, Do-Hoon
    • The Journal of Korean Medical History / v.28 no.2 / pp.37-45 / 2015
  • Objectives : To conduct a quantitative analysis of the vocabulary in the English translation of Donguibogam. Methods : This study quantitatively analyzed the English-translated text of Donguibogam using corpus-based analysis and compared the results with those from an analysis of the original text. Results : The corpus analysis of the English translation found about 1,207,376 total words (tokens) and about 20,495 word types, giving a TTR (type/token ratio) of 1.69. The cumulative coverage of the 1,000 most frequent words was 83.54%, and that of the 2,000 most frequent words was 90.82%. The most frequent words were mainly function words such as 'the', 'and', 'of', and 'is'; among content words, 'radix', 'qi', 'rhizoma', and 'water' appeared with high frequency. Comparison with the corpus analysis of the original Donguibogam showed that the TTR was higher in the English translation than in the original. The composition of high-frequency function words and content words was similar in the two versions. The two versions were also alike in that the 'Remedies' and 'Acupuncture' sections showed a higher proportion of content words than function words. Conclusions : The vocabulary of the English translation showed that the book is written in complete sentences while also being a Korean medical book. Meanwhile, the English translation had problems such as inconsistent vocabulary arising from multiple translators and the incomplete transfer of word meanings from the Chinese-character cultural sphere to the English-speaking one; these issues should be considered when translating old Korean medical books into English.
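
The two corpus statistics reported in the abstract above, the type/token ratio and the cumulative coverage of the most frequent words, can be computed as in the following Python sketch. It assumes the text is already tokenized and that both values are expressed as percentages; the function names and the toy token list are illustrative only. Reading the reported type count as 20,495, the ratio 20,495 / 1,207,376 comes to about 1.70%, which is close to the reported 1.69.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word types as a percentage of total tokens."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def coverage_of_top(tokens, n):
    """Percentage of all tokens accounted for by the n most frequent types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    top_total = sum(count for _, count in counts.most_common(n))
    return 100.0 * top_total / len(tokens)


# Toy token list, invented for illustration.
tokens = "the qi of the water and the qi".split()
print(type_token_ratio(tokens))      # 5 types over 8 tokens
print(coverage_of_top(tokens, 2))    # 'the' (3) + 'qi' (2) over 8 tokens

# Ratio implied by the figures in the abstract.
reported_ttr = 100.0 * 20495 / 1207376
print(round(reported_ttr, 2))
```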

Cross-Lingual Style-Based Title Generation Using Multiple Adapters (다중 어댑터를 이용한 교차 언어 및 스타일 기반의 제목 생성)

  • Yo-Han Park;Yong-Seok Choi;Kong Joo Lee
    • KIPS Transactions on Software and Data Engineering / v.12 no.8 / pp.341-354 / 2023
  • The title of a document is a brief summary of the document. Readers can understand a document more easily if we provide its title in their preferred style and language. In this research, we propose a cross-lingual and style-based title generation model using multiple adapters. To train the model, we need a parallel corpus in several languages with different styles. It is quite difficult to construct this kind of parallel corpus; however, a monolingual title generation corpus of the same style can be built easily. Therefore, we apply a zero-shot strategy to generate a title in a different language and with a different style for an input document. The baseline model is a Transformer consisting of an encoder and a decoder, pre-trained on several languages. The model is then equipped with multiple adapters for translation, languages, and styles. After the model learns a translation task from a parallel corpus, it learns a title generation task from a monolingual title generation corpus. When training the model on a task, we activate only the adapter that corresponds to that task. When generating a cross-lingual and style-based title, we activate only the adapters that correspond to the target language and the target style. Experimental results show that our proposed model performs only as well as a pipeline model that first translates into a target language and then generates a title. There have been significant changes in natural language generation due to the emergence of large-scale language models. However, research to improve the performance of natural language generation using limited resources and limited data needs to continue. In this regard, this study seeks to explore the significance of such research.

Frequency of grammar items for Korean substitution of /u/ for /o/ in the word-final position (어말 위치 /ㅗ/의 /ㅜ/ 대체 현상에 대한 문법 항목별 출현빈도 연구)

  • Yoon, Eunkyung
    • Phonetics and Speech Sciences / v.12 no.1 / pp.33-42 / 2020
  • This study examined the substitution of /u/ for /o/ (e.g., pyəllo [pyəllu]) in Korean as a function of grammar item, based on a speech corpus. Korean /o/ and /u/ share the vowel feature [+rounded] but are distinguished by tongue height. However, researchers have reported that a merger of Korean /o/ and /u/ is in progress, making them indistinguishable. Thus, in this study, the frequency with which underlying /o/ was realized phonetically as /u/ was calculated for each grammar item in The Korean Corpus of Spontaneous Speech (Seoul Corpus 2015), a large corpus from a total of 40 speakers from Seoul or Gyeonggi-do. The results confirmed that word-final /o/ in linking endings, particles, and adverbs was replaced by /u/ in approximately 50% of cases, whereas in nominal items it was replaced less than 5% of the time. Among high-frequency items, substitution rates were high for the special particle "-do[du]" (59.6%) and the linking ending "-go[gu]" (43.5%). Observing Korean pronunciation in real life provides deep insight into its theoretical implications in terms of speech recognition.