• Title/Abstract/Keywords: Corpus-based Study


A Study in Design and Construction of Structured Documents for Dialogue Corpus (대화형 코퍼스의 설계 및 구조적 문서화에 관한 연구)

  • Kang Chang-Qui; Nam Myung-Woo; Yang Ok-Yul
    • The Journal of the Korea Contents Association / v.4 no.4 / pp.1-10 / 2004
  • Dialogue speech corpora that contain sufficient dialogue speech features are needed to assess the performance of a spoken language dialogue system. In addition, the labeling information of dialogue speech corpora plays an important role in improving the recognition rate of acoustic and language models. In this paper, we examine methods for structuring the labeling information of dialogue speech corpora. More specifically, we examine how to represent features of dialogue speech in an XML-based structured document and how to design a repository system for that information.
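
    A minimal sketch of what an XML-based structured document for dialogue labels might look like, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`dialogue`, `utterance`, `speaker`, `start`, `end`, `act`, `transcript`) and the sample turns are assumptions made for illustration; the abstract does not give the paper's actual schema.

```python
import xml.etree.ElementTree as ET


def build_dialogue(dialogue_id, turns):
    """Build a <dialogue> element from (speaker, start, end, act, text) tuples.

    The schema is hypothetical: one <utterance> per turn, with timing and
    speaker stored as attributes and the labels stored as child elements.
    """
    root = ET.Element("dialogue", id=dialogue_id)
    for i, (speaker, start, end, act, text) in enumerate(turns, start=1):
        utt = ET.SubElement(
            root, "utterance",
            id=f"u{i}", speaker=speaker,
            start=f"{start:.2f}", end=f"{end:.2f}",
        )
        ET.SubElement(utt, "act").text = act
        ET.SubElement(utt, "transcript").text = text
    return root


# Invented sample turns, for illustration only.
turns = [
    ("A", 0.00, 1.85, "request", "I would like to book a room."),
    ("B", 2.10, 3.40, "question", "For which date?"),
]
doc = build_dialogue("d001", turns)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)

# Structured labels can then be queried, e.g. all utterances by speaker B.
b_utts = doc.findall("./utterance[@speaker='B']")
```

    Keeping timing and speaker as attributes and labels as child elements is only one possible layout; a repository system would additionally need a storage and indexing layer that this sketch does not cover.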


Text Classification on Social Network Platforms Based on Deep Learning Models

  • YA, Chen; Tan, Juan; Hoekyung, Jung
    • Journal of information and communication convergence engineering / v.21 no.1 / pp.9-16 / 2023
  • Natural language on social network platforms has a front-to-back dependency in its structure, and directly converting Chinese text into vectors produces very high dimensionality, which lowers the accuracy of existing text classification methods. To this end, this study establishes a deep learning model that combines a big data ultra-deep convolutional neural network (UDCNN) and a long short-term memory network (LSTM). The deep structure of the UDCNN is used to extract the features of text vector classification. The LSTM stores historical information to extract the context dependency of long texts, and word embedding is introduced to convert the text into low-dimensional vectors. Experiments are conducted on the social network platforms Sogou corpus and the University HowNet Chinese corpus. The results show that, compared with CNN + rand, LSTM, and other models, the hybrid deep learning model can effectively improve the accuracy of text classification.

English Hedge Expressions and Korean Endings: Grammar Explanation for English-Speaking Leaners of Korean (영어 완화 표지와 한국어 종결어미 비교 - 영어권 학습자를 위한 문법 설명 -)

  • Kim, Young A
    • Journal of Korean language education / v.25 no.1 / pp.1-27 / 2014
  • This study investigates how common English hedge expressions such as 'I think' and 'I guess' appear in Korean, with the aim of providing explicit explanations for English-speaking learners of Korean. Based on a contrastive analysis of spoken English and Korean corpora, this study argues three points. Firstly, 'I guess' appears with a wider variety of modalities in Korean than 'I think'. Secondly, Korean textbooks use inappropriate registers in their English translations of '-geot -gat-': although these markers are used in spoken Korean, they were translated into written English. Therefore, this study suggests that '-geot -gat-' be translated as 'I think' in spoken English, and as 'it seems' in written English and narratives. Lastly, the contrastive analysis shows that when 'I think' is used with deontic modalities, as in 'I think I have to', Korean uses '-a-ya-get-': combining the hedge marker 'I think' with 'I have to', which expresses obligation or the speaker's volition, turns the deontic modality into an expression of the speaker's opinion.

Using Corpora for the Study of Word-Formation: A Case Study in English Negative Prefixation

  • Kwon, Heok-Seung
    • Korean Journal of English Language and Linguistics / v.1 no.3 / pp.369-386 / 2001
  • This paper shows that traditional approaches to the derivation of different negative words have been essentially hypothetical, based on either linguists' intuitions or rather scant evidence, and that native-speaker dictionary entries show meaning potentials (rather than meanings), which are in fact linguistic and cognitive prototypes. The purpose of this paper is to demonstrate that a large corpus of natural language can provide better answers to questions about word-formation (with particular reference to negative prefixation) than any other source of information.


Radiologic Determination of Corpus Callosum Injury in Patients with Mild Traumatic Brain Injury and Associated Clinical Characteristics

  • Kim, Dong Shin; Choi, Hyuk Jai; Yang, Jin Seo; Cho, Yong Jun; Kang, Suk Hyung
    • Journal of Korean Neurosurgical Society / v.58 no.2 / pp.131-136 / 2015
  • Objective: To investigate the incidence of corpus callosum injury (CCI) in patients with mild traumatic brain injury (TBI) using brain MRI. We also reviewed the clinical characteristics associated with this injury. Methods: A total of 356 patients in the study were diagnosed with TBI, with 94 patients classified as having mild TBI. We included patients with mild TBI for further evaluation if they had normal findings on brain computed tomography (CT) scans and also underwent brain MRI in the acute phase following trauma. On brain MRI, CCI was defined as a high-signal lesion in T2 sagittal images with a corresponding low-signal lesion on axial gradient echo (GRE) imaging. Based on these criteria, we divided patients into two groups for further analysis: Group I (TBI patients with CCI) and Group II (TBI patients without CCI). Results: A total of 56 patients were enrolled in this study (16 patients in Group I and 40 patients in Group II). Analysis of clinical symptoms revealed a significant difference in headache severity between groups. Over 50% of patients in Group I experienced prolonged neurological symptoms, including dizziness and gait disturbance, which were more common in Group I than in Group II (dizziness: 37% and 12% in Groups I and II, respectively; gait disturbance: 12% and 0% in Groups I and II, respectively). Conclusion: The incidence of CCI in patients with mild TBI was approximately 29%. We suggest that brain MRI is a useful method to reveal the cause of persistent symptoms and predict clinical prognosis.
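
    The reported incidence of approximately 29% is consistent with the group sizes given in the Results. A short check, assuming the incidence is computed as Group I over all enrolled patients (the variable names are ours):

```python
# Group sizes as stated in the Results section of the abstract.
group_i = 16   # mild TBI patients with corpus callosum injury (CCI)
group_ii = 40  # mild TBI patients without CCI

enrolled = group_i + group_ii
incidence = group_i / enrolled

print(f"enrolled: {enrolled}, incidence of CCI: {incidence:.1%}")
```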

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa; Nain, Neeta; Nehra, Maninder
    • Journal of Multimedia Information System / v.5 no.3 / pp.147-154 / 2018
  • Natural language processing (NLP) is an emerging research area that studies how machines can be used to perceive and alter text written in natural languages. Different tasks can be performed on natural languages by analyzing them through annotation tasks such as parsing, chunking, part-of-speech tagging, and lexical analysis. These annotation tasks depend on the morphological structure of the particular natural language. The focus of this work is part-of-speech tagging (POS tagging) for the Hindi language. Part-of-speech tagging, also known as grammatical tagging, is the process of assigning a grammatical category to each word of a given text. These grammatical categories can be noun, verb, time, date, number, etc. Hindi is the most widely used and official language of India. It is also among the top five most spoken languages of the world. For English and other languages, a diverse range of POS taggers is available, but these taggers cannot be applied to Hindi, as Hindi is one of the most morphologically rich languages. Furthermore, there is a significant difference between the morphological structures of these languages. Thus, this work presents a POS tagger system for the Hindi language based on a hybrid approach that combines probability-based and rule-based approaches. For tagging known words, a unigram probability model is used, whereas for tagging unknown words, various lexical and contextual features are used. Finite state automata are constructed to demonstrate the different rules, and regular expressions are then used to implement these rules. A tagset containing 29 standard part-of-speech tags is also prepared for this task. The tagset includes two unique tags, a date tag and a time tag, which support all possible formats. Regular expressions are used to implement all pattern-based tags such as time, date, number, and special symbols. The aim of the presented approach is to increase the correctness of automatic Hindi POS tagging while limiting the need for a large human-made corpus. The hybrid approach uses a probability-based model to increase automatic tagging and a rule-based model to limit the need for an already trained corpus. The approach is based on a very small labeled training set (around 9,000 words) and yields a best precision of 96.54% and an average precision of 95.08%. It also yields a best accuracy of 91.39% and an average accuracy of 88.15%.
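
    The hybrid idea described above (unigram lookup for known words, regular expressions for pattern-based tags) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the tag names, the regular expressions, the `UNK` fallback, and the transliterated sample words are all assumptions, and the paper's lexical and contextual features for unknown words are not modeled.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern rules; the paper's actual expressions are not given.
PATTERN_RULES = [
    ("DATE", re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("NUM", re.compile(r"\d+(\.\d+)?")),
    ("SYM", re.compile(r"[^\w\s]+")),
]


def train_unigram(tagged_sentences):
    """Map each known word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default="UNK"):
    """Tag known words with the unigram model, then try the pattern rules."""
    tagged = []
    for token in tokens:
        if token in model:
            tagged.append((token, model[token]))
            continue
        rule_tag = next(
            (name for name, pattern in PATTERN_RULES if pattern.fullmatch(token)),
            default,
        )
        tagged.append((token, rule_tag))
    return tagged


# Tiny invented training set with transliterated Hindi words.
training = [
    [("raam", "NNP"), ("ghar", "NN"), ("gayaa", "VB")],
    [("ghar", "NN")],
]
model = train_unigram(training)
result = tag_tokens(["raam", "12/05/2018", "10:30", "ghar", "42", ".", "kitaab"], model)
print(result)
```

    Checking the unigram model first and the patterns second mirrors the known-word/unknown-word split in the abstract; the order of the pattern rules matters because the first full match wins.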

The Study on the Meaning Change of 'Startup' and 'Entrepreneurship' using the Bigdata-based Corpus Network Analysis (빅데이터 기반 어휘연결망분석을 활용한 '창업'과 '기업가정신'의 의미변화연구)

  • Kim, Yeonjong; Park, Sanghyeok
    • Journal of Korea Society of Digital Industry and Information Management / v.16 no.4 / pp.75-93 / 2020
  • The purpose of this study is to extract keywords for 'startup' and 'entrepreneurship' from Naver news articles in Korea since 1990 and from Google News articles abroad, and to understand how the meanings of 'startup' and 'entrepreneurship' have changed in each era. In summary, first, in terms of keyword frequency, the venture germination period reflects government-led initiatives and the entrepreneurial spirit of company chairmen, with various technology investments and investments in establishing companies, and training for item development was carried out; in the venture re-leap period, youth-oriented entrepreneurship and innovation through the development of various educational programs were emphasized. Second, in the vocabulary network analysis, the network connectivity and centrality of keywords tended to be stronger in the leap period than in the germination period, but in the re-leap period they tended to return to the level of the germination period. Third, in the topic analysis, Naver keyword topics mostly concern business-related content on support, policy, and education, whereas topics from Google News consist of major keywords that are more specifically applicable to practical work.
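
    To make the notion of keyword "centrality" concrete, the sketch below builds a keyword co-occurrence network and computes normalized degree centrality. The abstract does not say which network construction or centrality measure the study used, so both are assumptions here, and the keyword lists are invented.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(documents):
    """Count how often each unordered keyword pair appears in the same document."""
    edges = defaultdict(int)
    for keywords in documents:
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return dict(edges)


def degree_centrality(edges):
    """Share of the other nodes that each keyword is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Invented keyword lists, one per news article.
articles = [
    ["startup", "venture", "investment"],
    ["startup", "education", "youth"],
    ["startup", "venture", "policy"],
]
edges = cooccurrence_network(articles)
centrality = degree_centrality(edges)
for keyword, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{keyword}: {score:.2f}")
```

    Comparing such scores across the networks built for each period is one way the "stronger" or "weaker" centrality described above could be quantified.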

Functional Lexical Bundles in Nuclear Science and Engineering Research Articles (원자력과학공학 학술 논문에 나타난 기능적 어휘다발 분석)

  • Nam, Daehyeon
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.426-435 / 2021
  • This study aims to functionally classify lexical bundles appearing in academic papers on nuclear science and engineering written in English and then analyze their characteristics in comparison with those appearing in general academic papers. To this end, the texts of nuclear science and engineering papers were collected and compiled into a corpus (c. 1 million tokens). They were then statistically compared with a corpus of general academic papers (c. 750,000 tokens) through chi-square tests and standardized residuals. The results revealed that, compared to general academic papers, bundles in the stance category were the most used among the functional lexical bundles in nuclear science and engineering. The use of lexical bundles lacked variety: the same types of bundles were 're-used' and 'recycled'. Based on these results, educational implications for English for Academic Purposes and directions for follow-up research are discussed.
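
    The comparison of two corpora with a chi-square test and standardized residuals can be sketched as below. The counts are invented for illustration (only the approximate corpus sizes come from the abstract), and the residual is assumed to be the Pearson form, (observed - expected) / sqrt(expected); the paper may have used an adjusted variant.

```python
import math


def chi_square(table):
    """Return the chi-square statistic and Pearson residuals for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
            residual_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(residual_row)
    return chi2, residuals


# Hypothetical counts: rows are corpora, columns are
# [tokens inside stance bundles, all other tokens].
table = [
    [1_200, 1_000_000 - 1_200],  # nuclear science and engineering corpus
    [600, 750_000 - 600],        # general academic corpus
]
chi2, residuals = chi_square(table)
print(f"chi-square = {chi2:.2f}")
print(f"residual for stance bundles in the nuclear corpus = {residuals[0][0]:.2f}")
```

    With one degree of freedom, a statistic above 3.84 is significant at the 0.05 level, and a positive residual in the first cell indicates that the bundle category is over-represented in the nuclear corpus relative to the general one.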

Cloning of Korean Morphological Analyzers using Pre-analyzed Eojeol Dictionary and Syllable-based Probabilistic Model (기분석 어절 사전과 음절 단위의 확률 모델을 이용한 한국어 형태소 분석기 복제)

  • Shim, Kwangseob
    • KIISE Transactions on Computing Practices / v.22 no.3 / pp.119-126 / 2016
  • In this study, we verified the feasibility of a Korean morphological analyzer that uses a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model. For the verification, two Korean morphological analyzers, MACH and KLT2000, were cloned with a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model. The analysis results of each cloned analyzer were compared with those of MACH and KLT2000. The 10-million-Eojeol Sejong corpus was segmented into 10 sets for cross-validation. The 10-fold cross-validated precision and recall were 97.16% and 98.31% for the cloned MACH, and 96.80% and 99.03% for the cloned KLT2000, respectively. The analysis speed of the cloned MACH was 308,000 Eojeols per second, and that of the cloned KLT2000 was 436,000 Eojeols per second. The experimental results indicate that a Korean morphological analyzer that uses a pre-analyzed Eojeol dictionary and a syllable-based probabilistic model could be used in practical applications.
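
    A rough sketch of the evaluation setup described above: splitting a corpus into 10 folds and scoring a clone's output against the original analyzer's output with precision and recall. How the paper matched morphemes is not stated, so the multiset overlap below is an assumption, and the sample analyses are invented.

```python
from collections import Counter


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs, using each of the k folds once as the test set."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def precision_recall(predicted, gold):
    """Precision and recall of predicted morphemes against reference morphemes."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Invented example: reference analysis (original analyzer) vs. clone output.
gold = ["나/NP", "는/JX", "학교/NNG", "에/JKB"]
predicted = ["나/NP", "는/JX", "학교/NNG", "에/JKB", "가/VV"]
precision, recall = precision_recall(predicted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")

# Fold sizes for a toy corpus of 20 items.
splits = list(k_fold_splits(list(range(20)), k=10))
print([(len(train), len(test)) for train, test in splits][:3])
```

    In a cloning experiment of this kind, the dictionary and probabilistic model would be built from the nine training folds and scored on the held-out fold, with the reported figures averaged over the ten runs.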

Enhanced Sign Language Transcription System via Hand Tracking and Pose Estimation

  • Kim, Jung-Ho; Kim, Najoung; Park, Hancheol; Park, Jong C.
    • Journal of Computing Science and Engineering / v.10 no.3 / pp.95-101 / 2016
  • In this study, we propose a new system for constructing parallel corpora for sign languages, which are generally under-resourced in comparison to spoken languages. In order to achieve scalability and accessibility regarding data collection and corpus construction, our system utilizes deep learning-based techniques and predicts depth information to perform pose estimation on hand information obtainable from video recordings by a single RGB camera. These estimated poses are then transcribed into expressions in SignWriting. We evaluate the accuracy of hand tracking and hand pose estimation modules of our system quantitatively, using the American Sign Language Image Dataset and the American Sign Language Lexicon Video Dataset. The evaluation results show that our transcription system has a high potential to be successfully employed in constructing a sizable sign language corpus using various types of video resources.