• Title/Abstract/Keywords: Corpus analysis

Search results: 424 items (processing time: 0.027 seconds)

의미 분석을 위한 말뭉치 기반의 온톨로지 학습 (Corpus-Based Ontology Learning for Semantic Analysis)

  • 강신재
    • 한국산업정보학회논문지 / Vol. 9, No. 1 / pp. 17-23 / 2004
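  • This paper presents a corpus-based ontology learning method for determining word senses in Korean information processing. Senses that can be determined with certainty are first resolved using information from electronic dictionaries that are already available; the senses of words that remain unresolved are then decided using the ontology. To use the ontology as a knowledge base for word sense disambiguation, the mutual information between concepts in the ontology is computed in advance from corpus statistics. If the computed mutual information values are treated as weights, the ontology can be viewed as a weighted graph, so the relatedness between two concepts can be assessed through the shortest path between them. In an actual machine translation system, the method yielded a 9% performance improvement over a method that did not use the ontology.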

  • PDF
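
The weighted-graph idea in the abstract above can be illustrated with a minimal Python sketch. The abstract does not say how mutual information is turned into an edge weight, so everything specific here is an assumption: pointwise mutual information (PMI) as the association score, `1 / PMI` as the edge cost, Dijkstra's algorithm for the shortest path, and the toy counts and concept names. It is not the authors' implementation.

```python
import heapq
import math
from collections import defaultdict


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information (base 2) from co-occurrence counts."""
    return math.log2((pair_count * total) / (count_a * count_b))


def build_graph(pair_counts, concept_counts, total):
    """Build an undirected weighted graph; stronger association gives a cheaper edge."""
    graph = defaultdict(dict)
    for (a, b), n_ab in pair_counts.items():
        score = pmi(n_ab, concept_counts[a], concept_counts[b], total)
        if score > 0:  # keep only positively associated concept pairs
            cost = 1.0 / score  # assumed conversion from association to distance
            graph[a][b] = cost
            graph[b][a] = cost
    return graph


def shortest_path_cost(graph, source, target):
    """Dijkstra's algorithm; returns math.inf when target cannot be reached."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, cost in graph.get(node, {}).items():
            candidate = dist + cost
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return math.inf


# Hypothetical corpus statistics, for illustration only.
concept_counts = {"money": 40, "bank": 20, "river": 20, "water": 40}
pair_counts = {("bank", "money"): 8, ("bank", "river"): 2, ("river", "water"): 16}
total_contexts = 1000

graph = build_graph(pair_counts, concept_counts, total_contexts)
print(shortest_path_cost(graph, "money", "water"))
```

A lower path cost means the two concepts are more closely related under these assumptions; a disambiguation step could then prefer the candidate sense whose concept is closest to the concepts of the surrounding words.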

텍스트 내 사건-공간 표현 간 참조 관계 분석을 위한 말뭉치 주석 (Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text)

  • 정진우;이희진;박종철
    • 한국언어정보학회지:언어와정보 / Vol. 18, No. 2 / pp. 141-168 / 2014
  • Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is rarely mentioned compared with events, and the association between event and spatial expressions is also highly implicit in a text. This makes it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations, based on the annotated corpus, to provide insight into developing an automated recognition method.

  • PDF

Predicting CEFR Levels in L2 Oral Speech, Based on Lexical and Syntactic Complexity

  • Hu, Xiaolin
    • 아시아태평양코퍼스연구 / Vol. 2, No. 1 / pp. 35-45 / 2021
  • With the wide spread of the Common European Framework of Reference (CEFR) scales, many studies attempt to apply them in routine teaching and rater training, while more evidence regarding criterial features at different CEFR levels is still urgently needed. The current study aims to explore complexity features that distinguish and predict CEFR proficiency levels in oral performance. Using a quantitative, corpus-based approach, this research analyzed lexical and syntactic complexity features over 80 transcriptions (including A1, A2, and B1 CEFR levels, and native speakers) based on an interview test, the Standard Speaking Test (SST). ANOVA and correlation analysis were conducted to exclude insignificant complexity indices before the discriminant analysis. As a result, distinctive differences in complexity between CEFR speaking levels were observed, and with a combination of six major complexity features as predictors, 78.8% of the oral transcriptions were classified into the appropriate CEFR proficiency levels. This further confirms the possibility of predicting the CEFR level of L2 learners based on their objective linguistic features. This study can be helpful as an empirical reference in language pedagogy, especially for L2 learners' self-assessment and teachers' prediction of students' proficiency levels. It also offers implications for the validation of rating criteria and the improvement of rating systems.

한국어 학습자 대상 관형격 조사 '의'의 교육 내용 재고: 학습자 말뭉치에 나타난 오류를 바탕으로 (A Study to Rethink the Components of Teaching Korean Genitive Particle '의': Based on the Errors in Korean Learners' Corpus)

  • 이수현;심지영
    • 한국산업융합학회 논문집 / Vol. 26, No. 3 / pp. 443-454 / 2023
  • The purpose of this study is to reveal Korean learners' usage patterns of the genitive particle '의' according to semantic classification, so that they can be referred to in determining the contents and methods of related education. The study adopts a quantitative analysis using the learner corpus established by the National Institute of Korean Language. The analysis shows that, as proficiency increases, the overall frequency of '의' increases and the number of senses used increases. However, the frequency of errors also increases. As for the usage pattern of each sense, 'ownership, belonging' is the most frequent, followed by 'acting entity', 'kinship, social relations', and 'relationship (area)'. In conclusion, the senses 'acting entity' and 'relationship (area)' need to be supplemented with explicit instruction. Other senses need further discussion, and decisions should be made in consideration of learning purpose and proficiency.

텍스트 마이닝 기법을 활용한 ECDIS 사고보고서 분석 (Text Mining Analysis Technique on ECDIS Accident Report)

  • 이정석;이보경;조익순
    • 해양환경안전학회지 / Vol. 25, No. 4 / pp. 405-412 / 2019
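  • SOLAS requires ships of 500 gross tonnage or more engaged in international voyages to install ECDIS by the first survey falling on or after July 1, 2018. As ECDIS has come on board as a new primary navigation aid, various accidents related to its use have occurred. Twelve accident reports issued by MAIB, BSU, BEAmer, DMAIB, and DSB attributed the accidents to navigators' lack of operating proficiency and to the ECDIS system, and the report texts were analyzed with the R program to quantify the words related to accident causes. To represent the importance of words according to their frequency of occurrence, three text mining techniques were used: word cloud, word association, and word weighting. The word cloud, which displays the frequencies of the words used in the form of a cloud, was produced by applying the N-gram model. In the uni-gram analysis, the word 'ECDIS' was the most frequent, and in the bi-gram analysis, 'Safety Contour' was the most frequent. Based on the bi-gram analysis, the accident-cause words were divided into those related to the navigator and those related to the ECDIS system, and the related words were presented using word association. Finally, the words associated with the navigator and with the ECDIS system were assembled into word corpora, and word weighting was applied to analyze year-by-year changes in corpus frequency. Trend-line analysis of these changes showed that the navigator corpus decreased in more recent years, whereas the ECDIS system corpus gradually increased.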
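
The uni-gram and bi-gram frequency counts described in the abstract above can be sketched as follows. The study used R; this Python version is only an illustration, and the sample sentences, the stop-word list, and the function names are hypothetical rather than taken from the accident reports.

```python
import re
from collections import Counter

# Assumed stop-word list; the abstract does not describe the preprocessing.
STOP_WORDS = {"the", "was", "were", "on", "and", "did", "not"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def ngram_counts(texts, n):
    """Count n-grams over the filtered tokens of each text.

    Stop words are removed first, so an n-gram may span a removed word.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Hypothetical sentences, for illustration only.
reports = [
    "The safety contour was set incorrectly on the ECDIS.",
    "ECDIS alarms were muted and the safety contour was ignored.",
    "The officer did not check the ECDIS safety contour.",
    "ECDIS training was insufficient.",
]

unigrams = ngram_counts(reports, 1)
bigrams = ngram_counts(reports, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

A word cloud simply scales each term by counts like these, so the same `Counter` objects could feed a plotting step.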

MHC Class II+ (HLA-DP-like) Cells in the Cow Reproductive Tract: I. Immunolocalization and Distribution of MHC Class II+ Cells in Uterus at Different Phases of the Estrous Cycle

  • Eren, U.;Sandikci, M.;Kum, S.;Eren, V.
    • Asian-Australasian Journal of Animal Sciences / Vol. 21, No. 1 / pp. 35-41 / 2008
  • This study was undertaken to investigate the distribution of major histocompatibility complex class II positive (MHC II+) (HLA-DP-like) cells in the cow uterus (cervix, corpus and cornu uteri) and to compare these cells between the estrus and diestrus phases of the estrous cycle. Twenty-nine multiparous cows were used. Tissue samples from the middle of the cervix, the corpus and the right cornu were taken immediately after slaughter at the estrus or diestrus phase. Streptavidin-biotin peroxidase complex staining was used to detect MHC II+ cells. The number of MHC II+ cells per unit area of tissue was counted using image analysis software under a light microscope. Numerous MHC II+ cells were found in the endometrium (cervix, corpus and cornu uteri) in both estrus and diestrus. MHC II+ cells were found in the surface epithelium of the cervix uteri in diestrus, but in the corpus uteri in both estrus and diestrus and in the cornu uteri in estrus. MHC II+ cells were also found freely in the lumen of the glands and between the gland epithelia of the corpus and cornu uteri in both estrus and diestrus. There were also MHC II+ cells in the connective tissue of the myometrium and perimetrium (outside the endometrium) and around the blood vessels. Endothelial cells were frequently positive for MHC II staining. More MHC II+ cells were found in the endometrium than outside the endometrium in both estrus and diestrus (p<0.001). However, there was no difference in the numbers of positive cells between estrus and diestrus either in the endometrium or outside it. These results are the first evidence for HLA-DP-like MHC II+ cells in the bovine uterus. They indicate that antigen presentation by HLA-DP-like MHC II+ cells of the uterus is not influenced by hormonal status.

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • 말소리와 음성과학 / Vol. 13, No. 3 / pp. 31-37 / 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a sharply peaked distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or the mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize the acoustic properties of f0 variability in a language for individuals or groups. Further studies on the use of statistical measures to secure reliable f0 values of individual speakers would be desirable.
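
The summary statistics mentioned in the abstract above (mode, median, right skew) can be illustrated with a short Python sketch. The study itself used Praat and R; the f0 values below are hypothetical, and the moment-based skewness formula is one common definition chosen here as an assumption.

```python
import statistics


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (population moments)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical f0 measurements in Hz, rounded to whole numbers.
f0_values = [98, 102, 102, 102, 105, 110, 118, 130, 160, 210]

f0_mode = statistics.mode(f0_values)      # most frequent value
f0_median = statistics.median(f0_values)  # middle of the sorted values
f0_skew = skewness(f0_values)             # positive means a longer right tail

print(f0_mode, f0_median, round(f0_skew, 2))
```

With a long right tail the mean is pulled upward, which is one reason the median or the mode may be a steadier reference value for a speaker's f0 than the mean.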

북한 제1중학교 영어교과서 분석 (Analysis of the English Textbooks in North Korean First Middle School)

  • 황서연;김정렬
    • 한국콘텐츠학회논문지 / Vol. 17, No. 11 / pp. 242-251 / 2017
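  • This study built a corpus from the English textbooks of First Middle Schools, North Korea's institutions for educating gifted students, and analyzed it to identify their linguistic features. Many studies have examined the English textbooks of North Korea's general middle schools, but research on the English textbooks of First Middle Schools has been lacking. To this end, the organization of the first-, second-, fourth-, and sixth-year First Middle School English textbooks obtained from the Information Center on North Korea (북한자료센터) was examined, a corpus was built, and WordSmith Tools 7.0 was used to analyze the linguistic features and high-frequency content words of the textbooks. The basic statistics showed that the number of words did not increase with grade level, but lexical diversity tended to rise steadily toward the higher grades. The distribution of high-frequency content words, meanwhile, differed considerably across grades depending on the topics of the passages in each grade's textbook.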
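
Lexical diversity of the kind mentioned in the abstract above is often measured with a type/token ratio (TTR). The abstract does not say which measure was used, so the sketch below is an assumption: it shows a plain TTR and a standardised variant averaged over fixed-size chunks, with hypothetical function names and sample text.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks; a trailing partial chunk is dropped.

    Falls back to the plain TTR when the text is shorter than one chunk.
    """
    chunks = [
        tokens[i:i + chunk_size]
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


# Hypothetical textbook sentence, for illustration only.
sample = "the cat saw the dog".split()
print(type_token_ratio(sample))
print(standardized_ttr(sample, chunk_size=2))
```

Plain TTR falls as texts get longer, so a chunked average is the safer choice when comparing textbooks of different sizes across grades.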

딥러닝 및 토픽모델링 기법을 활용한 소셜 미디어의 자살 경향 문헌 판별 및 분석 (Examining Suicide Tendency Social Media Texts by Deep Learning and Topic Modeling Techniques)

  • 고영수;이주희;송민
    • 한국비블리아학회지 / Vol. 32, No. 3 / pp. 247-264 / 2021
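  • Suicide is the fourth leading cause of death worldwide and a difficult problem that causes heavy social and economic losses. To help prevent suicide, this study aimed to build a suicide-related corpus from social media and to use it to create a deep learning model that automatically classifies texts showing suicidal tendencies. In addition, to analyze suicide factors, it aimed to divide the suicide-related corpus into detailed topics using topic modeling, an analysis technique that extracts topics automatically. To this end, 2,011 suicide-related documents were collected from Naver Knowledge iN (네이버 지식iN), a social media service, and annotated as showing or not showing suicidal tendencies according to a suicide prevention education manual. Deep learning models (LSTM, BERT, ELECTRA) were then trained on these data to build automatic classifiers. Furthermore, the documents were classified by topic with LDA, a topic modeling technique, to identify suicide factors, and co-occurrence word analysis and network visualization were carried out for each topic to analyze the factors in depth.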
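
The co-occurrence word analysis that precedes the network visualization in the abstract above can be sketched in Python. This is a minimal illustration under stated assumptions: co-occurrence is counted once per document, the `min_count` threshold and function names are hypothetical, and the tokenized documents are neutral placeholders, not data from the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for tokens in documents:
        # sorted(set(...)) gives each pair one canonical order per document
        counts.update(combinations(sorted(set(tokens)), 2))
    return counts


def network_edges(counts, min_count=2):
    """Keep pairs seen in at least min_count documents as weighted edges."""
    return sorted(
        (a, b, n) for (a, b), n in counts.items() if n >= min_count
    )


# Hypothetical tokenized documents for one topic, for illustration only.
topic_documents = [
    ["stress", "school", "sleep"],
    ["school", "stress"],
    ["sleep", "family"],
]

pair_counts = cooccurrence_counts(topic_documents)
edges = network_edges(pair_counts)
print(edges)
```

Each resulting edge is a `(word, word, weight)` triple, which is the usual input for a graph-drawing library when a per-topic network is visualized.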

Morphological differences between Water deer and Sika deer ovaries during estrus and pregnancy

  • Ji-Hye Lee;Yong-Su Park;Min-Gee Oh;Sang-Hwan Kim
    • 한국동물생명공학회지 / Vol. 38, No. 2 / pp. 62-69 / 2023
  • Background: Research on the reproductive physiology of Water deer and Sika deer, described here as endemic to Korea, remains incomplete. This study analyzed the ovarian development and morphological characteristics of wild Water deer and Sika deer. Methods: Water deer and Sika deer ovaries were collected from the Korean Peninsula and the Russia-Korean Peninsula border during the estrus and pregnancy seasons, respectively. Morphological and physiological analyses and immunohistochemistry were conducted to confirm the detection of Ca2+ and to assess the morphological changes in the ovaries. Results: Morphological analysis of the ovaries during pregnancy and estrus showed that the development of the corpus luteum and follicles in Water deer followed patterns similar to those of other mammals. In contrast, the corpus luteum of Sika deer differed from that of Water deer in tissue morphology and composition. Ca2+ related to tissue metabolism was detected in the theca cell zone of Water deer during estrus and was highly detected in the luteal cell zone during pregnancy. Hormone receptor protein expression was generally higher in the ovaries of Water deer during estrus and pregnancy than in those of Sika deer. The expression of LH receptor was relatively low in the lutein cell zone, unlike that in Water deer. The expression of VEGF also differed from that in Water deer, and the response in Sika deer was very low relative to Water deer in the expression of all development-related proteins. Conclusions: The results show that the composition of the corpus luteum of Sika deer is less distinct than that of Water deer, and that there are many differences in the functional and morphological formation of the corpus luteum.