• 제목/요약/키워드: Corpus-based Analysis

검색결과 202건 처리시간 0.022초

트위터 문서에서 시간 및 리트윗 분석을 통한 핵심 사건 추출 (Extracting Core Events Based on Timeline and Retweet Analysis in Twitter Corpus)

  • ;이경순
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • 제1권1호
    • /
    • pp.69-74
    • /
    • 2012
  • 인터넷 사용자들은 어떠한 이슈에 대해 소셜 네트워크 서비스를 통해 빠르고 간결하게 다른 사람들과 지속적인 커뮤니케이션을 원한다. 사회적 이슈에 대해 어떠한 사건이 일어나게 되면 그날의 트윗 글과 리트윗 개수에 영향을 미치게 된다. 본 논문에서는 트위터 자료에서 사회적인 핵심 사건을 추출하기 위해 시간 분석과 감성 자질 및 리트윗 정보를 이용하는 방법을 제안한다. 제안 방법의 유효성을 검증하기 위해 비교실험으로 어휘 빈도수를 이용하여 핵심 사건을 추출하는 방법, 어휘 빈도수와 감성 자질을 함께 이용한 방법, 시간 분석을 반영하기 위해 카이제곱만을 이용한 방법과 제안 방법인 어휘 빈도수, 감성 자질, 리트윗 및 카이제곱을 함께 이용한 방법으로 성능을 비교하였다. 성능 평가를 위해서는 추출된 사건리스트에서 상위 10개 결과에서 정확도를 계산하였는데, 제안 방법이 94.9%의 성능을 보였다. 실험을 통해 제안한 방법이 핵심 사건 추출에 효과적인 방법임을 알 수 있다.

감정점수의 전파를 통한 한국어 감정사전 생성 (Generating a Korean Sentiment Lexicon Through Sentiment Score Propagation)

  • 박호민;김창현;김재훈
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • 제9권2호
    • /
    • pp.53-60
    • /
    • 2020
  • 감정분석은 문서 또는 대화상에서 주어진 주제에 대한 태도와 의견을 이해하는 과정이다. 감정분석에는 다양한 접근법이 있다. 그 중 하나는 감정사전을 이용하는 사전 기반 접근법이다. 본 논문에서는 널리 알려진 영어 감정사전인 VADER를 활용하여 한국어 감정사전을 자동으로 생성하는 방법을 제안한다. 제안된 방법은 세 단계로 구성된다. 첫 번째 단계는 한영 병렬 말뭉치를 사용하여 한영 이중언어 사전을 제작한다. 제작된 이중언어 사전은 VADER 감정어와 한국어 형태소 쌍들의 집합이다. 두 번째 단계는 그 이중언어 사전을 사용하여 한영 단어 그래프를 생성한다. 세 번째 단계는 생성된 단어 그래프 상에서 레이블 전파 알고리즘을 실행하여 새로운 감정사전을 구축한다. 이와 같은 과정으로 생성된 한국어 감정사전을 유용성을 보이려고 몇 가지 실험을 수행하였다. 본 논문에서 생성된 감정사전을 이용한 감정 분류기가 기존의 기계학습 기반 감정분류기보다 좋은 성능을 보였다. 앞으로 본 논문에서 제안된 방법을 적용하여 여러 언어의 감정사전을 생성하려고 한다.

The Role of Distributional Cues in the Acquisition of Verb Argument Structures

  • Kim, Mee-Sook
    • 한국언어정보학회지:언어와정보
    • /
    • 제7권1호
    • /
    • pp.87-99
    • /
    • 2003
  • This paper investigates the role of input frequency in the acquisition of verb argument structures based on distributional information of a corpus of utterances derived from the English CHILDES database (MacWhinney 1993). It has been widely accepted that children successfully learn verb argument structures by innate language mechanisms, such as linking rules which connect verb meanings and its syntactic structures. In contrast, an approach to language acquisition called “statistical language learning” has currently claimed that children could succeed in acquiring syntactic structures in the absence of innate language mechanisms, making use of distributional properties of the input. In this paper, I evaluate the feasibility of the statistical learning in acquiring verb argument structures, based on distributional information about locative verbs in parental input. The naturalistic data allow us to investigate to what extent the statistical learning approach can and cannot help children succeed in learning the syntax of locative verbs. Based on the results of English database analysis, I show that there is rich statistical information for learning the syntactic possibilities of locative verbs in parental input, despite some limitations in the statistical learning approach.

  • PDF

코퍼스 기반 한국어 합성기의 억양 구현 방안 (A Method of Intonation Modeling for Corpus-Based Korean Speech Synthesizer)

  • 김진영;박상언;엄기완;최승호
    • 음성과학
    • /
    • 제7권2호
    • /
    • pp.193-208
    • /
    • 2000
  • This paper describes a multi-step method of intonation modeling for corpus-based Korean speech synthesizer. We selected 1833 sentences considering various syntactic structures and built a corresponding speech corpus uttered by a female announcer. We detected the pitch using laryngograph signals and manually marked the prosodic boundaries on recorded speech, and carried out the tagging of part-of-speech and syntactic analysis on the text. The detected pitch was separated into 3 frequency bands of low, mid, high frequency components which correspond to the baseline, the word tone, and the syllable tone. We predicted them using the CART method and the Viterbi search algorithm with a word-tone-dictionary. In the collected spoken sentences, 1500 sentences were trained and 333 sentences were tested. In the layer of word tone modeling, we compared two methods. One is to predict the word tone corresponding to the mid-frequency components directly and the other is to predict it by multiplying the ratio of the word tone to the baseline by the baseline. The former method resulted in a mean error of 12.37 Hz and the latter in one of 12.41 Hz, similar to each other. In the layer of syllable tone modeling, it resulted in a mean error rate less than 8.3% comparing with the mean pitch, 193.56 Hz of the announcer, so its performance was relatively good.

  • PDF

텍스트 내 사건-공간 표현 간 참조 관계 분석을 위한 말뭉치 주석 (Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text)

  • 정진우;이희진;박종철
    • 한국언어정보학회지:언어와정보
    • /
    • 제18권2호
    • /
    • pp.141-168
    • /
    • 2014
  • Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events and but also for the understanding of the relations among them. However, spatial information is rarely mentioned as compared to events and the association between event and spatial expressions is also highly implicit in a text. This would make it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus to give a helpful insight into developing an automated recognition method.

  • PDF

한국어 학습자 대상 관형격 조사 '의'의 교육 내용 재고: 학습자 말뭉치에 나타난 오류를 바탕으로 (A Study to Rethink the Components of Teaching Korean Genitive Particle '의': Based on the Errors in Korean Learners' Corpus)

  • 이수현;심지영
    • 한국산업융합학회 논문집
    • /
    • 제26권3호
    • /
    • pp.443-454
    • /
    • 2023
  • The purpose of this study is to reveal the Korean learners' usage pattern of '의', the genitive particle, according to semantic classification, so that it can be referred to in determining the contents and methods of related education. The method of this study adopts a quantitative analysis using learners corpus established by National Institute of Korean Language. As a result of the analysis, as proficiency increases, the overall frequency of '의' increases and the number of meaning senses used increases. However, the frequency of errors also increases with it. As for the usage pattern of each sense, the meaning of 'ownership, belonging' is the most frequent, and followed by 'acting entity', 'kinship, social relations', and 'relationship(area)'. In conclusion, the meanings of 'acting subjects' and 'relationships(area) need to be supplemented with explicit education. Other meanings need to be discussed, and decisions should be made in consideration of learning purpose and proficiency.

A Deterministic Method for Structural Analysis of Compound Words in Japanese

  • Han, Dongli;Ito, Takeshi;Furugori, Teiji
    • 한국언어정보학회:학술대회논문집
    • /
    • 한국언어정보학회 2002년도 Language, Information, and Computation Proceedings of The 16th Pacific Asia Conference
    • /
    • pp.79-91
    • /
    • 2002
  • Structural analysis of compound words is necessary and an important process in natural language processing. Proposed here is a corpus- and statistics- based method for the structural analysis of compound words in Japanese. We determine the structure of a compound word by using Internet corpus and calculating the strength of word association among its constituent words. Experiments with 5, 6, 7, and 8 kanji compound words show that our method works well and its performance is better than those of other comparable studies.

  • PDF

창업 온톨로지 구축을 위한 벤처창업 연구의 지식구조 분석 (An Analysis of the Intellectual Structure of Venture-Creation Studies to build an Entrepreneurship Ontology)

  • 심재후;최명길
    • 지식경영연구
    • /
    • 제14권4호
    • /
    • pp.75-86
    • /
    • 2013
  • The deeping interests and research toward Entrepreneurship, which is considered as an potential alternative for solving the continuing economic recession in the $21^{st}$ century, have grown. The process and methodology of the research could not be systematically arranged and the results of the research lack in efforts on the application of increasing suceess ratio in starting new business. This study adopted corpus methodology, through which we try to analyzes the knowledge structure in entrepreneurship research, derive essential concepts and the consisting domains in venture research. Based on the results of analysis, this study constructs the knowledge structure of venture research in a form of knowledge ontology. The results of the study could be a ground for entrepreneurship research and utilized as implication for a creation of construction for the entrepreneurship knowledge ontology.

  • PDF

Digital Application of Intangible Cultural Heritage from the Perspective of Cultural Ecology

  • Jing, Xiuli;Tan, Fang;Zhang, Mu
    • Journal of Smart Tourism
    • /
    • 제1권1호
    • /
    • pp.41-52
    • /
    • 2021
  • This paper explored the digital application of intangible cultural heritage from the perspective of cultural ecology. Through field investigations, combined with cultural ecology theory, an ontology-based semantic web technology was proposed, and Nanjing "Yunjin" brocade weaving technique was selected as the research object. The specific steps were as follows: First, based on the field surveys and cultural ecology theory, the intangible cultural ecological environment was divided into natural and social environments. Next, constructing the intangible cultural heritage ontology was constructed, including the collection and collation of Nanjing Yunjin weaving technique knowledge corpus, based on user needs analysis and corpus analysis, CIDOC CRM was used to create rules to build the ontology. Finally, based on the MediaWiki platform and Semantic MediaWiki, the semantic web model of the intangible cultural heritage was designed, and its semantic retrieval function was realized, thereby achieving the practical application of intangible cultural heritage digitization. Based on the perspective of cultural ecology, a set of intangible digital application models was proposed, which expanded the digital application of the cultural ecology theory, verified the application of this model in the sustainable development of cultural tourism, and provided reference for the sustainable development of cultural tourism.

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • 말소리와 음성과학
    • /
    • 제13권3호
    • /
    • pp.31-37
    • /
    • 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a pointy distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize acoustic properties of f0 variability in a language by an individual or groups. Further studies would be desirable of the use of statistical measures to secure reliable f0 values of individual speakers.