• Title/Summary/Keyword: Lexical model

Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization (부분 단어 토큰화 기법을 이용한 뉴스 기사 정치적 편향성 자동 분류 및 어휘 분석)

  • Cho, Dan Bi;Lee, Hyun Young;Jung, Won Sup;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering / v.10 no.1 / pp.1-8 / 2021
  • News articles in the political field show polarized, biased characteristics, such as conservative and liberal leanings, which is called political bias. We constructed a keyword-based dataset to classify the bias of news articles. Most embedding research represents a sentence as a sequence of morphemes. In our work, we expect that the number of unknown tokens will be reduced if sentences are composed of subwords segmented by the language model. We propose a document embedding model with subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with the document embedding model based on morphological analysis, the document embedding model with subwords showed the highest accuracy, 78.22%. We also confirmed that subword tokenization reduced the number of unknown tokens. Using the best-performing embedding model in our bias classification task, we extract keywords based on politicians. The bias of the keywords was verified by their average similarity to the vectors of politicians from each political tendency.
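
The keyword verification step described in this abstract (average similarity between a keyword vector and the vectors of politicians from each tendency) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional vectors, the group labels, and the function names `average_similarity` and `keyword_bias` are assumptions made up for the example, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_similarity(keyword_vec, politician_vecs):
    """Mean cosine similarity between a keyword and a group of politician vectors."""
    return sum(cosine(keyword_vec, p) for p in politician_vecs) / len(politician_vecs)

def keyword_bias(keyword_vec, groups):
    """Return the group with the highest average similarity, plus all scores."""
    scores = {label: average_similarity(keyword_vec, vecs) for label, vecs in groups.items()}
    return max(scores, key=scores.get), scores

# Toy embeddings (hypothetical values, not taken from the paper).
groups = {
    "conservative": [[1.0, 0.0], [0.9, 0.1]],
    "liberal": [[0.0, 1.0], [0.1, 0.9]],
}
label, scores = keyword_bias([0.8, 0.2], groups)
print(label, scores)
```

In practice the vectors would come from the trained document embedding model rather than being written by hand.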

Effects of Association and Imagery on Word Recognition (단어재인에 미치는 연상과 심상성의 영향)

  • Kim, Min-Jung;Lee, Seung-Bok;Jung, Bum-Suk
    • Korean Journal of Cognitive Science / v.20 no.3 / pp.243-274 / 2009
  • Association, word frequency, and imagery have been considered the main factors that affect word recognition. The present study aimed to examine the imagery effect and its interaction with the association effect while controlling for the frequency effect. To explain the imagery effect, we compared two theories (dual-coding theory and the context-availability model). A lexical decision task using a priming paradigm was administered. The duration of the prime words was set to 20ms, 50ms, and 450ms in experiments 1, 2, and 3, respectively. The association and imagery of the prime words were manipulated as the main factors in each of the three experiments. In experiment 1, the prime duration (20ms) was one that is not expected to activate the semantic context enough to affect word recognition. As a result, only the imagery effect was statistically significant. In experiment 2, the prime duration was 50ms, which we expected to activate the semantic context without perceptual awareness. The results showed both the association and imagery effects. The interaction between the two effects was also significant. In experiment 3, to activate the semantic context with perceptual awareness, the prime words were presented for 450ms. Only the association effect was statistically significant in this condition. The results of the three experiments suggest that imagery exerts its influence at the early stages of word recognition, while the association effect appears later. These results imply that the two theories do not contradict each other. The dual-coding theory concerns the imagery effect, which affects the early stage of word recognition, and the context-availability model concerns the semantic context effect, which affects a later stage. To explain the word recognition process more completely, an integrated model needs to be developed that considers not only the three main effects but also the stages that extend along the time course of the process.

Building and Analyzing Panic Disorder Social Media Corpus for Automatic Deep Learning Classification Model (딥러닝 자동 분류 모델을 위한 공황장애 소셜미디어 코퍼스 구축 및 분석)

  • Lee, Soobin;Kim, Seongdeok;Lee, Juhee;Ko, Youngsoo;Song, Min
    • Journal of the Korean Society for Information Management / v.38 no.2 / pp.153-172 / 2021
  • This study aims to build a deep learning based classification model that examines the characteristics of panic disorder and classifies documents showing panic disorder tendencies, using a panic disorder corpus constructed for the study. For this purpose, 5,884 documents collected from social media were manually annotated based on the mental disease diagnosis manual and classified into panic-disorder-prone and non-panic-disorder documents. TF-IDF scores were then calculated and word co-occurrence analysis was performed to analyze the lexical characteristics of the corpus. In addition, the co-occurrence between the symptom frequency measurements and the annotated symptoms was calculated to analyze the characteristics of panic disorder symptoms and the relationships between symptoms. We also evaluated the performance of a deep learning based classification model. Three pre-trained models, BERT multi-lingual, KoBERT, and KcBERT, were adopted for classification, and KcBERT showed the best performance among them. This study shows that examining the characteristics of panic disorder can help the early diagnosis and treatment of people suffering from related symptoms and can expand the field of mental illness research to social media.
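
The lexical analysis mentioned in this abstract (TF-IDF scoring and word co-occurrence counting) can be illustrated with a small sketch. The abstract does not state which TF-IDF variant or co-occurrence window was used, so the unsmoothed formula tf × log(N/df) and document-level co-occurrence are assumptions here, as are the toy English tokens and the function names.

```python
import math
from collections import Counter
from itertools import combinations

def tf_idf(docs):
    """Per-document TF-IDF using relative term frequency and log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return scores

def cooccurrence(docs):
    """Count how many documents contain each unordered pair of distinct terms."""
    pairs = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy tokenized documents (hypothetical, not from the corpus in the paper).
docs = [
    ["panic", "attack", "breath"],
    ["panic", "heart", "breath"],
    ["movie", "weekend"],
]
scores = tf_idf(docs)
pairs = cooccurrence(docs)
print(scores[0])
print(pairs.most_common(3))
```

Terms that appear in fewer documents receive higher TF-IDF weight, and pairs are keyed in alphabetical order so that each pair is counted once.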

Learning Syntactic Constraints for Improving the Efficiency of Korean Parsing (한국어 구문분석의 효율성을 개선하기 위한 구문제약규칙의 학습)

  • Park, So-Young;Kwak, Yong-Jae;Chung, Hoo-Jung;Hwang, Young-Sook;Rim, Hae-Chang
    • Journal of KIISE:Software and Applications / v.29 no.10 / pp.755-765 / 2002
  • In this paper, we examine various kinds of syntactic information for Korean parsing and propose a method that learns constraints and uses them to improve the efficiency of a parsing model. The proposed method has three characteristics. First, it improves parsing efficiency because the constraints prevent the parser from generating unsuitable candidates. Second, it is robust on Korean sentences because the attributes for the constraints are selected based on the syntactic and lexical idiosyncrasies of Korean. Third, constraints are easy to acquire automatically from a treebank by using a decision tree learning algorithm. The experimental results show that the parser using the acquired constraints can reduce the number of overgenerated candidates to roughly 1/2 to 1/3 of the candidates, and that it runs 2 to 3 times faster than the one without any constraints.

Processing of Korean Compounds with Saisios (사이시옷이 단어 재인에 미치는 영향)

  • Bae, Sung-Bong;Yi, Kwang-Oh
    • Korean Journal of Cognitive Science / v.23 no.3 / pp.349-366 / 2012
  • Two experiments were conducted to examine the processing of Korean compounds in relation to saisios. Saisios is a letter interposed between constituents when a phonological change takes place at the onset of the first syllable of the second constituent. Writers often violate this saisios rule, so many words have two spellings: one with saisios and one without. Of the two spellings, some words are more familiar with saisios, some are usually spelled without saisios, and some are balanced. In Experiment 1, which used a go/no-go lexical decision task, participants were asked to judge compounds with and without saisios. Saisios-dominant words (나뭇잎 > 나무잎) were responded to faster when they appeared with saisios, whereas the opposite was true for words that usually appear without saisios (북엇국 < 북어국). In Experiment 2, we presented participants with compound words that were balanced on saisios. The results showed that words without saisios were responded to faster than words with saisios. To summarize, the results of Experiments 1 and 2 were consistent with the APPLE model. Some problems related to the saisios rule are discussed in terms of the reading process.

A Multi-level Representation of the Korean Narrative Text Processing and Construction-Integration Theory: Morpho-syntactic and Discourse-Pragmatic Effects of Verb Modality on Topic Continuity (한국어 서사 텍스트 처리의 다중 표상과 구성 통합 이론: 주제어 연속성에 대한 양태 어미의 형태 통사적, 담화 화용적 기능)

  • Cho Sook-Whan;Kim Say-Young
    • Korean Journal of Cognitive Science / v.17 no.2 / pp.103-118 / 2006
  • The main purpose of this paper is to investigate the effects of discourse topic and morpho-syntactic verbal information on the resolution of null pronouns in Korean narrative text within the framework of the construction-integration theory (Kintsch, 1988; Singer & Kintsch, 2001; Graesser, Gernsbacher, & Goldman, 2003). Two conditions were designed: an explicit condition with both a consistently maintained discourse topic and person-specific verb modals, and a neutral condition with no discourse topic or morpho-syntactic information provided. We measured the reading times for the target sentence containing a null pronoun, the question response times for finding an antecedent, and the accuracy rates for finding an antecedent. During the experiments, passages were presented one at a time on a computer-controlled display. Each new sentence was presented on the screen when the participant pressed a button on the computer keyboard. The main findings indicate that processing is facilitated by macro-structure (topicality) in conjunction with micro-structure (morpho-syntax) in pronoun interpretation. It is speculated that global processing alone may not be able to determine which potential antecedent is to be focused on unless aided by lexical information. It is argued that the results largely support the resonance-based model, but not the minimalist hypothesis.

Korean Part-Of-Speech Tagging by using Head-Tail Tokenization (Head-Tail 토큰화 기법을 이용한 한국어 품사 태깅)

  • Suh, Hyun-Jae;Kim, Jung-Min;Kang, Seung-Shik
    • Smart Media Journal / v.11 no.5 / pp.17-25 / 2022
  • Korean part-of-speech taggers decompose a compound morpheme into unit morphemes and attach part-of-speech tags. This has the disadvantage that the parts of speech for morphemes are over-classified in detail and complex word types are generated depending on the purpose of the tagger. When a part-of-speech tagger is used for keyword extraction in deep learning based language processing, compound particles and verb endings do not need to be decomposed. In this study, the part-of-speech tagging problem is simplified by using a Head-Tail tokenization technique that divides a word into only two types of tokens, a lexical morpheme part and a grammatical morpheme part, which solves the problem of excessively decomposed morphemes. Part-of-speech tagging was attempted with a statistical technique and a deep learning model on the Head-Tail tokenized corpus, and the accuracy of each model was evaluated. The taggers used were the TnT tagger, a statistical part-of-speech tagger, and a Bi-LSTM tagger, a deep learning based part-of-speech tagger. Both were trained on the Head-Tail tokenized corpus to measure part-of-speech tagging accuracy. The results showed that the Bi-LSTM tagger achieves a high accuracy of 99.52%, compared to 97.00% for the TnT tagger.

A Study on the Continuous Speech Recognition for the Automatic Creation of International Phonetics (국제 음소의 자동 생성을 활용한 연속음성인식에 관한 연구)

  • Kim, Suk-Dong;Hong, Seong-Soo;Shin, Chwa-Cheul;Woo, In-Sung;Kang, Heung-Soon
    • Journal of Korea Game Society / v.7 no.2 / pp.83-90 / 2007
  • One result of the trend towards globalization is an increased number of projects that focus on natural language processing. Automatic speech recognition (ASR) technologies, for example, hold great promise in facilitating global communications and collaborations. Unfortunately, to date, most research projects focus on single widely spoken languages. Therefore, the cost to adapt a particular ASR tool for use with other languages is often prohibitive. This work takes a more general approach. We propose an International Phoneticizing Engine (IPE) that interprets input files supplied in our Phonetic Language Identity (PLI) format to build a dictionary. IPE is language independent and rule based. It operates by decomposing the dictionary creation process into a set of well-defined steps. These steps reduce rule conflicts, allow for rule creation by people without linguistics training, and optimize run-time efficiency. Dictionaries created by the IPE can be used with the speech recognition system. IPE defines an easy-to-use, systematic approach that obtained recognition rates of 92.55% for Korean speech and 89.93% for English.

A Study on the Role of Models and Reformulations in L2 Learners' Noticing and Their English Writing (제2 언어학습자의 주목 및 영어 글쓰기에 대한 모델글과 재구성글의 역할에 관한 연구)

  • Hwang, Hee Jeong
    • The Journal of the Korea Contents Association / v.22 no.10 / pp.426-436 / 2022
  • This study aimed to explore the role of models and reformulations, used as feedback on English writing, in L2 learners' noticing and their writing. Ninety-two participants were placed into three groups, a models group (MG), a reformulations group (RG), and a control group (CG), and took part in a three-stage writing task. In stage 1, they were asked to write a first draft while taking notes on the problems they experienced. In stage 2, the MG was asked to compare their writing with a model text and the RG with a reformulated version of it. They were instructed to write down whatever they noticed in their comparison. The CG was asked to just read their writing. In stage 3, all the participants attempted subsequent revisions. The results indicated that all the participants noticed problematic linguistic features most often in the lexical category, and that models and reformulations led to a higher rate of noticing the problematic linguistic features reported in stage 1 and contributed to subsequent revisions. It was also revealed that the MG and RG significantly improved their writing on the post-writing test. The findings imply that models and reformulations result in better performance in L2 writing and should be promoted in English writing classes.

A Proposal of a Keyword Extraction System for Detecting Social Issues (사회문제 해결형 기술수요 발굴을 위한 키워드 추출 시스템 제안)

  • Jeong, Dami;Kim, Jaeseok;Kim, Gi-Nam;Heo, Jong-Uk;On, Byung-Won;Kang, Mijung
    • Journal of Intelligence and Information Systems / v.19 no.3 / pp.1-23 / 2013
  • To discover significant social issues, such as unemployment, economic crisis, and social welfare, that urgently need to be solved in modern society, researchers in the existing approach usually collect opinions from professional experts and scholars through online or offline surveys. However, such a method is not always effective. Because of the expense, a large number of survey replies are seldom gathered. In some cases, it is also hard to find professionals who deal with specific social issues. Thus, the sample set is often small and may be biased. Furthermore, several experts may draw totally different conclusions about a social issue because each expert has a subjective point of view and a different background. In this case, it is considerably hard to figure out what the current social issues are and which of them are really important. To overcome the shortcomings of the current approach, in this paper we develop a prototype system that semi-automatically detects keywords representing social issues and problems from about 1.3 million news articles published by about 10 major domestic presses in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting news articles and extracting texts from them, (2) identifying only news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of our proposed matching algorithm is to best match paragraphs to each topic.
    Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which has relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA produces a set of topic clusters, and each topic cluster is then labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and a human annotator labels Topic1 "Unemployment Problem". In this example, it is non-trivial to understand what happened to the unemployment problem in our society. In other words, by looking only at social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop a matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) topic terms and (ii) their probability values. For instance, given a set of text documents, we segment each text document into paragraphs. In the meantime, using LDA, we extract a set of topics from the text documents. Based on our matching process, each paragraph is assigned to the topic it best matches. Finally, each topic has several best-matched paragraphs. Suppose, for example, that there are a topic (e.g., Unemployment Problem) and its best-matched paragraph (e.g., Up to 300 workers lost their jobs in XXX company at Seoul). In this case, we can grasp the detailed information behind the social keyword, such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Therefore, through our matching process and keyword visualization, most researchers will be able to detect social issues easily and quickly.
    Through this prototype system, we have detected various social issues appearing in our society and have also shown the effectiveness of our proposed methods in our experimental results. Note that our proof-of-concept system is available at http://dslab.snu.ac.kr/demo.html.
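
The paragraph-to-topic matching idea in this abstract can be illustrated with a small sketch. The abstract says only that the algorithm computes the probability of a paragraph given a topic from the topic terms and their probability values, so the scoring below (a unigram log-likelihood with an assumed floor probability `FLOOR` for words outside the topic's term list) is one plausible reading, not the authors' algorithm. Topic1's values come from the abstract; the second topic, the paragraph tokens, and the function names are hypothetical.

```python
import math

FLOOR = 1e-6  # assumed probability for words that are not among a topic's terms

def log_likelihood(paragraph_tokens, topic):
    """Log-probability of a paragraph under a topic's term distribution."""
    return sum(math.log(topic.get(tok, FLOOR)) for tok in paragraph_tokens)

def best_topic(paragraph_tokens, topics):
    """Return the label of the topic under which the paragraph is most likely."""
    return max(topics, key=lambda name: log_likelihood(paragraph_tokens, topics[name]))

topics = {
    # Term probabilities for Topic1 as given in the abstract.
    "Unemployment Problem": {"unemployment": 0.4, "layoff": 0.3, "business": 0.3},
    # Hypothetical second topic, added only so there is something to compare against.
    "Social Welfare": {"welfare": 0.5, "pension": 0.3, "care": 0.2},
}

paragraph = ["layoff", "business", "unemployment", "seoul"]
match = best_topic(paragraph, topics)
print(match)
```

Working in log space avoids numerical underflow on long paragraphs, and the floor keeps a single out-of-topic word from driving the score to negative infinity.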