• Title/Summary/Keyword: sentence processing

Search Results: 324

CNN Architecture Predicting Movie Rating from Audience's Reviews Written in Korean (한국어 관객 평가기반 영화 평점 예측 CNN 구조)

  • Kim, Hyungchan;Oh, Heung-Seon;Kim, Duksu
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.9 no.1
    • /
    • pp.17-24
    • /
    • 2020
  • In this paper, we present a movie rating prediction architecture based on a convolutional neural network (CNN). Our prediction architecture extends TextCNN, a popular CNN-based architecture for sentence classification, in three aspects. First, character embeddings are utilized to cover the many variants of words, since reviews are short and not linguistically well-formed. Second, an attention mechanism (squeeze-and-excitation) is adopted to focus on important features. Third, a scoring function is proposed to convert the output of an activation function into a review score in a fixed range (1-10). We evaluated our prediction architecture on a movie review dataset and achieved a lower MSE (3.3841) than an existing method, which demonstrates the superiority of our movie rating prediction architecture.
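The third extension, the scoring function, can be sketched as a sigmoid squashing of the network output into the 1-10 rating range. This is an illustrative guess at the idea, not the paper's exact formula:

```python
import math

def review_score(activation: float, low: float = 1.0, high: float = 10.0) -> float:
    """Map an unbounded activation to a bounded review score in [low, high].

    A minimal sketch assuming a sigmoid squashing; the paper does not
    publish its exact scoring function here.
    """
    return low + (high - low) / (1.0 + math.exp(-activation))
```

An activation of 0 lands at the midpoint of the range, and extreme activations saturate at the boundary ratings.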

Processing Korean Relative Adnominal Clauses (한국어 관계관형절의 전산처리)

  • Hong, Jung-Ha;Lee, Ki-Yong
    • Annual Conference on Human and Language Technology
    • /
    • 1999.10e
    • /
    • pp.265-271
    • /
    • 1999
  • The aim of this paper is to present a syntactic-semantic representation model suitable for the computational processing of Korean relative adnominal clauses and to verify the result through a computational implementation. To this end, the paper addresses the syntactic-semantic representation and computational implementation of relative adnominal clauses with a focus on two problems. First, the head noun modified by a relative adnominal clause is an argument that plays a different semantic role in the relative clause and in the matrix sentence; that is, a single argument represents two theta roles. The first task of this paper is to find a way to represent this dual theta role of the head noun in relative adnominal clause constructions. Second, when a relative adnominal clause is formed with a one-place predicate, not only can the predicate alone modify the head noun, but a double-nominative construction can also be relativized to modify the head noun. However, since not all one-place predicates can form double-nominative constructions, it is necessary to distinguish the cases where relativization of a double-nominative construction is possible from those where it is not. The second task of this paper is to deal with the relativization of such double-nominative constructions and their representation. Rather than merely describing these problems, this paper presents solutions through a computational implementation. For this purpose, Malaga, a grammar development tool language developed by extending the C language, is used as the implementation tool, and the analysis results are specified as feature structures to examine their validity.

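The dual-theta-role representation described above can be illustrated with a toy feature structure. The lemma, predicates, and roles below are hypothetical examples, and Python stands in for the paper's Malaga implementation:

```python
# Hypothetical feature structure for a head noun that is the agent of
# the relative clause's predicate but the theme of the matrix predicate.
head_noun = {
    "lemma": "student",
    "roles": {
        "relative_clause": {"predicate": "read", "theta": "agent"},
        "matrix_sentence": {"predicate": "meet", "theta": "theme"},
    },
}

def theta_roles(feature_structure):
    """Collect the theta roles the head noun bears across both clauses."""
    return {ctx["theta"] for ctx in feature_structure["roles"].values()}
```

The point is that one argument node carries two role assignments, one per clause, rather than a single theta role.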

Development of Optimum Rutin Extraction Process from Fagopyrum tataricum (쓴 메밀에서의 루틴 추출 최적 공정 개발)

  • Yoon, Seong-Jun;Cho, Nam-Ji;Na, Seog-Hwan;Kim, Young-Ho;Kim, Young-Mo
    • Journal of the East Asian Society of Dietary Life
    • /
    • v.16 no.5
    • /
    • pp.573-577
    • /
    • 2006
  • The rutin content of Fagopyrum tataricum is 100-fold higher than that of Fagopyrum esculentum. For the development of a rutin-containing beverage, a suitable method to extract rutin from buckwheat (Fagopyrum tataricum) with a high rutin yield was investigated. A roasting temperature setting of 310/240°C was considered the best with respect to the basic color reference. Rutin content decreased with increasing roasting time and heating temperature. The optimal extraction temperature and processing time to maximize the rutin concentration in the extract were 80°C and 10 minutes.


Word Sense Classification Using Support Vector Machines (지지벡터기계를 이용한 단어 의미 분류)

  • Park, Jun Hyeok;Lee, Songwook
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.5 no.11
    • /
    • pp.563-568
    • /
    • 2016
  • The word sense disambiguation problem is to find, in a given sentence, the correct sense of an ambiguous word that has multiple senses in a dictionary. We treat this as a multi-class classification problem and classify the ambiguous word using Support Vector Machines. Context words of the ambiguous word, extracted from the Sejong sense-tagged corpus, are represented in two kinds of vector spaces. One is composed of context-word vectors with binary weights; in the other, the context words are mapped by a word embedding model. In experiments, we achieved an accuracy of 87.0% with context-word vectors and 86.0% with the word embedding model.
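The binary-weight context representation can be sketched as follows; the vocabulary and context words are hypothetical, and a real system would feed these vectors to an SVM:

```python
def binary_context_vector(context_words, vocabulary):
    """One binary feature per vocabulary word: 1 if the word occurs in
    the ambiguous word's context, else 0. A minimal sketch of the first
    feature representation described in the abstract."""
    present = set(context_words)
    return [1 if word in present else 0 for word in vocabulary]
```

For the ambiguous word "bank", a context mentioning "river" and "water" would light up only those dimensions of the vector.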

Generating Test Data for Deep Neural Network Model using Synonym Replacement (동의어 치환을 이용한 심층 신경망 모델의 테스트 데이터 생성)

  • Lee, Min-soo;Lee, Chan-gun
    • Journal of Software Engineering Society
    • /
    • v.28 no.1
    • /
    • pp.23-28
    • /
    • 2019
  • Recently, to effectively test deep neural network models for image processing applications, studies have been actively conducted on automatically generating corner-case data that the model does not predict correctly. This paper proposes a test data generation method that selects arbitrary words from the system input and replaces them with synonyms, in order to test a bug-reporter automatic assignment system based on a sentence classification deep neural network model. In addition, we compare and evaluate the proposed test data generation against existing difference-inducing test data generation methods using various neuron coverage metrics.
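The synonym-replacement generation step might look like the sketch below; the synonym table is a hypothetical stand-in for whatever lexical resource the system actually uses:

```python
# Hypothetical synonym table; a real system would draw on a thesaurus
# or word-embedding neighbors.
SYNONYMS = {"crash": ["failure", "fault"], "screen": ["display"]}

def synonym_test_cases(sentence, synonyms=SYNONYMS):
    """Generate test variants of `sentence` by replacing one word at a
    time with each of its synonyms. A sentence classifier should
    normally keep its prediction stable across such variants."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants
```

Each variant differs from the original by exactly one word, which is what makes a changed prediction a candidate bug.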

SMS Text Messages Filtering using Word Embedding and Deep Learning Techniques (워드 임베딩과 딥러닝 기법을 이용한 SMS 문자 메시지 필터링)

  • Lee, Hyun Young;Kang, Seung Shik
    • Smart Media Journal
    • /
    • v.7 no.4
    • /
    • pp.24-29
    • /
    • 2018
  • Text analysis techniques for natural language processing in deep learning represent words in vector form through word embedding. In this paper, we propose a method of constructing a document vector and classifying it as spam or a normal text message using word embedding and deep learning methods. Automatic word spacing applied in the preprocessing step ensures that words with similar contexts are represented adjacently in the vector space. In addition, intentional word-formation errors with non-alphabetic or unusual characters are designed to avoid being blocked by a spam message filter. Two embedding algorithms, CBOW and skip-gram, are used to produce the sentence vector, and the performance and accuracy of the deep learning-based spam filter model are measured against those of SVM Light.
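A common way to turn CBOW or skip-gram word vectors into a sentence vector is simple averaging; the toy two-dimensional embeddings below are illustrative stand-ins for trained vectors:

```python
# Toy embeddings; trained CBOW/skip-gram vectors would have hundreds of
# dimensions and be learned from a corpus.
EMBEDDINGS = {"free": [1.0, 0.0], "prize": [0.8, 0.2], "hello": [0.0, 1.0]}

def sentence_vector(tokens, embeddings=EMBEDDINGS, dim=2):
    """Average the word vectors of known tokens into one sentence vector
    suitable as input to a spam/ham classifier."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Averaging is only one pooling choice; it keeps the sentence vector in the same space as the word vectors, so nearby sentences share vocabulary or context.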

A review of Chinese named entity recognition

  • Cheng, Jieren;Liu, Jingxin;Xu, Xinbin;Xia, Dongwan;Liu, Le;Sheng, Victor S.
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.6
    • /
    • pp.2012-2030
    • /
    • 2021
  • Named Entity Recognition (NER) identifies entity nouns in a corpus, such as Location, Person, and Organization. NER is also an important basis for research in various natural language fields. Chinese NER poses some unique difficulties; for example, there is no obvious segmentation boundary between the characters in a Chinese sentence, so the Chinese NER task is often combined with Chinese word segmentation. In response to these problems, we summarize the recognition methods for Chinese NER. In this review, we first introduce the sequence labeling scheme and evaluation metrics of NER. Then, we divide Chinese NER methods into rule-based methods, statistics-based machine learning methods, and deep learning-based methods. Subsequently, we analyze in detail the model frameworks based on deep learning and the typical Chinese NER methods. Finally, we put forward the current challenges and future research directions of Chinese NER technology.
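The character-level sequence labeling scheme such reviews introduce can be illustrated with a small BIO decoder; the sentence and tags below are a toy example, not from the paper:

```python
def bio_decode(chars, tags):
    """Recover (entity, type) spans from character-level BIO tags, the
    labeling scheme commonly used for Chinese NER because there are no
    word boundaries to segment on."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):           # a new entity starts here
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current:  # continue the open entity
            current.append(ch)
        else:                               # O tag: close any open entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities
```

Tagging each character rather than each word is what lets NER and word segmentation be handled jointly.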

An Efficient Machine Learning-based Text Summarization in the Malayalam Language

  • P Haroon, Rosna;Gafur M, Abdul;Nisha U, Barakkath
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.16 no.6
    • /
    • pp.1778-1799
    • /
    • 2022
  • Automatic text summarization is a procedure that condenses a large text into a shorter one that retains the significant information. Malayalam is one of the most difficult languages used in certain areas of India, most commonly in Kerala and Lakshadweep. Natural language processing for Malayalam is relatively underdeveloped due to the complexity of the language as well as the scarcity of available resources. In this paper, an approach is proposed for summarizing Malayalam documents by training a model based on the Support Vector Machine classification algorithm. Different features of the text are taken into account when training the machine so that the system can output the most important information from the input text. The classifier places the most important, important, average, and least significant sentences into separate classes, and on this basis the machine creates a summary of the input document. The user can select a compression ratio so that the system outputs that fraction of the text as the summary. Model performance is measured using different genres of Malayalam documents as well as documents from the same domain. The model is evaluated with the content evaluation measures precision, recall, F-score, and relative utility. The obtained precision and recall values show that the model is reliable and more relevant than the other summarizers.
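The compression-ratio step can be sketched as selecting the top-scoring fraction of sentences; the scores below are hypothetical stand-ins for the SVM classifier's importance classes:

```python
def extractive_summary(sentences, scores, compression_ratio=0.5):
    """Keep the top-scoring fraction of sentences, in original order.
    A minimal sketch: `scores` stand in for the importance classes the
    paper's SVM classifier assigns to each sentence."""
    keep = max(1, round(len(sentences) * compression_ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

Re-sorting the selected indices preserves the reading order of the source document, which matters for the coherence of an extractive summary.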

Compressing intent classification model for multi-agent in low-resource devices (저성능 자원에서 멀티 에이전트 운영을 위한 의도 분류 모델 경량화)

  • Yoon, Yongsun;Kang, Jinbeom
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.3
    • /
    • pp.45-55
    • /
    • 2022
  • Recently, large-scale language models (LPLM) have shown state-of-the-art performance in various natural language processing tasks, including intent classification. However, fine-tuning an LPLM requires high computational cost for training and inference, which is not appropriate for dialog systems. In this paper, we propose a compressed intent classification model for multi-agent operation in low-resource environments such as CPU-only devices. Our method consists of two stages. First, we train a sentence encoder from the LPLM and then compress it through knowledge distillation. Second, we train an agent-specific adapter for intent classification. Results on three intent classification datasets show that our method achieves 98% of the accuracy of the LPLM with only 21% of its size.
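Knowledge distillation of the kind described typically minimizes a KL divergence between temperature-softened teacher and student distributions; this generic sketch is an assumption about the setup, not the paper's exact objective:

```python
import math

def softened_softmax(logits, temperature):
    """Softmax over logits divided by a temperature; higher temperatures
    flatten the distribution, exposing the teacher's 'dark knowledge'."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at the softened temperature; zero when the
    student exactly matches the teacher."""
    p = softened_softmax(teacher_logits, temperature)
    q = softened_softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Training the smaller encoder to minimize this loss transfers the teacher's output geometry, which is how the compressed model can retain most of the LPLM's accuracy.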

Phrase-Chunk Level Hierarchical Attention Networks for Arabic Sentiment Analysis

  • Abdelmawgoud M. Meabed;Sherif Mahdy Abdou;Mervat Hassan Gheith
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.9
    • /
    • pp.120-128
    • /
    • 2023
  • In this work, we present ATSA, a hierarchical attention deep learning model for Arabic sentiment analysis. ATSA addresses several challenges and limitations that arise when classical models are applied to opinion mining in Arabic. Arabic-specific challenges, including morphological complexity and language sparsity, are addressed by modeling semantic composition at the level of Arabic morphological analysis after tokenization. ATSA performs phrase-chunk sentiment embedding to provide a broader set of features covering syntactic, semantic, and sentiment information. We used a phrase structure parser to generate syntactic parse trees that serve as a reference for ATSA. This allows modeling semantic and sentiment composition following the natural order in which words and phrase-chunks are combined in a sentence. The proposed model was evaluated on three Arabic corpora corresponding to different genres (newswire, online comments, and tweets) and different writing styles (MSA and dialectal Arabic). Experiments showed that each of the proposed contributions in ATSA achieves a significant improvement, and the combination of all contributions, which constitutes the complete ATSA model, improves classification accuracy by 3% and 2% on the Tweets and Hotel reviews datasets, respectively, compared to existing models.
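One attention layer in such a hierarchy can be sketched as a softmax-weighted pooling of phrase-chunk vectors; the vectors and relevance scores below are hypothetical rather than learned:

```python
import math

def attention_pool(chunk_vectors, relevance_scores):
    """Softmax-weighted average of phrase-chunk vectors: a minimal
    sketch of one attention layer in a hierarchical attention network.
    In a trained model the scores come from a learned scoring function."""
    exps = [math.exp(s) for s in relevance_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(chunk_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, chunk_vectors))
            for i in range(dim)]
```

Chunks with higher relevance scores dominate the pooled sentence representation, which is what lets the model focus on sentiment-bearing phrases.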