• Title/Summary/Keyword: text vector

Search Result 284, Processing Time 0.024 seconds

Expansion of Word Representation for Named Entity Recognition Based on Bidirectional LSTM CRFs (Bidirectional LSTM CRF 기반의 개체명 인식을 위한 단어 표상의 확장)

  • Yu, Hongyeon;Ko, Youngjoong
    • Journal of KIISE
    • /
    • v.44 no.3
    • /
    • pp.306-313
    • /
    • 2017
  • Named entity recognition (NER) seeks to locate and classify named entities in text into pre-defined categories such as names of persons, organizations, locations, expressions of times, etc. Recently, many state-of-the-art NER systems have been implemented with bidirectional LSTM CRFs. Deep learning models based on long short-term memory (LSTM) generally depend on word representations as input. In this paper, we propose an approach to expand word representation by using pre-trained word embedding, part of speech (POS) tag embedding, syllable embedding and named entity dictionary feature vectors. Our experiments show that the proposed approach creates useful word representations as an input of bidirectional LSTM CRFs. Our final presentation shows its efficacy to be 8.05%p higher than baseline NERs with only the pre-trained word embedding vector.

Web Attack Classification Model Based on Payload Embedding Pre-Training (페이로드 임베딩 사전학습 기반의 웹 공격 분류 모델)

  • Kim, Yeonsu;Ko, Younghun;Euom, Ieckchae;Kim, Kyungbaek
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.30 no.4
    • /
    • pp.669-677
    • /
    • 2020
  • As the number of Internet users exploded, attacks on the web increased. In addition, the attack patterns have been diversified to bypass existing defense techniques. Traditional web firewalls are difficult to detect attacks of unknown patterns.Therefore, the method of detecting abnormal behavior by artificial intelligence has been studied as an alternative. Specifically, attempts have been made to apply natural language processing techniques because the type of script or query being exploited consists of text. However, because there are many unknown words in scripts and queries, natural language processing requires a different approach. In this paper, we propose a new classification model which uses byte pair encoding (BPE) technology to learn the embedding vector, that is often used for web attack payloads, and uses an attention mechanism-based Bi-GRU neural network to extract a set of tokens that learn their order and importance. For major web attacks such as SQL injection, cross-site scripting, and command injection attacks, the accuracy of the proposed classification method is about 0.9990 and its accuracy outperforms the model suggested in the previous study.

The Implementation of the Digital watermarking for 3D Polygonal Model (3차원 형상 모델의 디지털 워터마킹 구현)

  • Kim, Sun-Hyung;Lee, Sun-Heum;Kim, Gee-Seog;Ahn, Deog-Sang
    • The KIPS Transactions:PartD
    • /
    • v.9D no.5
    • /
    • pp.925-930
    • /
    • 2002
  • This paper discusses techniques for embedding data into 3D polygonal models of geometry. Much researches of Watermarking had been gone as element technology of DRM (digital rights management). But, few research had gone to 3D polygonal model. Most research is limited at text document, 2D image, animation, music etc. RP system is suitable a few production in various goods species, and it is used much in industry to possible reason that produce prototype and find error or incongruent factor at early stage on design in product development childhood. This paper is research about method that insert watermark in STL ( stereolithography) file that have 3D shape model. Proposed algorithm inserts watermark in normal vector region and facet's interior region of 3D shape data. For this reason, 3D shape does not produce some flexure and fulfill invisibility of watermark. Experiment results that insert and extract watermark in normal netter region and facet's Interior region of 3D shape data by proposed algorithm do not influence entirely in 3D shape and show that insertion and extraction of watermark are possible.

Optical Character Recognition for Hindi Language Using a Neural-network Approach

  • Yadav, Divakar;Sanchez-Cuadrado, Sonia;Morato, Jorge
    • Journal of Information Processing Systems
    • /
    • v.9 no.1
    • /
    • pp.117-140
    • /
    • 2013
  • Hindi is the most widely spoken language in India, with more than 300 million speakers. As there is no separation between the characters of texts written in Hindi as there is in English, the Optical Character Recognition (OCR) systems developed for the Hindi language carry a very poor recognition rate. In this paper we propose an OCR for printed Hindi text in Devanagari script, using Artificial Neural Network (ANN), which improves its efficiency. One of the major reasons for the poor recognition rate is error in character segmentation. The presence of touching characters in the scanned documents further complicates the segmentation process, creating a major problem when designing an effective character segmentation technique. Preprocessing, character segmentation, feature extraction, and finally, classification and recognition are the major steps which are followed by a general OCR. The preprocessing tasks considered in the paper are conversion of gray scaled images to binary images, image rectification, and segmentation of the document's textual contents into paragraphs, lines, words, and then at the level of basic symbols. The basic symbols, obtained as the fundamental unit from the segmentation process, are recognized by the neural classifier. In this work, three feature extraction techniques-: histogram of projection based on mean distance, histogram of projection based on pixel value, and vertical zero crossing, have been used to improve the rate of recognition. These feature extraction techniques are powerful enough to extract features of even distorted characters/symbols. For development of the neural classifier, a back-propagation neural network with two hidden layers is used. The classifier is trained and tested for printed Hindi texts. A performance of approximately 90% correct recognition rate is achieved.

A Fast stream cipher Canon (고속 스트림 암호 Canon)

  • Kim, Gil-Ho
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.17 no.7
    • /
    • pp.71-79
    • /
    • 2012
  • Propose stream cipher Canon that need in Wireless sensor network construction that can secure confidentiality and integrity. Create Canon 128 bits streams key by 128 bits secret key and 128 bits IV, and makes 128 bits cipher text through whitening processing with produced streams key and 128 bits plaintext together. Canon for easy hardware implementation and software running fast algorithm consists only of simple logic operations. In particular, because it does not use S-boxes for non-linear operations, hardware implementation is very easy. Proposed stream cipher Canon shows fast speed test results performed better than AES, Salsa20, and gate number is small than Trivium. Canon purpose of the physical environment is very limited applications, mobile phones, wireless Internet environment, DRM (Digital Right Management), wireless sensor networks, RFID, and use software and hardware implementation easy 128 bits stream ciphers.

An Analytical Study on Automatic Classification of Domestic Journal articles Based on Machine Learning (기계학습에 기초한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.2
    • /
    • pp.37-62
    • /
    • 2018
  • This study examined the factors affecting the performance of automatic classification based on machine learning for domestic journal articles in the field of LIS. In particular, In view of the classification performance that assigning automatically the class labels to the articles in "Journal of the Korean Society for Information Management", I investigated the characteristics of the key factors(weighting schemes, training set size, classification algorithms, label assigning methods) through the diversified experiments. Consequently, It is effective to apply each element appropriately according to the classification environment and the characteristics of the document set, and a fairly good performance can be obtained by using a simpler model. In addition, the classification of domestic journals can be considered as a multi-label classification that assigns more than one category to a specific article. Therefore, I proposed an optimal classification model using simple and fast classification algorithm and small learning set considering this environment.

A Novel Sub-image Retrieval Approach using Dot-Matrix (점 행렬을 이용한 새로운 부분 영상 검색 기법)

  • Kim, Jun-Ho;Kang, Kyoung-Min;Lee, Do-Hoon
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.13 no.3
    • /
    • pp.1330-1336
    • /
    • 2012
  • The Image retrieval has been study different approaches which are text-based, contents-based, area-based method and sub-image finding. The sub-image retrieval is to find a query image in the target one. In this paper, we propose a novel sub-image retrieval algorithm by Dot-Matrix method to be used in the bioinformatics. Dot-Matrix is a method to evaluate similarity between two sequences and we redefine the problem for retrieval of sub-image to the finding similarity of two images. For the approach, the 2 dimensional array of image converts a the vector which has gray-scale value. The 2 converted images align by dot-matrix and the result shows candidate sub-images. We used 10 images as target and 5 queries: duplicated, small scaled, and large scaled images included x-axes and y-axes scaled one for experiment.

Hybrid Word-Character Neural Network Model for the Improvement of Document Classification (문서 분류의 개선을 위한 단어-문자 혼합 신경망 모델)

  • Hong, Daeyoung;Shim, Kyuseok
    • Journal of KIISE
    • /
    • v.44 no.12
    • /
    • pp.1290-1295
    • /
    • 2017
  • Document classification, a task of classifying the category of each document based on text, is one of the fundamental areas for natural language processing. Document classification may be used in various fields such as topic classification and sentiment classification. Neural network models for document classification can be divided into two categories: word-level models and character-level models that treat words and characters as basic units respectively. In this study, we propose a neural network model that combines character-level and word-level models to improve performance of document classification. The proposed model extracts the feature vector of each word by combining information obtained from a word embedding matrix and information encoded by a character-level neural network. Based on feature vectors of words, the model classifies documents with a hierarchical structure wherein recurrent neural networks with attention mechanisms are used for both the word and the sentence levels. Experiments on real life datasets demonstrate effectiveness of our proposed model.

Research on Designing Korean Emotional Dictionary using Intelligent Natural Language Crawling System in SNS (SNS대상의 지능형 자연어 수집, 처리 시스템 구현을 통한 한국형 감성사전 구축에 관한 연구)

  • Lee, Jong-Hwa
    • The Journal of Information Systems
    • /
    • v.29 no.3
    • /
    • pp.237-251
    • /
    • 2020
  • Purpose The research was studied the hierarchical Hangul emotion index by organizing all the emotions which SNS users are thinking. As a preliminary study by the researcher, the English-based Plutchick (1980)'s emotional standard was reinterpreted in Korean, and a hashtag with implicit meaning on SNS was studied. To build a multidimensional emotion dictionary and classify three-dimensional emotions, an emotion seed was selected for the composition of seven emotion sets, and an emotion word dictionary was constructed by collecting SNS hashtags derived from each emotion seed. We also want to explore the priority of each Hangul emotion index. Design/methodology/approach In the process of transforming the matrix through the vector process of words constituting the sentence, weights were extracted using TF-IDF (Term Frequency Inverse Document Frequency), and the dimension reduction technique of the matrix in the emotion set was NMF (Nonnegative Matrix Factorization) algorithm. The emotional dimension was solved by using the characteristic value of the emotional word. The cosine distance algorithm was used to measure the distance between vectors by measuring the similarity of emotion words in the emotion set. Findings Customer needs analysis is a force to read changes in emotions, and Korean emotion word research is the customer's needs. In addition, the ranking of the emotion words within the emotion set will be a special criterion for reading the depth of the emotion. The sentiment index study of this research believes that by providing companies with effective information for emotional marketing, new business opportunities will be expanded and valued. In addition, if the emotion dictionary is eventually connected to the emotional DNA of the product, it will be possible to define the "emotional DNA", which is a set of emotions that the product should have.

Implementation of Image Compression and Searching System using Wavelet Transform (Wavelet 변환을 이용한 영상압축 및 검색 시스템의 구현)

  • Yoon, Jung-Mo;Kim, Sang-Yeon
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.38 no.4
    • /
    • pp.50-58
    • /
    • 2001
  • The image information, used most frequently in multimedia, is visual and spatial information. It has several characters including the diversity of storage and output methods, large capacity, spatial relationship expression, and irregularity. Therefore, the various researches for methods of storing efficiently, managing, searching such image data are going on. And recently, it has arisen the movement of international standardization, MPEG-7 for searching contents base in multimedia environment. Especially, the research for implementation of more effective image database searching system important subject, because the practical image search system which can storage a lot of image information as database and query, search them has not generalized. Now the image search system based on text has researched to high degree, but it has many shortages so that nowadays the researches for searching system based on contents are going on. This research has used the wavelet conversion largely using in image processing instead of DCT method largely using in existent system, and so it had met similar and precise results than prior methods by image compression and extraction of specific vector.

  • PDF