• Title/Summary/Keyword: character encoding

A Method for Automatic Detection of Character Encoding of Multi Language Document File (다중 언어로 작성된 문서 파일에 적용된 문자 인코딩 자동 인식 기법)

  • Seo, Min Ji;Kim, Myung Ho
    • KIISE Transactions on Computing Practices / v.22 no.4 / pp.170-177 / 2016
  • Character encoding converts a document into a binary document file, using a code table, so that it can be stored on a computer. To decode a binary document file back into readable text, one must know the code table that was applied to the file at the encoding stage. Identifying the code table used to encode the file is thus an essential part of decoding. In this paper, we propose a method for automatically detecting the character code of a given binary document file. The method combines several techniques to increase the detection rate: character code range detection, escape character detection, character code characteristic detection, and commonly used word detection. The commonly used word detection draws on multiple word databases, so the method achieves a much higher detection rate for multi-language files than other methods. When a language accounts for less than 20% of the document, the conventional method recognizes the encoding in about 50% of cases. The proposed method reaches up to 96% encoding recognition regardless of the proportion of each language.
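Two of the techniques named in the abstract above, escape character detection and character code range detection, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the small set of candidate encodings, and the order of the checks are assumptions made for illustration, and the word-database step is omitted entirely.

```python
from typing import Optional

# Escape sequences that designate a character set in ISO-2022 encodings.
# The selection of signatures here is an illustrative assumption.
ESCAPE_SIGNATURES = {
    b"\x1b$)C": "iso-2022-kr",
    b"\x1b$B": "iso-2022-jp",
    b"\x1b$@": "iso-2022-jp",
}


def in_euc_kr_range(data: bytes) -> bool:
    """Return True if every non-ASCII byte pair lies in 0xA1-0xFE / 0xA1-0xFE."""
    i = 0
    saw_pair = False
    while i < len(data):
        if data[i] < 0x80:          # ASCII byte, skip it
            i += 1
            continue
        if i + 1 >= len(data):      # lead byte without a trail byte
            return False
        lead, trail = data[i], data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return False
        saw_pair = True
        i += 2
    return saw_pair


def detect_encoding(data: bytes) -> Optional[str]:
    """Guess an encoding name, or return None when no check matches."""
    # Escape character detection: look for ISO-2022 designators.
    for signature, name in ESCAPE_SIGNATURES.items():
        if signature in data:
            return name
    # Range detection: pure 7-bit data is treated as ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Strict UTF-8 decoding rejects byte sequences outside the UTF-8 ranges.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if in_euc_kr_range(data):
        return "euc-kr"
    return None


print(detect_encoding("한글 text".encode("euc-kr")))
print(detect_encoding("한글 text".encode("utf-8")))
```

Range checks alone cannot separate encodings that share the same byte ranges (EUC-KR and EUC-JP, for example), which is the gap the abstract's commonly used word detection is meant to close.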

A Character Shape Encoding Method to Input Chinese Characters in Old Documents (고문헌 벽자(僻字) 입력을 위한 한자 자형 부호화 방법)

  • Kim, Kiwang
    • Journal of Korean Medical classics / v.32 no.1 / pp.105-116 / 2019
  • Objectives: Ancient classic literature contains many rare Chinese characters, so-called Byeokja (僻字), and Chinese characters that are not registered in Unicode, as well as variant characters (heterogeneous characters) that cannot be found in current font sets, appear often. To register all possible Chinese characters, including these, as units of information exchange, this study proposes a method for encoding the morphological information of Chinese characters according to fixed rules. Methods: This study proposes methods for encoding the connections between the nodules that constitute a Chinese character and the coordinates of those nodules. It also adds rules for expressing information about curves, for expressing the aspect ratios of characters, for minimizing coordinate lines, and for expressing the aggregation status of character components. Results: The proposed method can generate codes of a certain length by extracting only the information that expresses the morphological configuration of characters. Conclusions: The character encoding method proposed in this study can be used to distinguish variant characters with small variations in Byeokja, new Chinese characters, and character strokes, and to store and search them.

Noise additived image encoding By EZW algorithm (EZW를 이용한 잡음 영상의 부호화)

  • 김형준;김재필;김향진;김영애;임재윤
    • Proceedings of the IEEK Conference / 2000.06d / pp.27-30 / 2000
  • In this paper, we propose a new method that performs denoising as part of image compression. Usually, compressing a noisy image requires a denoising step before encoding. Because the proposed method has an embedded character, it needs no additional noise eliminator. In the SAQ step, the embedded signal is quantized in more detail while the rest is suppressed. Compared with the conventional method, the proposed method yields better image quality.

Encoding and language detection of text document using Deep learning algorithm (딥러닝 알고리즘을 이용한 문서의 인코딩 및 언어 판별)

  • Kim, Seonbeom;Bae, Junwoo;Park, Heejin
    • The Journal of Korean Institute of Next Generation Computing / v.13 no.5 / pp.124-130 / 2017
  • Character encoding is the method used to represent characters or symbols on a computer, and there are many encoding detection software tools. For the widely used encoding detection software "uchardet", the accuracy of encoding detection on unmodified normal text documents is 91.39%, but the accuracy of language detection is only 32.09%. If a text document is encrypted by substitution, the accuracy of encoding detection falls to 3.55% and the accuracy of language detection to 0.06%. Therefore, in this paper, we propose encoding and language detection of text documents using the deep learning algorithm LSTM (Long Short-Term Memory). The LSTM results are better than those of "uchardet". On normal text documents, the LSTM achieves 99.89% accuracy for encoding detection and 99.92% for language detection. On text documents encrypted by substitution, it achieves 99.26% for encoding detection and 99.77% for language detection.

A Study on the Hangul Character Code System for KS X 1001 Information Interchange considering AMI/HDB-3 Line Encoding and HDLC Flag (AMI/HDB-3 회선부호화 및 HDLC FLAG를 고려한 KS X 1001 정보교환용 한글낱자 부호체계 개선연구)

  • Woo, Je-Teak;Hong, Wan-Pyo
    • The Journal of the Korea institute of electronic communication sciences / v.10 no.1 / pp.65-72 / 2015
  • AMI/HDB-3, a line encoding method that uses a scrambling technique, is used primarily for long-distance data transmission. For the Hangul character code defined in the information interchange code standard (KS X 1001; confirmed 2014), this paper presents a new code set of Hangul consonant and vowel tables that increases data transmission efficiency with respect to HDLC flag bit or character stuffing at the data link layer and AMI/HDB-3 scrambling at the physical layer. A comparison of the existing code set and the new one, based on the ($4 \times 4$) bit source coding rules and usage-frequency statistics for the Hangul consonant and vowel tables, showed that data processing efficiency improved by about 22.01%.

Feature based Text Watermarking in Digital Binary Image (이진 문서 영상에서의 특징 기반 텍스트 워터마킹)

  • 공영민;추현곤;최종욱;김희율
    • Proceedings of the IEEK Conference / 2002.06d / pp.359-362 / 2002
  • In this paper, we propose a new feature-based text watermarking method for binary text images. The structure of specific characters in the preprocessed text image is modified to embed the watermark. Watermark messages are embedded and detected by three methods: hole line disconnect, which uses the connectivity of a character containing a hole; center line shift, which uses the hole area; and differential encoding, which uses the difference of flippable score points. Experimental results show that the proposed method is robust to rotation and scaling distortion.

Character-Level Neural Machine Translation (문자 단위의 Neural Machine Translation)

  • Lee, Changki;Kim, Junseok;Lee, Hyoung-Gyu;Lee, Jaesong
    • Annual Conference on Human and Language Technology / 2015.10a / pp.115-118 / 2015
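  • Neural Machine Translation (NMT) is an end-to-end machine translation model that uses only a single neural network. Compared with conventional Statistical Machine Translation (SMT) models, it achieves higher performance, requires no feature engineering, and has a simple decoder structure because a single neural network serves as both the translation model and the language model. However, NMT training and decoding slow down in proportion to the size of the target vocabulary, which limits how large that vocabulary can be. To address this limitation, this paper proposes a method that reads (encodes) the input language in word units and generates (decodes) the output language in character units. Generating the output character by character allows the target vocabulary to contain every character, which eliminates the out-of-vocabulary (OOV) problem on the output side and shrinks the target vocabulary, so training and decoding become faster. In experiments, the proposed method outperformed conventional word-level NMT models on English-Japanese and Korean-Japanese machine translation.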

Encoding Dictionary Feature for Deep Learning-based Named Entity Recognition

  • Ronran, Chirawan;Unankard, Sayan;Lee, Seungwoo
    • International Journal of Contents / v.17 no.4 / pp.1-15 / 2021
  • Named entity recognition (NER) is a crucial task for NLP, which aims to extract information from texts. To build NER systems, deep learning (DL) models are learned with dictionary features by mapping each word in the dataset to dictionary features and generating a unique index. However, this technique might generate noisy labels, which pose significant challenges for the NER task. In this paper, we proposed DL-dictionary features, and evaluated them on two datasets, including the OntoNotes 5.0 dataset and our new infectious disease outbreak dataset named GFID. We used (1) a Bidirectional Long Short-Term Memory (BiLSTM) character and (2) pre-trained embedding to concatenate with (3) our proposed features, named the Convolutional Neural Network (CNN), BiLSTM, and self-attention dictionaries, respectively. The combined features (1-3) were fed through BiLSTM - Conditional Random Field (CRF) to predict named entity classes as outputs. We compared these outputs with other predictions of the BiLSTM character, pre-trained embedding, and dictionary features from previous research, which used the exact matching and partial matching dictionary technique. The findings showed that the model employing our dictionary features outperformed other models that used existing dictionary features. We also computed the F1 score with the GFID dataset to apply this technique to extract medical or healthcare information.

A Chromosome Encoding Method in A Genetic Algorithm for Path Finding in Game Map (게임 맵에서 길 찾기 해법을 위한 유전 알고리즘의 염색체 인코딩 방법)

  • Kang, Myung-Ju
    • Proceedings of the Korean Society of Computer Information Conference / 2009.01a / pp.189-192 / 2009
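  • In games, it is very important for the main character or an NPC (Non-Player Character) to find a path to its destination. A moving character must also avoid the various objects and walls it encounters and travel along the shortest path. This paper proposes using a genetic algorithm for character path finding on a game map. In particular, it proposes a chromosome encoding method, the component of the genetic algorithm that makes up the solution set. The proposed chromosome encoding represents the character's movement directions as a bit string. A character can move in eight directions from its current position, so one direction can be represented by a 3-bit binary string. A chromosome representing one solution is formed by allocating one 3-bit binary string for each node of the graph that represents the map.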

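The 3-bit direction encoding described in this abstract can be sketched in a few lines of Python. The abstract does not say which bit pattern maps to which direction, so the `DIRECTIONS` table, the function names, and the coordinate convention (y grows downward) are all assumptions made for illustration; fitness evaluation, crossover, and mutation are left out.

```python
import random

# Hypothetical mapping from a 3-bit code (0-7) to a grid move (dx, dy).
# Order assumed here: N, NE, E, SE, S, SW, W, NW, with y growing downward.
DIRECTIONS = [
    (0, -1), (1, -1), (1, 0), (1, 1),
    (0, 1), (-1, 1), (-1, 0), (-1, -1),
]


def random_chromosome(node_count, rng):
    """Build a random bit string with 3 bits per graph node."""
    return "".join(rng.choice("01") for _ in range(3 * node_count))


def decode_chromosome(chromosome):
    """Split the bit string into 3-bit genes and map each to a move."""
    if len(chromosome) % 3 != 0:
        raise ValueError("chromosome length must be a multiple of 3")
    return [
        DIRECTIONS[int(chromosome[i:i + 3], 2)]
        for i in range(0, len(chromosome), 3)
    ]


def walk(start, chromosome):
    """Return the sequence of positions visited when following the moves."""
    x, y = start
    path = [(x, y)]
    for dx, dy in decode_chromosome(chromosome):
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


chromosome = random_chromosome(node_count=5, rng=random.Random(0))
print(len(chromosome))            # 15 bits for a 5-node map
print(walk((0, 0), "000010"))     # one step N, then one step E
```

A fixed-width gene keeps crossover and mutation simple, since any 3-bit pattern decodes to a valid direction; whether a move hits a wall would have to be handled by the fitness function.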

Guided Sequence Generation using Trie-based Dictionary for ASR Error Correction (음성 인식 오류 수정을 위한 Trie 기반 사전을 이용한 Guided Sequence Generation)

  • Choi, Junhwi;Ryu, Seonghan;Yu, Hwanjo;Lee, Gary Geunbae
    • 한국어정보학회:학술대회논문집 / 2016.10a / pp.211-216 / 2016
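  • Although many current speech recognizers are generally highly accurate, speech recognition errors still occur frequently. Because these errors cause many malfunctions in related applications, they must be corrected. This paper proposes Guided Sequence Generation using a Trie-based dictionary. The proposed model encodes the target word and its context, then generates the word by decoding it character by character. To produce a correct word, generation is guided by the Trie-based dictionary. For the experiments, the model was trained on a corpus created by simply simulating speech recognition errors on a corpus from the English TV-guide domain, and its performance was evaluated on a parallel corpus of speech recognition sentences and results from the same domain. Guided generation reduced errors by about 14.9% compared with unguided generation.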

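The idea of guiding character-level generation with a trie can be sketched as follows. This is a minimal illustration, not the authors' model: the neural decoder is replaced by a hypothetical scoring function (`make_position_scorer`) that simply prefers the character found at the same position in the misrecognized word, and decoding is greedy.

```python
END = "<end>"  # multi-character key, so it cannot clash with a real character


def build_trie(words):
    """Build a nested-dict trie; END marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def make_position_scorer(noisy_word):
    """Hypothetical stand-in for a neural decoder's character scores."""
    def score(prefix, ch):
        position = len(prefix)
        if ch == END:
            return 1.0 if position >= len(noisy_word) else 0.0
        if position < len(noisy_word) and noisy_word[position] == ch:
            return 1.0
        return 0.0
    return score


def guided_generate(trie, score, max_len=32):
    """Greedy decoding restricted to characters the trie allows next."""
    node, output = trie, []
    for _ in range(max_len):
        allowed = list(node)          # children of the current trie node
        if not allowed:
            break
        best = max(allowed, key=lambda ch: score("".join(output), ch))
        if best == END:
            break
        output.append(best)
        node = node[best]
    return "".join(output)


trie = build_trie(["record", "recall", "report"])
print(guided_generate(trie, make_position_scorer("recrod")))  # record
```

Because every step is limited to the trie's children, the output is always a prefix of some dictionary word; a learned decoder would replace the position-based scores, and beam search could replace the greedy choice.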