Encoding and language detection of text document using Deep learning algorithm

Kim, Seonbeom; Bae, Junwoo; Park, Heejin

The Journal of Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회논문지)

Volume 13, Issue 5 / Pages 124-130 / 2017 / pISSN 1975-681X

Korean Institute of Next Generation Computing (한국차세대컴퓨팅학회)

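Kim, Seonbeom (Department of Computer Software, Hanyang University);
Bae, Junwoo (Department of Electronics, Communications and Computer Engineering, Hanyang University);
Park, Heejin (Department of Computer Software, Hanyang University)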

Received : 2017.10.09
Accepted : 2017.10.23
Published : 2017.10.31

Abstract

Character encoding is the method used to represent characters and symbols on a computer, and many software tools exist to detect it. The widely used encoding detector "uchardet" identifies the encoding of unmodified, normal text documents with 91.39% accuracy, but its language detection accuracy is only 32.09%. When a text document is encrypted with a substitution cipher, its accuracy falls to 3.55% for encoding detection and 0.06% for language detection. In this paper, we therefore propose a method for detecting the encoding and language of text documents using LSTM (Long Short-Term Memory), a deep learning algorithm, and show that it outperforms "uchardet". For normal text documents, the proposed method achieves 99.89% accuracy in encoding detection and 99.92% in language detection. For documents encrypted with a substitution cipher, it achieves 99.26% accuracy in encoding detection and 99.77% in language detection.
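The abstract does not specify the substitution scheme, so the following is only a minimal sketch under an assumption: a byte-level monoalphabetic substitution, in which every byte value is replaced by another through a fixed one-to-one table. The function names (make_substitution_table, substitute, repetition_pattern) and the sample text are hypothetical and do not come from the paper. The sketch illustrates one plausible reading of the reported results: substitution changes the byte values that rule-based or frequency-table detectors are keyed to, while the repetition structure of the sequence, which a sequence model could in principle learn from, is left intact.

```python
import random
from collections import Counter


def make_substitution_table(seed):
    """Build a random one-to-one mapping over all 256 byte values.

    Assumption: the cipher is a byte-level monoalphabetic substitution.
    """
    values = list(range(256))
    random.Random(seed).shuffle(values)
    return bytes(values)


def substitute(data, table):
    """Replace every byte of data with its image under the table."""
    return data.translate(table)


def repetition_pattern(data):
    """Label each byte by the order in which its value first appears.

    Two byte strings related by a one-to-one substitution share this pattern.
    """
    first_seen = {}
    return [first_seen.setdefault(b, len(first_seen)) for b in data]


# Hypothetical sample: the Korean title of the paper, encoded as UTF-8.
sample = "딥러닝 알고리즘을 이용한 문서의 인코딩 및 언어 판별".encode("utf-8")
table = make_substitution_table(seed=2017)
cipher = substitute(sample, table)

# The byte values differ, so checks tied to specific byte ranges no longer apply.
print(cipher != sample)
# The order of repeats and the multiset of byte counts are unchanged.
print(repetition_pattern(cipher) == repetition_pattern(sample))
print(sorted(Counter(cipher).values()) == sorted(Counter(sample).values()))
```

Because the table is a permutation, the cipher text has the same length as the input and can be inverted with the inverse table. Whether the paper's LSTM actually exploits this preserved structure is not stated in the abstract.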
