• Title/Abstract/Keywords: Hangul Document


EXTRACTION OF CHARACTERS FROM THE QUADTREE ENCODE DOCUMENT IMAGE OF HANGUL (쿼드트리로 구성된 한글 문서 영상에서의 문자추출에 관한 연구)

  • Park, Eun-Kyoung;Cho, Dong-Sub
    • Proceedings of the KIEE Conference / 1991.11a / pp.201-204 / 1991
  • This paper describes a method for representing a document image with a quadtree data structure and for extracting each character separately from the constructed quadtree. The document image is represented by a binary-encoded quadtree, and segmentation is performed using the information in each leaf node of the quadtree. Each character is then extracted from the positional relations among the segments. This method makes it possible to extract characters without examining every pixel in the image, and it reduces the storage required for the document image.
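
The quadtree representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes a square binary image whose side is a power of two, with 1 for ink and 0 for background; the names `Node`, `build`, and `black_leaves` are hypothetical and are not taken from the paper. It builds a generic region quadtree, not the authors' specific encoding.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    x: int                      # left column of the block
    y: int                      # top row of the block
    size: int                   # side length of the block
    color: Optional[int]        # 0 or 1 for a uniform leaf, None for an internal node
    children: Optional[List["Node"]] = None


def build(img, x=0, y=0, size=None):
    """Recursively build a region quadtree over a square 2^k x 2^k binary image."""
    if size is None:
        size = len(img)
    first = img[y][x]
    uniform = all(img[r][c] == first
                  for r in range(y, y + size)
                  for c in range(x, x + size))
    if uniform:
        return Node(x, y, size, first)
    half = size // 2
    # Child order (assumed): top-left, top-right, bottom-left, bottom-right.
    children = [
        build(img, x, y, half),
        build(img, x + half, y, half),
        build(img, x, y + half, half),
        build(img, x + half, y + half, half),
    ]
    return Node(x, y, size, None, children)


def black_leaves(node) -> List[Tuple[int, int, int]]:
    """Collect (x, y, size) for every ink leaf, visiting blocks rather than pixels."""
    if node.children is None:
        return [(node.x, node.y, node.size)] if node.color == 1 else []
    found = []
    for child in node.children:
        found.extend(black_leaves(child))
    return found
```

Once the tree exists, a segmentation step can work from the ink blocks returned by `black_leaves`, for example by grouping blocks whose positions are adjacent, so large uniform regions are handled as single leaves.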


Study on Methods of Digitalization of Older Books Using PDF (PDF를 활용한 고문헌의 원문디지털화 방안에 대한 고찰)

  • Lee, Sang-Yong
    • Journal of the Korean Society for Library and Information Science / v.34 no.1 / pp.133-153 / 2000
  • This article studies methods of digitizing older books using PDF (Portable Document Format) as supported by Acrobat 4.0, which was introduced in April 1999. Acrobat 3.0 had many problems supporting the Korean language, or Hangul. The revised 4.0 version, however, made conversion of Korean, Japanese, and Chinese text possible through its support for multi-language fonts. It is therefore possible to convert and edit the text files of older books written in Hangul. The Acrobat Reader, the viewer for PDF, can be downloaded for free from its website. The text of older books digitized as PDF still has some problems, but users can easily retrieve it from the Internet.


Feature Selection for a Hangul Text Document Classification System (한글 텍스트 문서 분류시스템을 위한 속성선택)

  • Lee, Jae-Sik;Cho, You-Jung
    • Proceedings of the Korea Intelligent Information System Society Conference / 2003.05a / pp.435-442 / 2003
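  • An information retrieval system is a tool that helps users find the information they need within vast amounts of information. It must deliver the information a user requests more accurately, more effectively, and more efficiently. To do so, it is essential to select, from the countless features in a document, only those that reflect the document's characteristics well, and to use them appropriately. This study therefore focuses on improving both the accuracy and the speed of an existing Hangul document classification system (CB_TFIDF)[1]. Three well-known feature selection techniques previously applied to English text document classification systems, namely Information Gain, Odds Ratio, and Document Frequency Thresholding, are used to build selective case bases, which are then applied to the Hangul text document classification system to compare and evaluate performance. Based on the results, the study aims to present the feature selection technique best suited to Hangul document classification, along with guidelines for feature selection.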
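
Two of the techniques named in this abstract, Document Frequency Thresholding and Information Gain, can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the CB_TFIDF system: documents are assumed to be pre-tokenized lists of terms, and the function names and the `min_df` and `top_k` parameters are hypothetical. Odds Ratio is omitted for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def document_frequency(docs):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, min_df=2, top_k=2):
    """Drop rare terms by document frequency, then keep the top_k terms by information gain."""
    df = document_frequency(docs)
    candidates = [t for t, count in df.items() if count >= min_df]
    ranked = sorted(candidates, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:top_k]
```

With a toy corpus in which "주식" appears only in economy documents and "시장" appears equally in both classes, `information_gain` scores the former at 1 bit and the latter at 0, so only the former survives selection.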


An Adaptive Binarization Algorithm for Degraded Document Images (저화질 문서영상들을 위한 적응적 이진화 알고리즘)

  • Ju, Jae-Hyon;Oh, Jeong-Su
    • The Journal of Korean Institute of Communications and Information Sciences / v.37 no.7A / pp.581-585 / 2012
  • This paper proposes an adaptive binarization algorithm that is highly effective for degraded document images containing printed Hangul and Chinese characters. Because these characters are composed of thin horizontal strokes and thick vertical strokes, conventional algorithms cannot easily extract the horizontal strokes, which have weaker components than the vertical ones in a degraded document image. The proposed algorithm solves this problem by adding a vertical-directional reference adaptive binarization algorithm to an omni-directional reference one. The simulation results show that the proposed algorithm extracts characters well from various degraded document images.
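
The idea of combining an omni-directional reference with a vertical-directional one can be illustrated with a generic local-mean threshold. This is not the paper's algorithm: the window shapes, the `radius` and `offset` parameters, the OR combination, and the function names are all assumptions made for illustration. Gray levels are assumed to be 0 (dark) to 255 (bright).

```python
def local_mean(img, r, c, dr, dc):
    """Mean gray level in a window of half-height dr and half-width dc, clipped to the image."""
    h, w = len(img), len(img[0])
    rows = range(max(0, r - dr), min(h, r + dr + 1))
    cols = range(max(0, c - dc), min(w, c + dc + 1))
    values = [img[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


def binarize(img, radius=2, offset=10, use_vertical=True):
    """Mark a pixel as ink (1) when it is darker than a local reference by `offset`.

    The omni-directional reference is the mean over a square window.
    The vertical reference (assumed form) is the mean over the pixel's own column only.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            omni = local_mean(img, r, c, radius, radius)
            is_ink = img[r][c] < omni - offset
            if use_vertical and not is_ink:
                vert = local_mean(img, r, c, radius, 0)
                is_ink = img[r][c] < vert - offset
            out[r][c] = 1 if is_ink else 0
    return out
```

In this sketch, a faint horizontal stroke next to a dark vertical stroke can be missed by the square window, because the dark neighbor pulls the local mean down. The column-only reference excludes that neighbor, so the faint pixel still stands out against the background above and below it.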

Design and Implementation of an Electronic Approval System for Intranet in Multi-Server Environment (멀티서버 환경에서 인트라넷용 전자결재시스템 설계 및 구현)

  • 박창서;고형화
    • Journal of the Korean Institute of Telematics and Electronics C / v.36C no.11 / pp.1-9 / 1999
  • As our society moves from the industrial age into the information age, the Ministry of Information and Communication has set up functional software standards for electronic approval systems. Several software houses have developed such systems in the client/server environment and subsequently for the intranet. Although electronic approval systems for the intranet have the advantages of less costly implementation and ease of use, they create heavy network traffic and have poor document processing functionality because web environments lack a document processor. This paper describes a system design in which web browsers use client resources by adopting ActiveX technology in order to address the problems mentioned above. Specifically, to use the Hangul word processor as the document processor, an ActiveX control and the Hangul DDE API were implemented in the form of a DDE server/client capable of mutual communication, and the flow of the electronic approval system is controlled through this connection. When the implemented system was run for three months at a real company in a multi-server environment, it showed high usage of the electronic approval system, with the rate reaching 75%-93% in some departments.


Efficient Hangul Word Processor (HWP) Malware Detection Using Semi-Supervised Learning with Augmented Data Utility Valuation (효율적인 HWP 악성코드 탐지를 위한 데이터 유용성 검증 및 확보 기반 준지도학습 기법)

  • JinHyuk Son;Gihyuk Ko;Ho-Mook Cho;Young-Kuk Kim
    • Journal of the Korea Institute of Information Security & Cryptology / v.34 no.1 / pp.71-82 / 2024
  • With the advancement of information and communication technology (ICT), the use of electronic document types such as PDF, MS Office, and HWP files has increased. This trend has led cyber attackers to increasingly try to spread malicious documents through e-mail and messengers. To counter such attacks, AI-based methodologies have been actively employed to detect malicious document files. The main challenge in detecting malicious HWP (Hangul Word Processor) files is the lack of quality datasets, because HWP usage is limited to Korea, whereas PDF and MS Office files are widely used worldwide. To address this limitation, data augmentation has been proposed to diversify training data by transforming existing datasets, but because the usefulness of the augmented data is not evaluated, the augmented data could end up harming the model's performance. In this paper, we propose an effective semi-supervised learning technique for detecting malicious HWP document files, which improves overall AI model performance by quantifying the utility of augmented data and filtering out useless training data.

Document Classification using Recurrent Neural Network with Word Sense and Contexts (단어의 의미와 문맥을 고려한 순환신경망 기반의 문서 분류)

  • Joo, Jong-Min;Kim, Nam-Hun;Yang, Hyung-Jeong;Park, Hyuck-Ro
    • KIPS Transactions on Software and Data Engineering / v.7 no.7 / pp.259-266 / 2018
  • In this paper, we propose a method for classifying documents with a Recurrent Neural Network (RNN) by extracting features that consider word sense and context. The Word2vec method is adopted to capture the order and meaning of words by representing each word in the document as a vector. Doc2vec is applied to take context into account when extracting document features. An RNN classifier, which feeds the output of the previous node into the next node, is used for document classification. The RNN classifier performs well for document classification because, among neural network classifiers, it is well suited to sequence data. We applied the GRU (Gated Recurrent Unit) model, which addresses the vanishing gradient problem of RNNs and also reduces computation time. We used one Hangul document set and two English document sets in the experiments, and the GRU-based document classifier improved performance by about 3.5% over a CNN-based document classifier.

Printed Hangul Recognition with Adaptive Hierarchical Structures Depending on 6-Types (6-유형 별로 적응적 계층 구조를 갖는 인쇄 한글 인식)

  • Ham, Dae-Sung;Lee, Duk-Ryong;Choi, Kyung-Ung;Oh, Il-Seok
    • The Journal of the Korea Contents Association / v.10 no.1 / pp.10-18 / 2010
  • Because Hangul character recognition involves a large number of classes, a six-type preclassification stage is commonly used. After preclassification, the first consonant, vowel, and last consonant can be classified separately. Although each of the three components has only a few classes, classification errors often occur because of shape similarity, as with 'ㅔ' and 'ㅖ'. This paper therefore proposes a hierarchical recognition method that adopts multi-stage tree structures for each of the six types. In addition, to reduce interference among the three components, the method uses the recognition results of the first consonant and vowel as features of the vowel classifier. The recognition accuracy on the test set of the PHD08 database was 98.96%.
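
For readers unfamiliar with the three components and the six types, the following sketch decomposes a precomposed Hangul syllable using the Unicode arithmetic (base U+AC00, 21 vowels, 28 final-consonant slots) and assigns a type from the vowel shape and the presence of a last consonant. The type numbering (1 to 3 without a last consonant, 4 to 6 with one) is an assumed convention, and the function names are hypothetical. The paper itself classifies character images, which this sketch does not attempt.

```python
HANGUL_BASE = 0xAC00
VOWEL_COUNT = 21
FINAL_COUNT = 28   # includes index 0, meaning "no last consonant"

# Vowel indices in Unicode order: ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ
VERTICAL = {0, 1, 2, 3, 4, 5, 6, 7, 20}    # ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅣ
HORIZONTAL = {8, 12, 13, 17, 18}           # ㅗ ㅛ ㅜ ㅠ ㅡ
# Remaining indices (ㅘ ㅙ ㅚ ㅝ ㅞ ㅟ ㅢ) combine both shapes.


def decompose(syllable):
    """Return (first consonant, vowel, last consonant) indices of a Hangul syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 19 * VOWEL_COUNT * FINAL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    first = index // (VOWEL_COUNT * FINAL_COUNT)
    vowel = (index % (VOWEL_COUNT * FINAL_COUNT)) // FINAL_COUNT
    last = index % FINAL_COUNT
    return first, vowel, last


def six_type(syllable):
    """Assign one of six structural types (numbering is an assumed convention)."""
    _, vowel, last = decompose(syllable)
    if vowel in VERTICAL:
        shape = 1
    elif vowel in HORIZONTAL:
        shape = 2
    else:
        shape = 3
    return shape if last == 0 else shape + 3
```

For example, `six_type` places 가, 고, and 과 in types 1 to 3 and 강, 공, and 광 in types 4 to 6, which is the kind of coarse grouping a preclassification stage would produce before component classifiers run.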

An SGML Document Authoring Tool (SGML 문서 저작 도구)

  • An, Bo-Hui;Yu, Jae-U;Song, Hu-Bong
    • The Transactions of the Korea Information Processing Society / v.6 no.2 / pp.512-521 / 1999
  • SGML, defined in ISO 8879, is a meta-language for defining document types and is used as a basic format for electronic documents. Since an SGML document is composed of a document type definition and a document instance that conforms to the definition, SGML document authoring tools need to compose and validate both the document type and the document instance. At present, formal models and procedures for SGML documents are not defined, so it is not easy to construct such tools. We propose a model of an SGML authoring tool consisting of an SGML parser, a document type definition editor, an SGML document editor, and a style editor. We also introduce and implement a formal procedure for each component. For user convenience, we adopted an icon-based visual programming method and solved the Hangul problems. The SGML authoring tool is implemented on a Windows NT system using the Java and C++ programming languages.


Construction of Printed Hangul Character Database PHD08 (한글 문자 데이터베이스 PHD08 구축)

  • Ham, Dae-Sung;Lee, Duk-Ryong;Jung, In-Suk;Oh, Il-Seok
    • The Journal of the Korea Contents Association / v.8 no.11 / pp.33-40 / 2008
  • OCR applications are moving from traditional formatted documents to web documents and natural scene images. These new applications typically use not only the standard Myungjo and Gothic fonts but also a variety of other fonts. Conventional databases, which have mainly been constructed with standard fonts, are of limited use for the new applications. In this paper, we generate 243 image samples for each of 2350 Hangul character classes, differing in font size, quality, and resolution. In addition, each sample was varied by binarization threshold and rotational transformation. Through this process, 2187 samples were generated for each character class. In total, 5,139,450 samples constitute the printed Hangul character database called PHD08. We also present its characteristics and the recognition performance of a commercial OCR software package.
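
The sample counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The variable names are hypothetical. The 9x factor is only what the two per-class figures imply, and how it splits between binarization threshold and rotation is not stated in the abstract.

```python
# Figures quoted in the abstract.
CLASSES = 2350
BASE_SAMPLES_PER_CLASS = 243     # before threshold and rotation variation
FINAL_SAMPLES_PER_CLASS = 2187   # after variation

# Factor by which threshold and rotation variation multiplies each base sample.
variation_factor = FINAL_SAMPLES_PER_CLASS // BASE_SAMPLES_PER_CLASS

# Total database size implied by the per-class count.
total_samples = CLASSES * FINAL_SAMPLES_PER_CLASS

print(variation_factor, total_samples)
```

The product of the class count and the per-class count matches the 5,139,450 total given in the abstract.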