• Title/Summary/Keyword: Document Structure Recognition

Search results: 133 (processing time: 0.022 seconds)

Font Classification using NMF and EMD (NMF와 EMD를 이용한 영문자 활자체 폰트분류)

  • Lee, Chang-Woo;Kang, Hyun;Jung, Kee-Chul;Kim, Hang-Joon
    • Proceedings of the Korean Information Science Society Conference / 2004.04b / pp.688-690 / 2004
  • Recently, many studies on document structure analysis and automatic document classification have been published, aiming at efficient management and retrieval of digitized document images. This paper proposes a method for automatically classifying fonts using the NMF (non-negative matrix factorization) algorithm. Based on the assumption that the distinguishing features of a font can be represented by spatially local parts, the method learns those parts from a set of font images and uses them as features for classification. Templates are built from the learned features with a hierarchical clustering algorithm, and a test pattern is classified by its EMD (earth mover's distance) to the template patterns. The experimental results examine the spatially local features of the font images and show their suitability for font identification. Used as a preprocessor for existing character recognition and document retrieval systems, the proposed method is expected to improve their performance.

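The part-based classification idea in the abstract above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `nmf` is a plain multiplicative-update factorization, `emd_1d` computes the earth mover's distance between 1-D distributions, and the hierarchical template clustering step is omitted.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0):
    """Multiplicative-update NMF: factor non-negative V (m x n) into W @ H."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update part activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update basis parts
    return W, H

def emd_1d(p, q):
    """Earth mover's distance between two 1-D distributions of equal mass."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return float(np.abs(np.cumsum(p - q)).sum())

# Columns of V would be flattened font images; the columns of H are their
# part-activation profiles, which can be matched to templates via emd_1d.
```

A test pattern would be assigned to the font template with the smallest `emd_1d` to its activation profile.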

Recognition of Various Printed Hangul Images by using the Boundary Tracing Technique (경계선 기울기 방법을 이용한 다양한 인쇄체 한글의 인식)

  • Baek, Seung-Bok;Kang, Soon-Dae;Sohn, Young-Sun
    • Journal of the Korean Institute of Intelligent Systems / v.13 no.1 / pp.1-5 / 2003
  • In this paper, we realized a system that converts character images of the printed Korean alphabet (Hangul) into editable text documents by using a black-and-white CCD camera. We extracted the contour information of each character, which reflects its structural properties, by using a boundary tracing technique that is robust to noise in character recognition. Using the contour information, we recognized the horizontal and vertical vowels of the character image and classified the character into one of six patterns. The character is then divided into consonant and vowel units. The vowels are recognized by the maximum-length projection. The separated consonants are recognized by comparing the input pattern with a standard pattern that holds the phase information of boundary-line changes. The recognized characters are entered into a word editor as editable KS Hangul completion-type code.

A study on RDM algorithm for document image and application to digital signature (문서화상에 대한 RDM 합성 알고리즘 및 디지틀 서명에의 응용)

  • 박일남;이대영
    • The Journal of Korean Institute of Communications and Information Sciences / v.21 no.12 / pp.3056-3068 / 1996
  • This paper presents the RDM algorithm for bit composition and then proposes a digital signature scheme for facsimile documents based on it. We modify the even-odd feature of the distance between changing pels on the coding line and on multiple reference lines that have been scanned before, as well as the run length on the coding line. The time taken for signing is reduced by spreading the signature. Non-repudiation of origin, the third condition of a digital signature, is realized by the proposed scheme. The transmitter embeds the signature secretly and transfers the document, and the receiver checks the signature and the document for any forgery. The scheme is compatible with ITU-T T.4 (the G3 and G4 facsimile standards). The total amount of data transmitted and the image quality remain about the same as for the original document, so a third party does not notice the signature embedded in it.

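As a loose illustration of the parity idea (not the RDM algorithm itself, whose coding details follow the T.4 run-length format), signature bits can be hidden in the even/odd parity of run lengths, moving one pixel to the neighboring run so the total line length is preserved. The function names and the one-bit-per-run rule below are my own simplification:

```python
def rle(bits):
    """Run-length encode a binary scan line into (value, length) runs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

def embed_bits(runs, sig):
    """Force the parity of run k to equal sig[k] (even=0, odd=1),
    moving one pixel to the next run so the total length is unchanged.
    Assumes the runs are long enough that none shrinks to zero."""
    runs = [list(r) for r in runs]
    for k, bit in enumerate(sig):
        if k + 1 >= len(runs):
            break
        if runs[k][1] % 2 != bit:
            runs[k][1] += 1
            runs[k + 1][1] -= 1
    return [tuple(r) for r in runs]

def extract_bits(runs, n):
    """Read the signature back from the run-length parities."""
    return [runs[k][1] % 2 for k in range(n)]
```

The receiver re-derives the runs from the decoded line and checks their parities against the expected signature; tampering with a marked run flips its parity.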

A Study on the Integration of Information Extraction Technology for Detecting Scientific Core Entities based on Large Resources (대용량 자원 기반 과학기술 핵심개체 탐지를 위한 정보추출기술 통합에 관한 연구)

  • Choi, Yun-Soo;Cheong, Chang-Hoo;Choi, Sung-Pil;You, Beom-Jong;Kim, Jae-Hoon
    • Journal of Information Management / v.40 no.4 / pp.1-22 / 2009
  • Large-scale information extraction plays an important role in advanced information retrieval as well as in question answering and summarization. Information extraction can be defined as a process of converting unstructured documents into formalized, tabular information; it consists of named-entity recognition, terminology extraction, coreference resolution, and relation extraction. Since these elementary technologies have so far been studied independently, integrating all the necessary processes of information extraction is not trivial because of the diversity of their input/output formats and operating environments. As a result, it is difficult to process scientific documents so as to extract both named entities and technical terms at once. In this study, we define scientific core entities as a set of 10 types of named entities and technical terminologies in the biomedical domain. In order to extract these entities from scientific documents automatically and at once, we develop a framework for scientific core entity extraction that embraces all the pivotal language processors: a named-entity recognizer, a coreference resolver, and a terminology extractor. Each module of the integrated system has been evaluated on various corpora as well as on KEEC 2009. The system will be utilized in information service areas such as information retrieval, question answering (Q&A), document indexing, and dictionary construction.

Implementation of a Journal's Table of Contents Separation System based on Contents Analysis (내용분석을 통한 논문지의 목차분류 시스템의 구현)

  • Kwon, Young-Bin
    • The KIPS Transactions:PartB / v.14B no.7 / pp.481-492 / 2007
  • In this paper, a method for automatically indexing tables of contents is considered, to reduce the effort of entering paper information and building an index. Existing document analysis methods cannot efficiently analyze the varied table-of-contents formats of journals because these formats have many exceptions. In this paper, various contents formats for journals, whose features differ from those of general documents, are analyzed and described. The principal elements to be extracted are the title, authors, and pages of each paper. These three elements are modeled according to the order of their arrangement, and their features are extracted. A table-of-contents recognition system for journals is then implemented based on the proposed modeling method. An exact extraction rate of 91.5% for the title, author, and page fields is obtained on 660 papers from various journals.
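As a toy version of the title/author/page modeling described above, one fixed arrangement can be matched with a regular expression. The slash separator and dot-leader layout here are hypothetical, standing in for one of the per-journal arrangement models the paper learns:

```python
import re

# Hypothetical contents-line layout: "<title> / <authors> .... <first>-<last>"
TOC_LINE = re.compile(
    r"^(?P<title>.+?)\s*/\s*(?P<authors>.+?)\s*\.{2,}\s*(?P<pages>\d+-\d+)$"
)

def parse_toc_line(line):
    """Split one contents line into (title, author list, page range)."""
    m = TOC_LINE.match(line.strip())
    if m is None:
        return None  # line does not follow this arrangement model
    authors = [a.strip() for a in m.group("authors").split(",")]
    return m.group("title"), authors, m.group("pages")
```

A real system would try several such models and keep the one that matches the most lines of the scanned contents page.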

A Proposal On Digital Signature For FAX Document Using DM Algorithm (FAX 문서에 대한 DM 합성 알고리즘을 이용한 디지털 서명의 제안)

  • 박일남;이대영
    • Journal of the Korea Institute of Information Security & Cryptology / v.7 no.2 / pp.55-72 / 1997
  • This paper presents a digital signature scheme for facsimile documents that embeds a signature directly into the document. We use multiple reference lines that have been scanned just before and modify the distance between changing pels on the reference line specified by a key and on the coding line with a single bit of the signature data. The time taken for signing is reduced by spreading the signature. Non-repudiation of origin, the third condition of a digital signature, is realized by the proposed scheme. The transmitter embeds the signature secretly and transfers the document, and the receiver checks the signature and the document for any forgery. The scheme is compatible with ITU-T T.4 (the CCITT G3 and G4 facsimile standards). The total amount of data transmitted and the image quality remain about the same as for the original document, so a third party does not notice that a signature is embedded in it.

A Knowledge-based Wrapper Learning Agent for Semi-Structured Information Sources (준구조화된 정보소스에 대한 지식기반의 Wrapper 학습 에이전트)

  • Seo, Hee-Kyoung;Yang, Jae-Young;Choi, Joong-Min
    • Journal of KIISE:Software and Applications / v.29 no.1_2 / pp.42-52 / 2002
  • Information extraction (IE) is the process of recognizing and fetching particular information fragments from a document. In previous work, most IE systems generated their extraction rules, called wrappers, manually; although manual wrapper generation may achieve more accurate extraction, it has problems with flexibility, extensibility, and efficiency. Other research that employs automatic wrapper generation also has difficulty acquiring and representing useful domain knowledge and coping with the structural heterogeneity among information sources, and as a result, real-world information sources with complex document structures cannot be correctly analyzed. To resolve these problems, this paper presents an agent-based information extraction system named XTROS that exploits domain knowledge to learn from documents in a semi-structured information source. The system generates a wrapper for each information source automatically and performs information extraction and integration by applying the wrapper to the corresponding source. In XTROS, both the domain knowledge and the wrapper are represented as XML documents. The wrapper generation algorithm first recognizes the meaning of each logical line of a sample document by using the domain knowledge, and then finds the most frequent pattern in the sequence of semantic representations of the logical lines. The location and structure of this pattern, represented as an XML document, become the wrapper. By testing XTROS on several real-estate information sites, we show that it creates correct wrappers for most Web sources and consequently facilitates effective information extraction and integration for heterogeneous and complex information sources.
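The "most frequent pattern over labeled logical lines" step can be sketched independently of XTROS's XML representation. In this sketch the semantic labels are plain strings, and ties are broken in favor of the longer pattern; both choices are my own, not details from the paper:

```python
from collections import Counter

def most_frequent_pattern(labels, min_len=2, max_len=4):
    """Return the most frequent contiguous label sequence (an n-gram) in
    the line-label sequence, preferring longer patterns when counts tie.
    Assumes len(labels) >= min_len."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(labels) - n + 1):
            counts[tuple(labels[i:i + n])] += 1
    return max(counts, key=lambda p: (counts[p], len(p)))
```

The recurring pattern found this way plays the role of the wrapper's record structure: each occurrence in a new page marks one extractable record.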

A System for the Decomposition of Text Block into Words (텍스트 영역에 대한 단어 단위 분할 시스템)

  • Jeong, Chang-Boo;Kwag, Hee-Kue;Jeong, Seon-Hwa;Kim, Soo-Hyung
    • Proceedings of the Korea Information Processing Society Conference / 2000.10a / pp.293-296 / 2000
  • This paper proposes a word-level segmentation system for use in a keyword-recognition-based document image retrieval and indexing system. Taking as input the text regions extracted through image preprocessing and document structure analysis, the proposed system performs word segmentation with a hierarchical approach: each text region is first split into text lines, and each text line is then split into words. Text-line segmentation finds split points by applying a horizontal projection profile. For word segmentation, connected components are extracted, the gaps between them are measured, and a gap clustering technique determines the word boundaries. Special symbols that degrade word segmentation are detected using heuristic information. Applied to 50 text regions, the proposed system achieved an accuracy of 99.83%.

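The gap-clustering step above can be sketched with 1-D bounding boxes. Splitting the sorted gap sizes at their largest jump is one simple stand-in for the paper's clustering technique; the special-symbol heuristics are omitted:

```python
def split_words(boxes):
    """boxes: (x_left, x_right) spans of connected components, sorted by x.
    Cluster the inter-component gaps into intra-word and inter-word groups
    by splitting the sorted gap sizes at their largest jump."""
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    if not gaps:
        return [list(boxes)]
    s = sorted(gaps)
    jumps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    if not jumps or max(jumps)[0] == 0:
        return [list(boxes)]  # uniform gaps: treat the line as one word
    _, i = max(jumps)
    threshold = (s[i] + s[i + 1]) / 2
    words, current = [], [boxes[0]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > threshold:       # inter-word gap: start a new word
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words
```

On real pages the components would come from a connected-component labeling pass over the binarized text line.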

A Study on Markup Language Using DOM Form Design (DOM형식 설계를 이용한 마크업 언어연구)

  • Lee, Don-Yang;Choi, Han-Yong
    • Proceedings of the Korea Information Processing Society Conference / 2005.11a / pp.341-344 / 2005
  • The DOM is fundamentally a structural representation of an XML document. It models an XML document as a tree of nodes, where the nodes are objects that can be operated on; each element is a node, and a node may form a subtree. In this paper, as a basic use of generating an XML schema from a DOM tree, all node elements of a user-defined simple-type DOM tree design are defined as elements in the IXMLDOMElement form, so that the attributes of unit elements within a class and the class relations within the model can be expressed. In generating the markup language, the XML schema allows detailed data types to be declared.

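The node-by-node tree construction described above can be illustrated with Python's standard-library DOM (`xml.dom.minidom`) in place of the IXMLDOMElement interface the paper names; the `GradeType` user-defined simple type below is a made-up example:

```python
from xml.dom.minidom import getDOMImplementation

XS = "http://www.w3.org/2001/XMLSchema"

# Build an XML Schema fragment as a DOM tree: every schema construct is
# an element node appended under its parent, forming a subtree.
impl = getDOMImplementation()
doc = impl.createDocument(XS, "xs:schema", None)
schema = doc.documentElement
schema.setAttribute("xmlns:xs", XS)

simple = doc.createElement("xs:simpleType")
simple.setAttribute("name", "GradeType")          # hypothetical type name
restriction = doc.createElement("xs:restriction")
restriction.setAttribute("base", "xs:string")
for value in ("A", "B", "C"):
    enum = doc.createElement("xs:enumeration")    # one leaf node per value
    enum.setAttribute("value", value)
    restriction.appendChild(enum)
simple.appendChild(restriction)
schema.appendChild(simple)

xml_text = doc.toxml()
```

Serializing the tree with `toxml()` yields a schema fragment in which the user-defined simple type restricts `xs:string` to the enumerated values.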

Design of an Interface for Explicit Free-form Annotation Creation (명확한 free-form annotation 생성을 위한 인터페이스 설계)

  • 손원성;김재경;최윤철;임순범
    • Proceedings of the Korean Information Science Society Conference / 2002.10d / pp.139-141 / 2002
  • To create accurate annotation information in a free-form annotation environment, the ambiguity that arises when analyzing the relation between the geometry of a free-form marking and the annotated part must be recognized and resolved. This paper therefore first analyzes the ambiguities that can occur between free-form markings and various contexts in an XML-based annotation environment, and proposes an annotation correction technique to resolve them. The proposed technique is based on contexts that include various textual and document-structure relations between a free-form marking and the annotated part, and the results are rendered and exchanged through the annotation system implemented in this work. Free-form marking information created with the proposed technique covers the annotated part intended by the user better than existing techniques, and thus guarantees unambiguous exchange results across multiple users and differing document environments.
