• Title/Summary/Keyword: 문서지

A Skew Correction for Document Images by the Extraction of Blank Lines (공백행 추출에 의한 기울어진 문서 영상의 보정)

  • 정재영;김문현
    • Proceedings of the Korean Information Science Society Conference / 1998.10c / pp.541-543 / 1998
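  • This paper proposes a simple yet effective algorithm for detecting the skew of a linearly skewed document image. It relies on the fact that a blank line of roughly constant thickness lies between any two adjacent text lines, and that the slope of this blank line reflects the skew of the actual document. First, simple morphological operations separate the text-line regions from the blank-line regions. The result is then sampled vertically at regular intervals to find the center points (inter-line points) of all blank lines on each vertical line. Over the whole image, the distribution of slopes computed between arbitrary pairs of inter-line points lying on the same blank line peaks at the skew of the actual document. Applied to various kinds of horizontally written documents (maximum detectable skew: $\pm 45^{\circ}$), the proposed algorithm is shown to give accurate results within an error of $0.5^{\circ}$.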
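
The voting step described in this abstract can be sketched roughly as follows. This is an illustration, not the authors' implementation. It assumes the morphological step has already produced a binary mask (1 for text-line pixels, 0 for blank pixels). The function names, the sampling step of 8 columns, the 0.5-degree bin width, and the rule that pairs each inter-line point with the nearest one in the next sampled column are all assumptions made for this example. The nearest-neighbor rule only holds when the skew is small relative to the line spacing.

```python
import math
from collections import Counter

def blank_midpoints(mask, x):
    """Row midpoints of the blank runs enclosed by text in column x."""
    mids, start, seen_text = [], None, False
    for y, row in enumerate(mask):
        if row[x]:                        # text pixel closes any open blank run
            if start is not None and seen_text:
                mids.append((start + y - 1) / 2.0)
            start, seen_text = None, True
        elif start is None:               # first blank pixel of a new run
            start = y
    return mids

def estimate_skew(mask, step=8, bin_deg=0.5):
    """Vote over slopes between inter-line points in neighboring sampled columns."""
    votes = Counter()
    cols = list(range(0, len(mask[0]), step))
    for x1, x2 in zip(cols, cols[1:]):
        right = blank_midpoints(mask, x2)
        if not right:
            continue
        for y1 in blank_midpoints(mask, x1):
            # Assumption: the nearest midpoint lies on the same blank line.
            y2 = min(right, key=lambda y: abs(y - y1))
            angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
            votes[round(angle / bin_deg)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0] * bin_deg   # center of the fullest bin
```

Rows are indexed top to bottom, so a positive angle here means the text lines descend toward the right. Taking the fullest histogram bin rather than the mean lets a few mismatched point pairs be outvoted.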

A Syntax-Directed XML Document Editor using Abstract Syntax Tree (추상구문트리를 이용한 구문지향 XML 문서 편집기)

  • Kim Young-Chul;You Do Kyu
    • Journal of Internet Computing and Services / v.6 no.2 / pp.117-126 / 2005
  • Current text-based XML document systems edit plain text and do not perform syntax checking. As a result, whether an edited XML document is well-formed or valid cannot be determined until it is parsed. This paper describes the design and implementation of a syntax-directed editing system for XML documents. Because the system is tree-based, XML documents are easy to extend. The system is also designed to validate XML documents in real time. It is expected to contribute to the development of XML-related applications.
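
To illustrate the problem this abstract starts from, the sketch below shows the parse-time check that a plain text editor defers until the document is saved and parsed. It uses Python's standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` is an assumption for this example and is unrelated to the editor described in the paper. Note that this parser checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for unbalanced tags, stray text outside the root, and similar errors.
        return False
```

A syntax-directed editor avoids this after-the-fact check by letting the user edit the tree itself, so an unbalanced tag can never be entered in the first place.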

Ontology Based Semantic Search System Using Inference (온톨로지를 통한 추론형 시멘틱 검색 시스템에 관한 연구)

  • 하상범;박영택
    • Proceedings of the Korean Information Science Society Conference / 2004.04b / pp.625-627 / 2004
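  • With the emergence of the Semantic Web, it has become possible to use ontologies to create documents whose meaning (semantics) agents can understand. The Semantic Web can be extended to information retrieval systems based on semantic search, as a way to raise business efficiency and thereby maximize profit. Existing systems, which store documents in a database and retrieve them either with database queries or with general keyword-based information retrieval techniques, have been studied extensively in many fields. This paper presents research results on an ontology-based semantic search system that applies inference, with a focus on document retrieval. We show that, through the ontology and inference, the proposed system can retrieve documents that cannot be found with conventional database queries or with the simple keyword matching of information management systems. The approach covers a search scope similar to that of natural-language search. This is possible because it does not rely solely on keyword similarity. Instead, it performs inference over metadata generated from the meanings predefined in an ontology built on Description Logic. In addition, the system adopts the database-backed question-answering approach used in existing information management systems and is linked to a DQL interface that can answer queries over the ontology representation language, which maximizes the speed and efficiency of the system.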

X-File Viewer on a Mobile Platform (모바일 플랫폼상의 X-File Viewer)

  • Ha, Kyeoung-Ju
    • Journal of Korea Society of Industrial Information Systems / v.15 no.4 / pp.61-70 / 2010
  • In this paper, we propose a mobile document viewer that runs on a variety of mobile platforms. The proposed viewer is built from a file decoding engine and independent modules that are designed to be OS-independent. By analyzing the characteristics of the document file, the proposed viewer can also serve as the basis of a document editing tool.

Self-Supervised Document Representation Method

  • Yun, Yeoil;Kim, Namgyu
    • Journal of the Korea Society of Computer and Information / v.25 no.5 / pp.187-197 / 2020
  • Recently, various text embedding methods based on deep learning algorithms have been proposed. In particular, pre-trained language models, which are trained on a tremendous amount of text data, are now the main means of embedding new text data. However, traditional pre-trained language models have a limitation: when a text contains too many tokens, they struggle to capture the unique context of the new text. In this paper, we propose a self-supervised, learning-based fine-tuning method that enables a pre-trained language model to infer vectors for long texts. We applied the method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirmed that the vectors generated by the proposed model express the inherent characteristics of a document more accurately than the vectors generated by the traditional models.

CNN-Based Novelty Detection with Effectively Incorporating Document-Level Information (효과적인 문서 수준의 정보를 이용한 합성곱 신경망 기반의 신규성 탐지)

  • Jo, Seongung;Oh, Heung-Seon;Im, Sanghun;Kim, Seonho
    • KIPS Transactions on Computer and Communication Systems / v.9 no.10 / pp.231-238 / 2020
  • With the large number of documents appearing on the web, document-level novelty detection has become important because it reduces the effort of finding novel documents by discarding those that share redundant, already-seen information. A recent work proposed a convolutional neural network (CNN)-based novelty detection model with significant performance improvements. We observed that this model is restricted in its use of document-level information when determining novelty, while we assume that document-level information is more important. As a solution, this paper proposes two methods for effectively incorporating document-level information into a CNN-based novelty detection model. Our methods focus on constructing a feature vector for the target document to be classified by extracting relative information between the target document and the source documents given as evidence. A series of experiments showed the superiority of our methods on a standard benchmark collection, TAP-DLND 1.0.

An Efficient Preprocessing System for Searching Similar Texts among Massive Document Repository (대용량 문서 집합에서 유사 문서 탐색을 위한 효과적인 전처리 시스템의 설계)

  • Park, Sun-Young;Kim, Ji-Hun;Kim, Seon-Yeong;Kim, Hyung-Joon;Cho, Hwan-Gue
    • Journal of KIISE:Computing Practices and Letters / v.16 no.5 / pp.626-630 / 2010
  • Since plagiarism in papers has become an important social issue, a system for measuring the similarity between papers is needed. The speed and accuracy of such a system are very important features, and many researchers are studying them. In this paper, we propose a preprocessing method that uses a 'Global Dictionary' model to improve the performance of the system. The global dictionary holds information on every word in the document repository. The system uses the model to find similar papers with low computing time. In our experiment, the filtering techniques reduced a set of more than 20,000 documents to about 50 documents, which demonstrates the effectiveness of the system.
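
The abstract does not spell out how the 'Global Dictionary' is structured, so the following is only a minimal sketch of one plausible reading: an inverted index over the repository that is used to shortlist candidates before any expensive pairwise comparison. The function names and the `max_df` and `min_shared` thresholds are hypothetical choices for this example, not values from the paper.

```python
from collections import Counter, defaultdict

def build_global_dictionary(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def candidate_documents(query, index, n_docs, max_df=0.5, min_shared=2):
    """Ids of documents sharing at least min_shared uncommon words with query."""
    shared = Counter()
    for word in set(query.lower().split()):
        holders = index.get(word, ())
        # Skip words found in more than max_df of all documents:
        # they are too common to discriminate between papers.
        if 0 < len(holders) <= max_df * n_docs:
            shared.update(holders)
    return {doc_id for doc_id, count in shared.items() if count >= min_shared}
```

Only the documents returned by `candidate_documents` would then be passed to a slower, more precise similarity measure, which is how a filter of this kind can cut a large repository down to a short candidate list.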

Design and Development of Hybrid Documents Authoring Tool (하이브리드 문서 저작도구의 설계 및 개발)

  • Hong Kwang-Jin;Jung Kee-Chul
    • Journal of Korea Multimedia Society / v.9 no.4 / pp.377-387 / 2006
  • Digital documents are taking the place of paper (off-line) documents because of the advantages of digital (on-line) documents: they supply information through dynamic content and are easy to share. However, users still prefer paper documents for their own advantages: they are inexpensive, easy to carry, and easy to read. Therefore, to bring the advantages of digital documents to users who prefer paper, many laboratories are studying methods that augment paper documents with digital content. In this paper, we propose the Hybrid Documents Authoring Tool (HDAT), which can insert, delete, and modify on-line information attached to off-line documents. The proposed system is a unified authoring tool for reading and writing on-line information. Using computer vision technology, it provides a natural environment for users without additional devices such as markers or patterns for retrieving documents. Experimental results indicate that the proposed system has high usability and good efficiency in various environments, and our measurements show that its system requirements are low.

A Study on the efficiency of similarity and clustering measure in Historical Writing Document (역사적 기록 문서에서 효율적인 유사도 및 클러스터링 측정에 관한 연구)

  • 한광덕
    • Journal of the Korea Society of Computer and Information / v.7 no.4 / pp.94-101 / 2002
  • Many changes are expected in mass media and in the way documents are presented as documents on the web become more diverse, complex, and massive. The Annals of the Chosun Dynasty is a very important document for researching historical facts and has been published on CD-ROM. However, the CD-ROM is organized by content and offers only a simple search method, so it is very difficult to determine event relationships between documents. We therefore studied how to discover event relationships between documents in the Annals of the Chosun Dynasty through clustering and an efficient similarity method. As our research method, we identified the best similarity method for historical written documents by simulating similarity measures on documents from the Annals. We then clustered the documents in simulation based on similarity probability. In the evaluation, the clustered documents matched the results obtained manually.

Document Clustering based on Level-wise Stop-word Removing for an Efficient Document Searching (효율적인 문서검색을 위한 레벨별 불용어 제거에 기반한 문서 클러스터링)

  • Joo, Kil Hong;Lee, Won Suk
    • The Journal of Korean Association of Computer Education / v.11 no.3 / pp.67-80 / 2008
  • Various document categorization methods have been studied to give users an effective way of browsing a large collection of documents. These methods automatically partition a set of documents into groups of semantically similar documents. However, automatic categorization suffers from low accuracy. This paper proposes a semi-automatic document categorization method based on the domains of documents. Each document belongs to its initial domain. All the documents in each domain are recursively clustered in a level-wise manner, so that a category tree of the documents can be built. To find the clusters of documents, stop-words are removed from each document based on the document frequency of each word in the domain. For each cluster, cluster keywords are extracted from the keywords common to its documents and are used as the category of the domain. Each cluster is then regarded as a more specific domain, and the same procedure is repeated recursively until the user terminates it. At each level of clustering, the user can reassign any incorrectly clustered documents to improve the accuracy of the categorization.
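
The level-wise stop-word idea can be illustrated with a short sketch, under stated assumptions. Documents are represented as lists of already-tokenized words, a word counts as a domain stop-word when its document frequency within the current domain exceeds a ratio (`max_df_ratio`, a hypothetical parameter), and cluster keywords are taken as the words common to every document in the cluster. The clustering step itself is not specified in the abstract and is left out here.

```python
from collections import Counter

def domain_stop_words(docs, max_df_ratio=0.8):
    """Words whose document frequency within this domain exceeds the ratio."""
    df = Counter(word for words in docs for word in set(words))
    return {word for word, count in df.items() if count / len(docs) > max_df_ratio}

def strip_stop_words(docs, stop_words):
    """Remove the given stop-words from every document of the domain."""
    return [[w for w in words if w not in stop_words] for words in docs]

def cluster_keywords(docs):
    """Keywords shared by every document of a cluster."""
    return set.intersection(*(set(words) for words in docs)) if docs else set()

# Usage: the stop-word set is recomputed at each level of the category tree.
domain = [
    ["document", "image", "skew"],
    ["document", "image", "scan"],
    ["document", "xml", "editor"],
    ["document", "xml", "parser"],
]
top_level_stops = domain_stop_words(domain)        # words common to the whole domain
cleaned = strip_stop_words(domain, top_level_stops)
image_cluster = cleaned[:2]                        # assume clustering grouped these two
next_level_stops = domain_stop_words(image_cluster)
```

The point of recomputing the stop-words per level is that a word which names a cluster at one level carries no information inside that cluster, so it is dropped before the cluster is split further.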
