• Title/Summary/Keyword: 문서 병합

Search Result 40, Processing Time 0.025 seconds

Automatic Text Categorization Using Passage-based Weight Function and Passage Type (문단 단위 가중치 함수와 문단 타입을 이용한 문서 범주화)

  • Joo, Won-Kyun;Kim, Jin-Suk;Choi, Ki-Seok
    • The KIPS Transactions:PartB
    • /
    • v.12B no.6 s.102
    • /
    • pp.703-714
    • /
    • 2005
  • Researches in text categorization have been confined to whole-document-level classification, probably due to lacks of full-text test collections. However, full-length documents availably today in large quantities pose renewed interests in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of sub-topic text blocks, or passages. In order to reflect the sub-topic structure of a document, we propose a new passage-level or passage-based text categorization model, which segments a test document into several Passages, assigns categories to each passage, and merges passage categories to document categories. Compared with traditional document-level categorization, two additional steps, passage splitting and category merging, are required in this model. By using four subsets of Routers text categorization test collection and a full-text test collection of which documents are varying from tens of kilobytes to hundreds, we evaluated the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show simple windows are best for all test collections tested in these experiments. We also found that passages have different degrees of contribution to main topic(s), depending on their location in the test document.

Design and Implementation of XML Document Transformation System based on Structured Differences Analysis (구조적 상이성 분석에 기반한 XML 문서 변환 시스템의 설계 및 구현)

  • Jo, Jeong-Gil;Jo, Yun-Gi;Gu, Yeon-Seol
    • The KIPS Transactions:PartD
    • /
    • v.9D no.2
    • /
    • pp.297-306
    • /
    • 2002
  • This paper handles the design and implementation of the system for transforming the XML document bated on XML Schema being different in syntax but similar in logic, with using structured differences analysis. In the system, the merge data is generated from the source and destination documents by utilizing data registry and structured differences analysis, and then XML document is generated from the generated merge data. The XML document transformation system is designed that transformation process to the present application system from the different application system gains advantage in the aspect of time, cost, and reliability. The implementation environment of the system is that it is run on IBM compatible PC and it is developed using the software of visual basic 6.0 with the Platform of Windows 2000.

Global Encoding Technique for Indexing Multiple XML Documents (다중 XML 문서 인덱싱을 위한 전역 인코딩 기법)

  • Bae, Jin-Uk;Moon, Bong-Ki;Lee, Suk-Ho
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.07b
    • /
    • pp.73-75
    • /
    • 2005
  • 지금까지 제안된 구조조인 알고리즘들은 하나의 XML 문서에 대해 복잡한 질의를 빠르게 처리할 수 있다는 장점이 있다. 하지만, 다중 문서를 처리할 때 각 문서에 부여된 문서식별자에 의해 문서별 질의 처리를 하기 때문에, 문서의 수가 증가한다면 질의 처리 시간도 길어진다는 문제점이 발생한다. 이 논문에서는 이 문제를 해결하기 위해 XML 문서를 XMAS 트리로 병합한 뒤 전역적으로 인코딩을 하는 기법을 제안한다. XMAS 트리는 각 문서의 구조 정보를 유지한 채 공통된 부분을 공유하는 트리이다. 이 공유에 의해서 질의 처리시에 성능 향상을 얻을 수 있다. 실험 결과, 선형 질의에 대해 수백 배, 가지모양 질의에 대해 수십 배 빠르게 질의를 처리할 수 있었다.

  • PDF

Comparison of Document Clustering Performance Using Various Dimension Reduction Methods (다양한 차원 축소 기법을 적용한 문서 군집화 성능 비교)

  • Cho, Heeryon
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2018.05a
    • /
    • pp.437-438
    • /
    • 2018
  • 문서 군집화 성능을 높이기 위한 한 방법으로 차원 축소를 적용한 문서 벡터로 군집화를 실시하는 방법이 있다. 본 발표에서는 특이값 분해(SVD), 커널 주성분 분석(Kernel PCA), Doc2Vec 등의 차원 축소 기법을, K-평균 군집화(K-means clustering), 계층적 병합 군집화(hierarchical agglomerative clustering), 스펙트럼 군집화(spectral clustering)에 적용하고, 그 성능을 비교해 본다.

A Design of Item pool System using XML (XML을 이용한 문제은행 시스템 설계)

  • 강동헌;김종우
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2001.10b
    • /
    • pp.556-558
    • /
    • 2001
  • 본 논문에서는 XML을 이용하여 문제은행 시스템(Item pool System)을 설계하였다. XML은 단순하면서도 구조적인 내용을 표현하는데 용이하며, 확장성이 뛰어난 마크업 언어이다. 문제 은행은 문항의 형태, 내용, 난이도, 등을 포함한 군항의 특성과 관련된 정보들을 체계적으로 제시해야하므로 데이터베이스에 저장된 내용들을 체계적으로 표현할 수 있어야한다. 따라서, 본 논문은 내용과 관련된 태그를 직접 만들 수 있고, 여러 개의 XML 문서들을 하나의 큰 문서로 병합할 수 있으며, 어떠한 종류의 응용프로그램과도 통합할 수 있는 범용적 데이터베이스라고도 할 수 있는 XML의 특성을 이용하여 좀더 확장성이 좋아지게 하였다.

  • PDF

Practical Page Segmentation using Connected Components and Color Information (연결요소와 색상정보를 이용한 실제적 문서영상 분할)

  • Kim, Pyeoung-Kee
    • The Transactions of the Korea Information Processing Society
    • /
    • v.7 no.1
    • /
    • pp.273-285
    • /
    • 2000
  • While page segmentation is an important step in document recognition, there haven's been many researches on it. More improvement is still needed on the segmentation of document elements in complicated or color documents. In this paper, I present a new page segmentation method which can segment pages with multiple columns, dotted lines, graphics, and photographs. I extract all connected components using contour following and combine them depending on the size and positional information of them. Separate text location is done for non-text color regions to extract possible text lines. To see the performance of the proposed method, experiments are done for 180 documents. Four commercial OCR programs are also tested and the proposed method showed the best result.

  • PDF

A Study pn Development of collaborative Document Authoring system based on DOM (DOM에 기반한 공동 문서 저작 시스템 구현에 관한 연구)

  • Yu, Seong-Ju;Kim, Cha-Jong;Shin, Hyun-Sub
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.14 no.12
    • /
    • pp.2601-2608
    • /
    • 2010
  • It is difficult to merge text document and to remake use of documents on the most collaborative document authoring system using text document, and also to provide the storage place for saving and keeping documents. It has vulnerable drawbacks about the security though it provides the accessible abilities due to basing it on Web. In this paper, we design and implementation the collaborative document authoring system for XML document to improve a couple of problems on these systems. For these, we based on the DOM to manipulate the modeling object documents and utilized RMI on this system without considering socket communication when it transmits and receives Java objects. We improved the security through processes of authentication. By providing templates and editing functions such as annotation, visualization of document structures, we made easier making collaborative document authoring more than ever.

Text Detection and Binarization using Color Variance and an Improved K-means Color Clustering in Camera-captured Images (카메라 획득 영상에서의 색 분산 및 개선된 K-means 색 병합을 이용한 텍스트 영역 추출 및 이진화)

  • Song Young-Ja;Choi Yeong-Woo
    • The KIPS Transactions:PartB
    • /
    • v.13B no.3 s.106
    • /
    • pp.205-214
    • /
    • 2006
  • Texts in images have significant and detailed information about the scenes, and if we can automatically detect and recognize those texts in real-time, it can be used in various applications. In this paper, we propose a new text detection method that can find texts from the various camera-captured images and propose a text segmentation method from the detected text regions. The detection method proposes color variance as a detection feature in RGB color space, and the segmentation method suggests an improved K-means color clustering in RGB color space. We have tested the proposed methods using various kinds of document style and natural scene images captured by digital cameras and mobile-phone camera, and we also tested the method with a portion of ICDAR[1] contest images.

Feature Expansion based on LDA Word Distribution for Performance Improvement of Informal Document Classification (비격식 문서 분류 성능 개선을 위한 LDA 단어 분포 기반의 자질 확장)

  • Lee, Hokyung;Yang, Seon;Ko, Youngjoong
    • Journal of KIISE
    • /
    • v.43 no.9
    • /
    • pp.1008-1014
    • /
    • 2016
  • Data such as Twitter, Facebook, and customer reviews belong to the informal document group, whereas, newspapers that have grammar correction step belong to the formal document group. Finding consistent rules or patterns in informal documents is difficult, as compared to formal documents. Hence, there is a need for additional approaches to improve informal document analysis. In this study, we classified Twitter data, a representative informal document, into ten categories. To improve performance, we revised and expanded features based on LDA(Latent Dirichlet allocation) word distribution. Using LDA top-ranked words, the other words were separated or bundled, and the feature set was thus expanded repeatedly. Finally, we conducted document classification with the expanded features. Experimental results indicated that the proposed method improved the micro-averaged F1-score of 7.11%p, as compared to the results before the feature expansion step.

Table Structure Recognition using Borderline Heatmap Regression (딥러닝 기반의 표 경계선 히트맵 회귀를 이용한 표의 구조 인식)

  • Lee, EunJi;Park, Jaewoo;Koo, Hyung Il;Cho, Nam Ik
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • fall
    • /
    • pp.84-87
    • /
    • 2021
  • 본 논문에서는 딥러닝을 기반으로 문서영상에서 표 안의 셀 경계선을 히트맵 회귀(heatmap regression)로 추정함으로써 표의 구조를 인식하는 방법을 제안한다. 표는 기본적으로 행과 열로 이루어져 있기 때문에, 제안하는 방법에서는 먼저 1 차원 벡터 형태로 세로/가로 방향의 행/열 경계선 위치를 찾고, 이에 병합된 셀을 처리하기 위해 경계선이 그어져야 할 위치를 2 차원으로 추정한 결과를 적용하여 온전한 표의 경계선을 구한다. 이러한 구조를 통해 제안하는 방법은 표의 행과 열에 대한 정보를 효과적으로 이용함과 동시에, 복잡한 후처리 없이 병합된 셀을 처리할 수 있는 이점을 보인다. 실험은 1 차원의 행/열 경계선 위치를 반영하는 두 가지 방식에 대해 PubTabNet[11]에 대해 진행하여 결과를 보였다.

  • PDF