• Title/Summary/Keyword: document analysis

Search Results: 1,190

Document Image Layout Analysis Using Image Filters and Constrained Conditions (이미지 필터와 제한조건을 이용한 문서영상 구조분석)

  • Jang, Dae-Geun;Hwang, Chan-Sik
    • The KIPS Transactions: Part B
    • /
    • v.9B no.3
    • /
    • pp.311-318
    • /
    • 2002
  • Document image layout analysis comprises two steps: segmenting the document image into detailed regions, and classifying the segmented regions as text, picture, table, and so on. In the region classification step, the size of a region, the density of black pixels, and the complexity of the pixel distribution serve as the classification criteria. For pictures, however, the ranges of these criteria are so wide that it is difficult to set a classification threshold separating pictures from other region types; as a result, pictures suffer a higher classification error rate than other region types. In this paper, we propose a document image layout analysis method that classifies picture and text regions more accurately than previous methods, including commercial software. In the picture and text region classification, a median filter is used to reduce the influence of region size, black-pixel density, and pixel-distribution complexity. Furthermore, classification errors are corrected by a region-expanding filter and constrained conditions.
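
The abstract above does not give code, so the following is only a minimal sketch, in Python with OpenCV and NumPy, of how median filtering and a black-pixel-density threshold might be combined to separate picture-like from text-like regions. The function name, kernel size, and threshold value are illustrative assumptions, not the authors' actual method or tuned parameters.

```python
import cv2
import numpy as np

def classify_regions(gray_image, kernel_size=5, density_threshold=0.35):
    """Illustrative sketch: median-filter a binarized page, label connected
    regions, then call dense regions 'picture' and sparse regions 'text'.
    Thresholds are assumptions, not the paper's values."""
    # Binarize: foreground (ink) becomes white on a black background.
    _, binary = cv2.threshold(gray_image, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Median filtering suppresses isolated pixels, so region size and
    # pixel-distribution complexity vary less across region types.
    smoothed = cv2.medianBlur(binary, kernel_size)

    # Label connected components as candidate regions.
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(smoothed)

    results = []
    for label in range(1, num_labels):  # skip the background label 0
        x, y, w, h, area = stats[label]
        roi = binary[y:y + h, x:x + w]
        density = float(np.count_nonzero(roi)) / (w * h)
        region_type = "picture" if density > density_threshold else "text"
        results.append(((x, y, w, h), region_type))
    return results
```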

A Study on Plagiarism Detection and Document Classification Using Association Analysis (연관분석을 이용한 효과적인 표절검사 및 문서분류에 관한 연구)

  • Hwang, Insoo
    • The Journal of Information Systems
    • /
    • v.23 no.3
    • /
    • pp.127-142
    • /
    • 2014
  • Plagiarism occurs when content is copied without permission or citation, and the problem has grown rapidly in the digital era of resources available on the World Wide Web. An important task in plagiarism detection is measuring and identifying similar text portions between a given pair of documents. One of the main difficulties of this task is that not all similar text fragments are examples of plagiarism, since thematic coincidences also tend to produce portions of similar text. To handle this problem, this paper proposes using association analysis from data mining to detect plagiarism. The method can detect common actions performed by plagiarists, such as word deletion, insertion, and transposition, making it possible to identify plausible portions of plagiarized text. Experimental results employing an unsupervised document classification strategy showed that the proposed method outperformed traditionally used approaches.
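
The paper's actual association-analysis procedure is not reproduced in the abstract; the sketch below is only a loose analogue of the idea in Python. Treating each sentence as an itemset of words makes the comparison insensitive to word transposition and tolerant of some insertion and deletion, which is the intuition described above. The function names and the overlap threshold are hypothetical.

```python
import re

def sentence_itemsets(text):
    """Split text into sentences and represent each as a set of word 'items'."""
    sentences = re.split(r'[.!?]+', text.lower())
    return [frozenset(re.findall(r'[a-z가-힣]+', s)) for s in sentences if s.strip()]

def suspicious_pairs(doc_a, doc_b, min_overlap=0.6):
    """Flag sentence pairs whose shared-word ratio exceeds a threshold."""
    pairs = []
    for i, a in enumerate(sentence_itemsets(doc_a)):
        for j, b in enumerate(sentence_itemsets(doc_b)):
            if not a or not b:
                continue
            # Set overlap ignores word order (transposition) and tolerates
            # a few inserted or deleted words.
            overlap = len(a & b) / min(len(a), len(b))
            if overlap >= min_overlap:
                pairs.append((i, j, round(overlap, 2)))
    return pairs
```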

An Analysis of Transaction Data of Document Delivery Service in Science & Technology Field (과학기술분야 문헌제공서비스의 트랜잭션 데이터 분석 연구)

  • 김홍렬
    • Journal of the Korean Society for Information Management
    • /
    • v.21 no.2
    • /
    • pp.169-187
    • /
    • 2004
  • The purpose of this study is to analyze the usage patterns of domestic users of document delivery services, based on usage transaction data from the photocopying services of KISTI-DDS, the most important document delivery organization in Korea. For this purpose, the study examined the number of processed documents, the most frequently requested document types, the ordering coverage for photocopying, and the delivery methods of photocopied documents, using DDS (document delivery service) transaction data from the four years 2000 to 2003.
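
The abstract does not describe the KISTI-DDS transaction log schema; as a rough sketch of the kind of aggregation such a study involves, the Python/pandas snippet below assumes hypothetical column and file names (order_date, document_type, delivery_method) purely for illustration.

```python
import pandas as pd

# Hypothetical file and column names; the real transaction schema is not given.
transactions = pd.read_csv("dds_transactions_2000_2003.csv",
                           parse_dates=["order_date"])

# Number of processed documents per year (2000-2003).
per_year = transactions.groupby(transactions["order_date"].dt.year).size()

# Most frequently requested document types.
by_type = transactions["document_type"].value_counts()

# Share of each delivery method for the photocopied documents.
by_delivery = transactions["delivery_method"].value_counts(normalize=True)

print(per_year, by_type.head(10), by_delivery, sep="\n\n")
```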

Investigation on Uncertainty in Construction Bid Documents

  • Shrestha, Rabin;Lee, JeeHee
    • International Conference on Construction Engineering and Project Management
    • /
    • 2022.06a
    • /
    • pp.67-73
    • /
    • 2022
  • Construction bid documents contain various errors or discrepancies that give rise to uncertainties. The errors, discrepancies, and ambiguities in a bid document, if not identified and clarified before the bid, may cause disputes and conflicts between the contracting parties. Given that the bid document is a major resource in estimating construction costs, inaccurate information in it can result in over- or under-estimating. Thus, any questions from bidders related to errors in the bid document should be clarified by employers before bid submission. This study examines pre-bid queries, i.e., pre-bid requests for information (RFIs), from state DOTs in the United States to investigate the error types most frequently encountered in bid documents. For the study, around 200 pre-bid RFIs were collected from state DOTs and classified into several error types (e.g., coordination errors, errors in drawings). The analysis of the data showed that errors in the bill of quantities are the most frequent errors in bid documents, followed by errors in drawings. The study findings identify uncertainty types in construction bid documents that should be checked during a bid process and, in a broader sense, contribute to advancing the construction management body of knowledge by clarifying and classifying bid risk factors at an early stage of construction projects.
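
The frequency analysis described above amounts to tallying classified RFIs by error type. A minimal Python sketch follows; the example records and category labels are hypothetical placeholders, not the study's data.

```python
from collections import Counter

# Hypothetical examples; the study's ~200 RFIs and full error taxonomy
# are not reproduced in the abstract.
classified_rfis = [
    ("RFI-001", "errors in bill of quantities"),
    ("RFI-002", "errors in drawings"),
    ("RFI-003", "coordination error"),
    ("RFI-004", "errors in bill of quantities"),
]

# Count how often each error type appears and report its share.
frequency = Counter(error_type for _, error_type in classified_rfis)
for error_type, count in frequency.most_common():
    share = count / len(classified_rfis)
    print(f"{error_type}: {count} ({share:.0%})")
```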

DEVELOPMENT OF BEST PRACTICE GUIDELINES FOR CFD IN NUCLEAR REACTOR SAFETY

  • Mahaffy, John
    • Nuclear Engineering and Technology
    • /
    • v.42 no.4
    • /
    • pp.377-381
    • /
    • 2010
  • In 2007 the Nuclear Energy Agency's Committee on the Safety of Nuclear Installations published Best Practice Guidelines for the use of CFD in Nuclear Reactor Safety. This paper provides an overview of the document's contents and highlights a few of its recommendations. The document covers the full extent of a CFD analysis, from initial problem definition and selection of an appropriate tool for the analysis through final documentation of results. It provides advice on the selection of appropriate simulation software, mesh construction, and the selection of physical models. In addition, it contains an extensive discussion of the verification and validation process that should accompany any high-quality CFD analysis.

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

  • Hong, Jin-Sung;Kim, Namgyu;Lee, Sangwon
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.77-92
    • /
    • 2014
  • Recently, numerous documents, including unstructured data and text, have been created due to the rapid increase in the use of social media and the Internet. Each document is usually assigned a specific category for the convenience of users. In the past, this categorization was performed manually. With manual categorization, however, not only is the accuracy of the categorization not guaranteed, but the process also requires a large amount of time and incurs huge costs. Many studies have been conducted on automatic categorization to overcome the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics, because they assume that each document can be assigned to only one category. To overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process requires training on a multi-categorized document set; these methods therefore cannot be applied to the multi-categorization of most documents unless multi-categorized training sets are provided. To remove the requirement of a multi-categorized training set imposed by traditional multi-categorization algorithms, we propose a new methodology that can extend the category of a single-categorized document to multiple categories by analyzing the relationships among categories, topics, and documents. First, we find the relationship between documents and topics by using the result of topic analysis on single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate matching scores for each document against multiple categories. A document is classified into a certain category if and only if its matching score is higher than a predefined threshold; for example, a document may be classified into three categories whose matching scores exceed the threshold. The main contribution of our study is that our methodology can improve the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology. For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized by theme, and the use of vulgar language and slang is less frequent than in other typical text documents. We collected news articles from July 2012 to June 2013. The number of articles varies widely across categories, because readers have different levels of interest in each category and because events occur with different frequencies in each category. To minimize the distortion caused by the differing number of articles per category, we extracted 3,000 articles from each of the eight categories, for a total of 24,000 articles used in our experiments. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics."
Using the collected news articles, we calculated document/category correspondence scores from the topic/category and document/topic correspondence scores. The document/category correspondence score indicates the degree to which each document corresponds to a certain category. As a result, we could suggest two additional categories for each of 23,089 documents. Precision, recall, and F-score were 0.605, 0.629, and 0.617, respectively, when only the top-1 predicted category was evaluated, and 0.838, 0.290, and 0.431 when the top 1-3 predicted categories were considered. Interestingly, precision, recall, and F-score varied substantially across the eight categories.
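
The scoring step described above can be pictured as multiplying a document/topic matrix by a topic/category correspondence table and then thresholding the result. The NumPy sketch below is an assumed, simplified form; the matrices, their values, and the threshold are illustrative, not the paper's.

```python
import numpy as np

# doc_topic[d, t]: strength of topic t in document d (e.g. from topic analysis).
# topic_category[t, c]: correspondence of topic t to category c, built from
# the single-categorized documents. Both matrices are illustrative stand-ins.
doc_topic = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3]])
topic_category = np.array([[0.9, 0.1, 0.0, 0.0],
                           [0.0, 0.8, 0.2, 0.0],
                           [0.0, 0.1, 0.4, 0.5]])

# Document/category matching score = weighted sum over topics.
doc_category = doc_topic @ topic_category

threshold = 0.3  # assumed value; the paper uses a predefined threshold
extra_categories = [np.flatnonzero(scores >= threshold).tolist()
                    for scores in doc_category]
print(doc_category)
print(extra_categories)  # categories assigned to each document
```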

Block Classification of Document Images by Block Attributes and Texture Features (블록의 속성과 질감특징을 이용한 문서영상의 블록분류)

  • Jang, Young-Nae;Kim, Joong-Soo;Lee, Cheol-Hee
    • Journal of Korea Multimedia Society
    • /
    • v.10 no.7
    • /
    • pp.856-868
    • /
    • 2007
  • We propose an effective method for block classification in a document image. The gray-level document image is converted to a binary image for block segmentation. The binary image is smoothed to find the location and size of each block, and during this smoothing the inner block height of each block is obtained. The gray-level image is then divided into blocks using this location information. An SGLDM (spatial gray-level dependence matrix) is computed for each gray-level document block, and seven second-order statistical texture features, which capture the document attributes, are extracted from the (0,1)-direction SGLDM. Document image blocks are first classified into two groups, text and non-text, by the inner block height using the nearest neighbor rule. The seven texture features extracted from the SGLDM are then used to distinguish five detailed categories: small font, large font, table, graphic, and photo blocks. The classified blocks are useful not only for the structure analysis stage of document recognition but also for various other applications.
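
As a rough illustration of extracting second-order texture features from a gray-level block and classifying with a nearest-neighbor rule, the sketch below uses scikit-image's gray-level co-occurrence matrix (a standard implementation of the SGLDM idea) and scikit-learn's k-NN. The offset direction, the particular seven-feature set, the gray-level quantization, and the training data are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.neighbors import KNeighborsClassifier

def texture_features(block, levels=64):
    """Second-order statistics from a single-direction co-occurrence matrix.
    Assumes `block` is an 8-bit gray-level image patch."""
    block = (block // (256 // levels)).astype(np.uint8)   # reduce gray levels
    glcm = graycomatrix(block, distances=[1], angles=[np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    props = ["contrast", "dissimilarity", "homogeneity",
             "energy", "correlation", "ASM"]
    feats = [graycoprops(glcm, p)[0, 0] for p in props]
    # Entropy as a seventh feature (not provided directly by graycoprops).
    p = glcm[:, :, 0, 0]
    feats.append(-np.sum(p[p > 0] * np.log2(p[p > 0])))
    return feats

# Hypothetical usage with labeled blocks (small font, large font, table, ...):
# X_train = [texture_features(b) for b in labeled_blocks]
# knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, labels)
# knn.predict([texture_features(unknown_block)])
```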

Feature Expansion based on LDA Word Distribution for Performance Improvement of Informal Document Classification (비격식 문서 분류 성능 개선을 위한 LDA 단어 분포 기반의 자질 확장)

  • Lee, Hokyung;Yang, Seon;Ko, Youngjoong
    • Journal of KIISE
    • /
    • v.43 no.9
    • /
    • pp.1008-1014
    • /
    • 2016
  • Data such as Twitter posts, Facebook posts, and customer reviews belong to the informal document group, whereas newspapers, which go through a grammar correction step, belong to the formal document group. Finding consistent rules or patterns in informal documents is difficult compared to formal documents, so additional approaches are needed to improve informal document analysis. In this study, we classified Twitter data, a representative type of informal document, into ten categories. To improve performance, we revised and expanded the features based on the LDA (Latent Dirichlet Allocation) word distribution. Using the top-ranked LDA words, the other words were separated or bundled, and the feature set was thus expanded repeatedly. Finally, we conducted document classification with the expanded features. Experimental results indicate that the proposed method improves the micro-averaged F1-score by 7.11%p compared to the results before the feature expansion step.
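
The starting point of such an approach is obtaining the top-ranked words of each LDA topic, which can then drive the separation or bundling of features. The sketch below uses scikit-learn's LDA rather than whatever toolkit the authors used; the placeholder documents, the number of topics, and the top-10 cutoff are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder documents standing in for informal Twitter data.
docs = ["sample tweet about phones and mobile apps today",
        "another short informal post about sports scores and games"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

# Top-ranked words per topic; documents sharing a topic's top words could
# have related words bundled or separated as expanded features.
top_words = {t: [vocab[i] for i in comp.argsort()[::-1][:10]]
             for t, comp in enumerate(lda.components_)}
print(top_words)
```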

A Study on Library Exemption of Document Delivery Service by Interlibrary Loan (상호대차에 의한 원문복사서비스의 도서관 면책에 관한 연구)

  • Hong, Jae-Hyun
    • Journal of the Korean Society for Information Management
    • /
    • v.22 no.1 s.55
    • /
    • pp.21-45
    • /
    • 2005
  • The document delivery service is an advanced service that satisfies the information needs of end users through the cooperative utilization of information. At present, there are various interpretations of the library exemption as it applies to document delivery service by interlibrary loan. The purpose of this study is to suggest a legal plan for solving the problems of the library exemption with respect to document delivery by fax and the Ariel system. The study examines international trends in library exemptions related to document delivery in foreign copyright laws, analyzes the library exemption related to document delivery in the present Korean copyright law, and points out the problems of the present exemption. Based on this analysis, the study suggests a revision of the present Korean copyright law to provide a regulation on the library exemption for document delivery service by interlibrary loan. The proposed revision can serve as basic data for establishing such a regulation for the future of the library exemption in Korea.

Document Analysis based Main Requisite Extraction System (문서 분석 기반 주요 요소 추출 시스템)

  • Lee, Jongwon;Yeo, Ilyeon;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.23 no.4
    • /
    • pp.401-406
    • /
    • 2019
  • In this paper, we propose a system for analyzing papers and reports in XML format. The system extracts the keywords of a paper or report and shows them to the user; when the user enters the keywords he or she wants to search for within the document, the system extracts the paragraphs containing those keywords. It checks the frequency of the keywords entered by the user, calculates weights, and removes paragraphs that contain only the keyword with the lowest weight. The refined paragraphs are then divided into 10 regions, the importance of the paragraphs in each region is calculated, the importance of the regions is compared, and the user is informed of the main region with the highest importance. With these features, the proposed system can provide the main paragraphs at a higher compression ratio than analyzing papers or reports with existing document analysis systems, reducing the time required to understand a document.
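
As an illustration of the kind of pipeline the abstract outlines (keyword frequency weighting, filtering low-weight paragraphs, splitting the remainder into ten regions, and ranking regions by importance), here is a minimal Python sketch. The weighting and importance formulas are assumptions; the paper's exact definitions are not given in the abstract.

```python
from collections import Counter

def main_region(paragraphs, keywords, num_regions=10):
    """Toy sketch: weight keywords by frequency, drop paragraphs that only
    contain the lowest-weight keyword, then split into regions and return
    the region with the highest summed keyword weight."""
    text = " ".join(paragraphs).lower()
    weights = Counter({k: text.count(k.lower()) for k in keywords})
    lowest = min(weights, key=weights.get)

    # Keep paragraphs containing at least one keyword other than the weakest.
    refined = [p for p in paragraphs
               if any(k.lower() in p.lower() for k in keywords if k != lowest)]

    # Divide the refined paragraphs into roughly equal regions.
    size = max(1, len(refined) // num_regions)
    regions = [refined[i:i + size] for i in range(0, len(refined), size)]

    def importance(region):
        return sum(weights[k] for p in region for k in keywords
                   if k.lower() in p.lower())

    return max(regions, key=importance) if regions else []
```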