• Title/Summary/Keyword: Training Document

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young;Chang, Juno;Kihl, Taesuk
    • Journal of information and communication convergence engineering / v.11 no.4 / pp.268-273 / 2013
  • In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
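The query-generation step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not say how the matching score is computed, so the score used here (a word's count in the category weighted by the share of its occurrences that fall in that category) is an assumption, and `build_queries` and `top_k` are hypothetical names. The call to the Web search API is deliberately left out.

```python
from collections import Counter


def build_queries(docs_by_category, top_k=2):
    """Build Web search queries per category.

    docs_by_category maps a category title to a list of tokenized documents.
    Returns {category title: ["<title> <word>", ...]} for the top_k words.
    """
    # Word counts per category, and over the whole corpus.
    cat_counts = {
        cat: Counter(word for doc in docs for word in doc)
        for cat, docs in docs_by_category.items()
    }
    total = Counter()
    for counts in cat_counts.values():
        total.update(counts)

    queries = {}
    for cat, counts in cat_counts.items():
        # Assumed matching score: count in category * (count in category / count overall).
        # Ties are broken alphabetically so the result is deterministic.
        ranked = sorted(counts, key=lambda w: (-counts[w] * counts[w] / total[w], w))
        # Combine the category title with each higher-ranked word feature.
        queries[cat] = [f"{cat} {word}" for word in ranked[:top_k]]
    return queries


corpus = {
    "sports": [["goal", "match", "team"], ["team", "goal"]],
    "finance": [["stock", "market", "team"], ["stock", "price"]],
}
for category, category_queries in build_queries(corpus).items():
    print(category, category_queries)
```

Each returned query string would then be sent to a search engine, and the returned snippets added to the training corpus of the corresponding category.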

Document Management for Jordan Research and Training Reactor Project by ANSIM (원자력 통합안전경영시스템을 이용한 요르단연구로사업의 문서관리)

  • Park, Kook-Nam;Choi, Min-Ho;Kwon, Yongse
    • Journal of Korean Society of Industrial and Systems Engineering / v.39 no.2 / pp.113-118 / 2016
  • Project management is a tool for smooth operation over the full cycle from design to normal operation, covering schedule, document, and budget management, and document management is an important task for large projects such as the JRTR (Jordan Research and Training Reactor). To manage the large and varied set of documents for a research reactor, a project management system was set up, a project procedure manual was prepared, and a document control system was established. The ANSIM (Advanced Nuclear Safety Information Management) system consists of a document management folder, document container folder, project management folder, organization management folder, and EPC (Engineering, Procurement and Construction) document folder. First, the system computerizes the Inter-office Correspondence (IOC), Document Distribution for Agreement (DDA), Design Documents, and Project Manager Memorandum (PM Memo) work prepared for the research reactor design. Second, it reviews, distributes, and approves design documents within the system, then registers the approved documents and supplies them to the research reactor user. Third, it integrates information on the organizations using the document system and their members, as well as users' rights in the ANSIM document system. Through these functions, the ANSIM system has contributed to the vitalization of joint research. The ANSIM system not only provides design document input, data loading, and search, and manages KAERI's long-accumulated experience and knowledge assets using a management strategy, but has thereby also contributed to research activation and will actively help in the construction of other nuclear facilities and in exports abroad.

The Study of Integrated Document Training Materials Related to NCS Communication Ability for Petty Officer Majors (NCS 의사소통능력과 연계된 부사관과의 자료통합적 문서 교육 연구)

  • Yu, Yong-tae
    • Convergence Security Journal / v.19 no.2 / pp.137-146 / 2019
  • This study examines the educational goal and achievement level for petty officer major students by investigating the relationship between NCS communication abilities and communication education. It also examines appropriate integrated document training materials. The goal of the petty officers' communication education, which is expected to achieve above the average standard, is to improve the ability to understand and create documents related to real petty officer life. This goal is designed by considering the petty officers' ability factors and detailed weekly achievement goals based on the characteristics of petty officers. The integrated document training materials are structured as a three-step process: presenting the subject, group activity, and handing in the final activity report. The education is also designed so that evaluation forms are written continuously, letting students keep track of their achievement levels. As the importance of NCS is increasingly emphasized, the integrated document training materials show how this education should proceed and how to improve students' document writing abilities. Finally, the study proposes further tasks in this field.

Document Image Binarization by GAN with Unpaired Data Training

  • Dang, Quang-Vinh;Lee, Guee-Sang
    • International Journal of Contents / v.16 no.2 / pp.8-18 / 2020
  • Data is critical in deep learning, but data scarcity often occurs in research, especially in the preparation of paired training data. In this paper, document image binarization with unpaired data is studied by introducing adversarial learning, removing the need for supervised or labeled datasets. However, a simple extension of previous unpaired training to binarization inevitably leads to poor performance compared to paired data training. Thus, a new deep learning approach is proposed by introducing a multi-diversity of higher quality generated images. In this paper, a two-stage model is proposed that comprises a generative adversarial network (GAN) followed by a U-net network. In the first stage, the GAN uses the unpaired image data to create paired image data. In the second stage, the generated paired image data are passed through the U-net network for binarization. Thus, the trained U-net becomes the binarization model during testing. The proposed model has been evaluated on the publicly available DIBCO dataset, and it outperforms other techniques on unpaired training data. The paper shows the potential of using unpaired data for binarization, for the first time in the literature, which can be further improved to replace paired data training for binarization in the future.

Establishment of Document Control System for the Jordan Research and Training Reactor Project (요르단연구로건설사업 문서관리시스템 구축)

  • Park, Kook-Nam;Ko, Young-Cheol;Wu, Sang-Ik;Oh, Soo-Youl;Lee, Doo-Jeong
    • Journal of Korean Society of Industrial and Systems Engineering / v.34 no.4 / pp.49-56 / 2011
  • The Jordan Research and Training Reactor (JRTR) Project officially launched in Aug. 2010. JRTR is the first made-in-Korea nuclear system to be built abroad, by the year 2015, and the Korea Atomic Energy Research Institute (KAERI) is responsible for the design of major systems including the reactor core. While the PDCS (Project Document Control System) operated by the EPC company controls all the documents of the whole Project, KAERI is supposed to have its own system for KAERI documents. To meet this need, KAERI has implemented document control for the JRTR Project in the already existing ANSIM (KAERI Advanced Nuclear Safety Information Management) system. The documents of the JRTR project to be controlled are defined in the PPM (Project Procedures Manual), QAP (Quality Assurance Procedure) and PEP (Project Execution Program). The ANSIM consists of the document management holder, document container holder and organization management holder. The document management holder, which is the most important part of ANSIM-JRTR, consists of the DDA (Document Distribution for Agreement), IOC (Inter-office Correspondence), PM Memo (Project Manager Memorandum) and cover sheets of design documents. Other materials such as meeting minutes, sub-department materials and design information materials are stored in an independent COP (Community of Practice). This computerized document control system, ANSIM, could lessen the burden on the project management team and enhance productivity as well.

An Efficient kNN Algorithm (효율적인 kNN 알고리즘)

  • Lee Jae Moon
    • The KIPS Transactions:PartB / v.11B no.7 s.96 / pp.849-854 / 2004
  • This paper proposes an algorithm to improve the execution time of kNN in document classification. The proposed algorithm improves the execution time by minimizing the cost of computing the similarity between two documents by using the list of pairs, while the conventional kNN uses the list of pairs. The list of pairs can be obtained by applying matrix transposition to the list of pairs at the training phase of document classification. This paper analyzes the time complexity of the proposed algorithm and compares it with the conventional kNN. It also compares the proposed algorithm with the conventional kNN experimentally using the Reuters-21578 data. The experimental results show that the proposed algorithm outperforms kNN by about 90% in terms of execution time.
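The abstract as extracted does not say what the two kinds of pairs contain, so the following is only one plausible reading, offered as a sketch: it assumes the training documents are stored as term-to-weight vectors and that the transposed structure is an inverted index mapping each term to (document, weight) pairs. The names `transpose` and `knn` are hypothetical, and the dot product stands in for whatever similarity measure the paper uses.

```python
import heapq
from collections import defaultdict


def transpose(doc_vectors):
    """Turn {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}.

    This is done once, at training time.
    """
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def knn(index, query_vector, k):
    """Return the k training documents with the largest dot product to the query.

    Only documents that share at least one term with the query are visited,
    which is where the saving over a document-by-document scan comes from.
    """
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += query_weight * weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


training = {
    "d1": {"oil": 1.0, "price": 1.0},
    "d2": {"oil": 1.0},
    "d3": {"wheat": 1.0},
}
index = transpose(training)
print(knn(index, {"oil": 1.0, "price": 0.5}, k=2))
```

In this toy run the query never touches `d3`, because no posting list for a query term mentions it.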

Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Shin, Hyun-Kyung
    • Journal of Korea Multimedia Society / v.13 no.12 / pp.1786-1797 / 2010
  • Automatic understanding of the contents of a document image is a very hard problem because it involves mathematically challenging problems that originate mainly from the over-determined system induced by the document segmentation process. In both academia and industry, there have been continual and varied efforts to improve core parts of content retrieval technologies by separating out segmentation-related issues using semi-structured documents, e.g., invoices. In this paper we propose classification models for text lines in invoice documents, in which text lines are clustered into five categories in accordance with their contents: purchase order header, invoice header, summary header, surcharge header, and purchase items. Our investigation concentrates on the performance of machine learning based models in terms of linear-discriminant-analysis (LDA) and non-LDA (logic based) approaches. In the LDA group, naïve Bayesian, k-nearest neighbor, and SVM were used; in the non-LDA group, decision tree, random forest, and boost were used. We describe the details of feature vector construction and the selection processes for the model and the parameters, including training and validation. We also present experimental results comparing training and classification error levels for the models employed.

An Experimental Study on Feature Selection Using Wikipedia for Text Categorization (위키피디아를 이용한 분류자질 선정에 관한 연구)

  • Kim, Yong-Hwan;Chung, Young-Mee
    • Journal of the Korean Society for information Management / v.29 no.2 / pp.155-171 / 2012
  • In text categorization, core terms of an input document are hardly selected as classification features if they do not occur in a training document set. Besides, synonymous terms with the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by replacing input terms not in the training document set with the most similar term occurring in training documents using Wikipedia. For the selection of classification features, experiments were performed in various settings composed of three different conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measures. The categorization performance of a kNN classifier was improved by 0.35~1.85% in $F_1$ value in all the experimental settings when non-learning terms were replaced by the learning term with the highest similarity above the threshold value. Although the improvement ratio is not as high as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
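The term-replacement step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the Wikipedia-based similarity measure is replaced here by a toy lookup table, terms with no sufficiently similar training term are simply kept unchanged (an assumption, since the abstract does not say what happens to them), and `replace_unseen_terms` is a hypothetical name.

```python
def replace_unseen_terms(terms, train_vocab, similarity, threshold=0.5):
    """Map input terms missing from the training vocabulary to similar training terms.

    similarity(a, b) is any term-term similarity function returning a float.
    A term is replaced only if its best match scores above the threshold.
    """
    candidates = sorted(train_vocab)  # fixed order keeps tie-breaking deterministic
    result = []
    for term in terms:
        if term in train_vocab or not candidates:
            result.append(term)
            continue
        best = max(candidates, key=lambda v: similarity(term, v))
        # Assumption: terms without a good enough match are left as they are.
        result.append(best if similarity(term, best) > threshold else term)
    return result


# Toy similarity table standing in for a Wikipedia-derived measure.
toy_scores = {("automobile", "car"): 0.9, ("automobile", "engine"): 0.4}


def toy_similarity(a, b):
    return toy_scores.get((a, b), 0.0)


print(replace_unseen_terms(["automobile", "engine", "zebra"],
                           {"car", "engine"}, toy_similarity))
```

With the toy table, "automobile" maps to "car", "engine" is already a training term, and "zebra" stays as it is because nothing scores above the threshold.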

Document Summarization via Convex-Concave Programming

  • Kim, Minyoung
    • International Journal of Fuzzy Logic and Intelligent Systems / v.16 no.4 / pp.293-298 / 2016
  • Document summarization is an important task in various areas, where the goal is to select a few of the most descriptive sentences from a given document as a succinct summary. Even without training data of human-labeled summaries, there have been several interesting existing works in the literature that yield reasonable performance. In this paper, within the same unsupervised learning setup, we propose a more principled learning framework for the document summarization task. Specifically, we formulate an optimization problem that expresses the requirements of both faithful preservation of the document contents and the summary length constraint. We circumvent the difficult integer programming that originates from binary sentence selection via continuous relaxation and low entropy penalization. We also suggest an efficient convex-concave optimization solver that is guaranteed to improve the original objective at every iteration. For several document datasets, we demonstrate that the proposed learning algorithm significantly outperforms the existing approaches.

Semantic Document-Retrieval Based on Markov Logic (마코프 논리 기반의 시맨틱 문서 검색)

  • Hwang, Kyu-Baek;Bong, Seong-Yong;Ku, Hyeon-Seo;Paek, Eun-Ok
    • Journal of KIISE:Computing Practices and Letters / v.16 no.6 / pp.663-667 / 2010
  • A simple approach to semantic document-retrieval is to measure document similarity based on the bag-of-words representation, e.g., cosine similarity between two document vectors. However, such a syntactic method hardly considers the semantic similarity between documents, often producing semantically-unsound search results. We circumvent such a problem by combining supervised machine learning techniques with ontology information based on Markov logic. Specifically, Markov logic networks are learned from similarity-tagged documents with an ontology representing the diverse relationship among words. The learned Markov logic networks, the ontology, and the training documents are applied to the semantic document-retrieval task by inferring similarities between a query document and the training documents. Through experimental evaluation on real world question-answering data, the proposed method has been shown to outperform the simple cosine similarity-based approach in terms of retrieval accuracy.
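The baseline mentioned in this abstract, cosine similarity between bag-of-words vectors, can be sketched as below. This illustrates only the baseline, not the Markov logic method; whitespace tokenization and raw term counts are simplifying assumptions, and `cosine_similarity` is a hypothetical helper name.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts under a bag-of-words representation."""
    # Assumption: lowercase whitespace tokens with raw counts as weights.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("car engine repair", "car engine"))
print(cosine_similarity("car engine", "automobile motor"))
```

The second call scores zero because the two texts share no words, even though they are close in meaning, which is the kind of semantically unsound result the abstract attributes to purely word-based matching.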