Title/Summary/Keyword: Document Model

Design of Web Content Model (웹 컨텐트 저장소)

  • Abbass, Onytra; Koo, Heung-Seo
    • Proceedings of the Korea Information Processing Society Conference, 2002.11c, pp.1915-1918, 2002
  • Managing semistructured data requires fine granularity, such as markup elements. XML has a major effect on managing web content: it enables content reusability, enriches information with metadata, and ensures valid document links. We introduce our content model as an integrated framework that handles content objects as controllable units. The paper focuses on modeling a news site and how its content is classified according to site structure, aggregated content, and reusability. The model stores XML document instances in a relational database using a fragmentation strategy.
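
A minimal sketch of the fragmentation strategy described above, assuming each XML element becomes one row with a parent pointer in a relational table; the schema, table name, and sample document are illustrative, not taken from the paper.

```python
# Hypothetical element-level fragmentation: every XML element is stored as a
# row (id, parent_id, tag, text) so the tree can be queried or reassembled
# at markup-element granularity.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fragment (
    id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, text TEXT)""")

def store(elem, parent_id=None):
    """Recursively store an element and its children as fragments."""
    cur = conn.execute(
        "INSERT INTO fragment (parent_id, tag, text) VALUES (?, ?, ?)",
        (parent_id, elem.tag, (elem.text or "").strip()))
    for child in elem:
        store(child, cur.lastrowid)

doc = ET.fromstring(
    "<article><headline>Sample</headline><body>Story text</body></article>")
store(doc)
# Element-level reuse: fetch just the headline fragment.
print(conn.execute("SELECT text FROM fragment WHERE tag='headline'").fetchone())
```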

Latent Semantic Analysis Approach for Document Summarization Based on Word Embeddings

  • Al-Sabahi, Kamal; Zuping, Zhang; Kang, Yang
    • KSII Transactions on Internet and Information Systems (TIIS), v.13 no.1, pp.254-276, 2019
  • Since the amount of information on the internet is growing rapidly, it is not easy for a user to find information relevant to his or her query. To tackle this issue, researchers are paying much attention to document summarization. The key to any successful document summarizer is a good document representation. Traditional approaches based on word overlap mostly fail to produce that kind of representation. Word embeddings have shown good performance, allowing words to be matched on a semantic level. However, naively concatenating word embeddings makes common words dominant, which in turn diminishes the quality of the representation. In this paper, we employ word embeddings to improve the weighting schemes used to build the Latent Semantic Analysis input matrix. Two embedding-based weighting schemes are proposed and then combined to calculate the values of this matrix. They are modified versions of the augmented weight and the entropy frequency, combining the strengths of traditional weighting schemes with word embeddings. The proposed approach is evaluated on three English datasets: DUC 2002, DUC 2004, and Multilingual 2015 Single-document Summarization. Experimental results on the three datasets show that the proposed model achieves performance competitive with the state of the art, leading to the conclusion that it provides a better document representation and, as a result, a better document summary.
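
A rough sketch of the pipeline the abstract outlines, with toy word vectors standing in for trained embeddings: counts in the term-by-sentence matrix are scaled by an embedding-based weight, the matrix is decomposed with SVD, and one sentence is selected per leading latent topic. The weighting shown is a stand-in; the paper's modified augmented-weight and entropy-frequency schemes are not reproduced.

```python
# Embedding-weighted LSA summarization, simplified to a Gong & Liu-style
# sentence selection over a toy corpus.
import numpy as np

sentences = ["the cat sat on the mat",
             "dogs and cats are common pets",
             "stock prices rose sharply today"]
vocab = sorted({w for s in sentences for w in s.split()})
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=16) for w in vocab}   # toy word embeddings

# Term-by-sentence matrix: counts scaled by an embedding-based weight
# (here, each term's cosine similarity to its sentence centroid).
A = np.zeros((len(vocab), len(sentences)))
for j, s in enumerate(sentences):
    words = s.split()
    centroid = np.mean([emb[w] for w in words], axis=0)
    for w in words:
        sim = emb[w] @ centroid / (np.linalg.norm(emb[w]) * np.linalg.norm(centroid))
        A[vocab.index(w), j] += max(sim, 0.0)

# LSA: right singular vectors score sentences against latent topics; pick
# the highest-scoring sentence for each of the two leading topics.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
picks = sorted({int(np.argmax(np.abs(Vt[k]))) for k in range(2)})
print([sentences[i] for i in picks])
```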

MS Office Malicious Document Detection Based on CNN (CNN 기반 MS Office 악성 문서 탐지)

  • Park, Hyun-su; Kang, Ah Reum
    • Journal of the Korea Institute of Information Security & Cryptology, v.32 no.2, pp.439-446, 2022
  • Document-type malicious code is being actively distributed via attachments on websites and in e-mails. It can bypass security programs relatively easily because no executable file is run directly, so it should be detected and blocked in advance. To detect document-type malicious code, we identified the document structure and selected keywords suspected of being malicious. We then created a dataset by converting the stream data in each document to ASCII code values. We specified the locations of malicious keywords in the document stream data and classified a stream as malicious by recognizing the information adjacent to those keywords. Applying the CNN model, we obtained detection accuracies of 0.97 at the stream level and 0.92 at the file level.
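
A hedged sketch of the detection step, assuming PyTorch: document stream bytes are taken as ASCII code values, embedded, and passed through a small 1-D CNN that scores the stream as benign or malicious. The layer sizes, sample stream, and architecture are illustrative; the paper's keyword-localization step is not reproduced.

```python
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(256, 16)              # one slot per byte value
        self.conv = nn.Conv1d(16, 32, kernel_size=5, padding=2)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                               # x: (batch, seq) byte ids
        h = self.embed(x).transpose(1, 2)               # -> (batch, 16, seq)
        h = torch.relu(self.conv(h)).max(dim=2).values  # global max pooling
        return self.head(h)

stream = b"\xd0\xcf\x11\xe0CreateObject"                # toy document stream
ids = torch.tensor([list(stream.ljust(256, b"\x00"))])  # ASCII codes, padded
print(StreamCNN()(ids).softmax(dim=1))                  # [P(benign), P(malicious)]
```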

A Study on the Model Development of Digital Document Information Center for the National Science, Technology and Industry Information (국가과학기술산업 디지털문헌정보센터 모델 정립에 관한 연구)

  • Kim, Sung-Hyuk; Lee, Hye-Jin
    • Journal of Information Management, v.33 no.1, pp.18-30, 2002
  • This study proposes a model for the Digital Document Information Center at KISTI, a representative center for science, technology, and industry information in Korea. The proposed model reflects the current information environment, including digitalization and the Internet, as well as various internal and external circumstances. The study also proposes medium- and long-range plans for the center.

Purchase Information Extraction Model From Scanned Invoice Document Image By Classification Of Invoice Table Header Texts (인보이스 서류 영상의 테이블 헤더 문자 분류를 통한 구매 정보 추출 모델)

  • Shin, Hyunkyung
    • Journal of Digital Convergence, v.10 no.11, pp.383-387, 2012
  • Development of an automated document management system for scanned invoice images faces rigorous accuracy requirements for the extraction of monetary data, which necessitate automatic validation of the extracted values against a generative invoice table model. Using internal constraints such as "amount = unit price times quantity" is a typical implementation. In this paper, we propose a novel invoice information extraction model with an improved auto-validation method based on table header detection and column classification.
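
A minimal sketch of the internal-constraint validation mentioned above: an extracted line item is accepted only when amount = unit price x quantity within a small tolerance for OCR rounding noise. The field names are assumptions about the extraction output, not the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    unit_price: float
    quantity: float
    amount: float

def validate(item: LineItem, tol: float = 0.01) -> bool:
    """Check the arithmetic constraint on one extracted table row."""
    return abs(item.unit_price * item.quantity - item.amount) <= tol

rows = [LineItem("bolt M6", 0.35, 100, 35.00),
        LineItem("washer", 0.10, 50, 6.00)]    # second row fails the constraint
print([validate(r) for r in rows])              # [True, False]
```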

CALS System Development Methodology Using Document Trace Diagram and IDEF Model (Document Trace Diagram 과 IDEF 모델을 이용한 CALS 시스템 개발 방법론)

  • Kim, Soung-Hie; Cho, Sung-Sik; Lee, Jae-Kwang; Han, Chang-Hee; Yoon, Young-Suk
    • Asia Pacific Journal of Information Systems, v.8 no.3, pp.37-49, 1998
  • The basic goal of CALS is to improve transactions and relationships among organizations through information sharing and integration. CALS is an information strategy that needs strong cooperation between organizations, or between users and developers, in the design step. However, current design methodologies using IDEF models, which are considered the standard for CALS system development, have some limitations. For example, it is difficult for system developers to communicate with their counterparts through an IDEF model, since IDEF models are hard for those counterparts to understand. In this paper, we suggest a development methodology for CALS systems that complements the IDEF model with the Document Trace Diagram, which we developed as a communication tool. The concept of the Document Trace Diagram stems from the fact that most information exchanged within or between organizations takes the form of documents, and most standard operating procedures of organizations concern processing those documents. It helps system developers identify functions and their ICOMs (Input, Control, Output, Mechanism) with ease and little communication cost. With this methodology, we constructed a CALS prototype system for the construction industry.
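
A hedged sketch of how a Document Trace Diagram could feed IDEF0 modeling, assuming a simple record per traced document hop: the documents traced through one activity are collected into its ICOM slots. The data structures and sample traces are illustrative, not the paper's notation.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentTrace:
    document: str
    sender: str
    receiver: str
    role: str            # one of: input, control, output, mechanism

@dataclass
class Idef0Function:
    name: str
    icom: dict = field(default_factory=lambda: {
        "input": [], "control": [], "output": [], "mechanism": []})

def derive_function(name: str, traces: list[DocumentTrace]) -> Idef0Function:
    """Collect the documents traced through one activity into ICOM slots."""
    fn = Idef0Function(name)
    for t in traces:
        fn.icom[t.role].append(t.document)
    return fn

traces = [DocumentTrace("purchase order", "buyer", "contractor", "input"),
          DocumentTrace("building code", "regulator", "contractor", "control"),
          DocumentTrace("progress report", "contractor", "buyer", "output")]
print(derive_function("process construction order", traces).icom)
```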

The Utilization of Local Document Information to Improve Statistical Context-Sensitive Spelling Error Correction (통계적 문맥의존 철자오류 교정 기법의 향상을 위한 지역적 문서 정보의 활용)

  • Lee, Jung-Hun; Kim, Minho; Kwon, Hyuk-Chul
    • KIISE Transactions on Computing Practices, v.23 no.7, pp.446-451, 2017
  • The statistical context-sensitive spelling correction technique in this thesis is based on Shannon's noisy channel model. Interpolation is used to improve the correction method proposed in the paper; the general interpolation method fills in intermediate probability values using (N-1)-gram and (N-2)-gram statistics drawn from the same corpus. In the proposed method, interpolation is instead performed using frequency information from both the statistical corpus and the document being corrected. Using frequencies from the correction document has two advantages. First, a probability can be obtained for a coined word that exists only in the correction document. Second, even when two correction candidates have ambiguous probability values, the ambiguity is resolved by referring to the correction document. The proposed method showed better precision and recall than the existing correction model.
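
A minimal sketch of the interpolation idea, with illustrative counts and weight: a candidate's score mixes its relative frequency in the global corpus with its relative frequency in the document being corrected, so locally coined or locally dominant words still receive probability mass.

```python
from collections import Counter

corpus_counts = Counter({"from": 320, "form": 300})   # toy global statistics
local_counts = Counter("the form below is the form you need".split())

def interpolated_prob(word, lam=0.7):
    """Mix corpus and local relative frequencies for a candidate word."""
    p_corpus = corpus_counts[word] / max(sum(corpus_counts.values()), 1)
    p_local = local_counts[word] / max(sum(local_counts.values()), 1)
    return lam * p_corpus + (1 - lam) * p_local

# With near-tied corpus counts, the local document, which uses "form"
# repeatedly, breaks the tie toward "form".
for cand in ("from", "form"):
    print(cand, round(interpolated_prob(cand), 4))
```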

Speed-up of Document Image Binarization Method Based on Water Flow Model (Water flow model에 기반한 문서영상 이진화 방법의 속도 개선)

  • 오현화; 김도훈; 이재용; 김두식; 임길택; 진성일
    • Journal of the Institute of Electronics Engineers of Korea SP, v.41 no.4, pp.75-86, 2004
  • This paper proposes a method to speed up document image binarization based on a water flow model. The proposed method extracts the region of interest (ROI) around characters from a document image and restricts the pouring of water onto the 3-dimensional terrain surface of the image to the ROI. The amount of water to be filled into a local valley is determined automatically from its depth and slope. The proposed method accumulates weighted water not only at the locally lowest position but also at its neighbors, so a valley is filled sufficiently with only one pass of pouring water onto the terrain surface of the ROI. Finally, the depth of each pond is adaptively thresholded for robust character segmentation, because the depth of a pond formed at a valley varies widely with the gray-level difference between characters and background. In experiments on real document images, the proposed method attained good binarization performance with remarkably reduced processing time compared with the existing method based on a water flow model.
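
A loose sketch of the pond-depth idea, assuming SciPy: the grayscale page is treated as terrain, the water-filled surface is approximated with a grayscale closing, and pixels where the pond depth is large are marked as character. This approximation stands in for the paper's single-pass weighted filling and ROI restriction, which are not reproduced.

```python
import numpy as np
from scipy.ndimage import grey_closing

page = np.full((64, 64), 200, dtype=float)      # bright background terrain
page[20:28, 10:50] = 60                          # a dark text stroke (valley)

filled = grey_closing(page, size=(15, 15))       # approximate water surface
depth = filled - page                            # pond depth at each pixel
binary = depth > 0.5 * depth.max()               # adaptive depth threshold
print(binary.sum(), "pixels marked as character")
```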

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec (Doc2Vec과 Word2Vec을 활용한 Convolutional Neural Network 기반 한국어 신문 기사 분류)

  • Kim, Dowoo; Koo, Myoung-Wan
    • Journal of KIISE, v.44 no.7, pp.742-747, 2017
  • In this paper, we propose a novel approach that improves the performance of a Convolutional Neural Network (CNN) word-embedding model built on word2vec by making it perform like doc2vec in a document classification task. The Word Piece Model (WPM) is empirically shown to outperform other tokenization methods, such as phrase units and a part-of-speech tagger, with substantial experimental evidence (classification rate: 79.5%). We then conducted an experiment classifying ten categories of Korean news articles by feeding word and document vectors generated with WPM to the baseline and the proposed model. The proposed model showed a higher classification rate (89.88%) than its counterpart (86.89%), a 22.80% relative reduction in classification error. This demonstrates that applying doc2vec to the document classification task yields more effective results, because doc2vec generates similar document vector representations for documents belonging to the same category.
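
A hedged sketch of the input construction, assuming the gensim 4.x API: word2vec supplies per-token vectors, doc2vec supplies one vector per article, and the two are stacked into the matrix a CNN would consume. The toy corpus, vector sizes, and padding scheme are illustrative; the CNN itself and WPM tokenization are omitted.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

docs = [["economy", "stocks", "rose"], ["team", "won", "the", "match"]]
w2v = Word2Vec(docs, vector_size=32, min_count=1, seed=0)
d2v = Doc2Vec([TaggedDocument(d, [i]) for i, d in enumerate(docs)],
              vector_size=32, min_count=1, seed=0)

def cnn_input(doc_id, tokens, max_len=8):
    """Rows: one doc2vec vector, then word2vec vectors padded to max_len."""
    rows = [d2v.dv[doc_id]] + [w2v.wv[t] for t in tokens]
    rows += [np.zeros(32)] * (max_len + 1 - len(rows))
    return np.stack(rows)                      # shape: (max_len + 1, 32)

print(cnn_input(0, docs[0]).shape)
```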

Document classification using a deep neural network in text mining (텍스트 마이닝에서 심층 신경망을 이용한 문서 분류)

  • Lee, Bo-Hui; Lee, Su-Jin; Choi, Yong-Seok
    • The Korean Journal of Applied Statistics, v.33 no.5, pp.615-625, 2020
  • In text mining, a document-term frequency matrix is constructed from terms extracted from documents whose group labels are known. In this study, we generated a document-term frequency matrix for classifying documents by research field. We applied the traditional weighting function term frequency-inverse document frequency (TF-IDF) to the generated matrix, as well as term frequency-inverse gravity moment (TF-IGM). We also generated a document-keyword weighted matrix by extracting keywords to improve classification accuracy. Based on the extracted keyword matrix, we classified documents using a deep neural network. To find the optimal network, classification accuracy was verified while varying the number of hidden layers and hidden nodes. The model with eight hidden layers showed the highest accuracy, and the TF-IGM classification accuracies (across parameter changes) were all higher than those of TF-IDF. The deep neural network was also confirmed to have better accuracy than a support vector machine. We therefore propose applying TF-IGM together with a deep neural network for document classification.
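
A sketch of TF-IGM weighting as commonly defined in the literature (the study's exact variant may differ): a term's class-level frequencies are sorted in descending order, the inverse gravity moment rewards concentration in one class, and the resulting factor scales the raw term frequency. The lambda value of 7.0 is the usual default, assumed here.

```python
import numpy as np

def tf_igm(tf, class_freqs, lam=7.0):
    """tf: term frequency in one document; class_freqs: the term's
    frequency in each class. Returns the TF-IGM weight."""
    f = np.sort(np.asarray(class_freqs, dtype=float))[::-1]
    ranks = np.arange(1, len(f) + 1)
    igm = f[0] / max(np.sum(f * ranks), 1e-12)  # inverse gravity moment
    return tf * (1.0 + lam * igm)

# A term concentrated in one class gets a larger weight than one spread
# evenly across classes, even at the same raw frequency.
print(tf_igm(3, [90, 5, 5]))    # concentrated -> high weight
print(tf_igm(3, [34, 33, 33]))  # dispersed    -> lower weight
```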