• Title/Summary/Keyword: Document structure detection

Search Result 16, Processing Time 0.028 seconds

Forgery Detection Mechanism with Abnormal Structure Analysis on Office Open XML based MS-Word File

  • Lee, HanSeong;Lee, Hyung-Woo
    • International journal of advanced smart convergence
    • /
    • v.8 no.4
    • /
    • pp.47-57
    • /
    • 2019
  • We examine the weaknesses of the existing OOXML-based MS-Word file structure, and analyze how data concealment and forgery are performed in MS-Word digital documents. In case of forgery by including hidden information in MS-Word digital document, there is no difference in opening the file with the MS-Word Processor. However, the computer system may be malfunctioned by malware or shell code hidden in the digital document. If a malicious image file or ZIP file is hidden in the document by using the structural vulnerability of the MS-Word document, it may be infected by ransomware that encrypts the entire file on the disk even if the MS-Word file is normally executed. Therefore, it is necessary to analyze forgery and alteration of digital document through internal structure analysis of MS-Word file. In this paper, we designed and implemented a mechanism to detect this efficiently and automatic detection software, and presented a method to proactively respond to attacks such as ransomware exploiting MS-Word security vulnerabilities.

Detection of Malicious PDF based on Document Structure Features and Stream Objects

  • Kang, Ah Reum;Jeong, Young-Seob;Kim, Se Lyeong;Kim, Jonghyun;Woo, Jiyoung;Choi, Sunoh
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.11
    • /
    • pp.85-93
    • /
    • 2018
  • In recent years, there has been an increasing number of ways to distribute document-based malicious code using vulnerabilities in document files. Because document type malware is not an executable file itself, it is easy to bypass existing security programs, so research on a model to detect it is necessary. In this study, we extract main features from the document structure and the JavaScript contained in the stream object In addition, when JavaScript is inserted, keywords with high occurrence frequency in malicious code such as function name, reserved word and the readable string in the script are extracted. Then, we generate a machine learning model that can distinguish between normal and malicious. In order to make it difficult to bypass, we try to achieve good performance in a black box type algorithm. For an experiment, a large amount of documents compared to previous studies is analyzed. Experimental results show 98.9% detection rate from three different type algorithms. SVM, which is a black box type algorithm and makes obfuscation difficult, shows much higher performance than in previous studies.

Structure Recognition Method of Invoice Document Image for Document Processing Automation (문서 처리 자동화를 위한 인보이스 이미지의 구조 인식 방법)

  • Dong-seok Lee;Soon-kak Kwon
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.28 no.2
    • /
    • pp.11-19
    • /
    • 2023
  • In this paper, we propose the methods of invoice document structure recognition and of making a spreadsheet electronic document. The texts and block location information of word blocks are recognized by an optical character recognition engine through deep learning. The word blocks on the same row and same column are found through their coordinates. The document area is divided through arrangement information of the word blocks. The character recognition result is inputted in the spreadsheet based on the document structure. In simulation result, the item placement through the proposed method shows an average accuracy of 92.30%.

Structure Recognition Method in Various Table Types for Document Processing Automation (문서 처리 자동화를 위한 다양한 표 유형에서 표 구조 인식 방법)

  • Lee, Dong-Seok;Kwon, Soon-Kak
    • Journal of Korea Multimedia Society
    • /
    • v.25 no.5
    • /
    • pp.695-702
    • /
    • 2022
  • In this paper, we propose the method of a table structure recognition in various table types for document processing automation. A table with items surrounded by ruled lines are analyzed by detecting horizontal and vertical lines for recognizing the table structure. In case of a table with items separated by spaces, the table structure are recognized by analyzing the arrangement of row items. After recognizing the table structure, the areas of the table items are input into OCR engine and the character recognition result output to a text file in a structured format such as CSV or JSON. In simulation results, the average accuracy of table item recognition is about 94%.

MS Office Malicious Document Detection Based on CNN (CNN 기반 MS Office 악성 문서 탐지)

  • Park, Hyun-su;Kang, Ah Reum
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.32 no.2
    • /
    • pp.439-446
    • /
    • 2022
  • Document-type malicious codes are being actively distributed using attachments on websites or e-mails. Document-type malicious code is relatively easy to bypass security programs because the executable file is not executed directly. Therefore, document-type malicious code should be detected and prevented in advance. To detect document-type malicious code, we identified the document structure and selected keywords suspected of being malicious. We then created a dataset by converting the stream data in the document to ASCII code values. We specified the location of malicious keywords in the document stream data, and classified the stream as malicious by recognizing the adjacent information of the malicious keywords. As a result of detecting malicious codes by applying the CNN model, we derived accuracies of 0.97 and 0.92 in stream units and file units, respectively.

Line Edge-Based Type-Specific Corner Points Extraction for the Analysis of Table Form Document Structure (표 서식 문서의 구조 분석을 위한 선분 에지 기반의 유형별 꼭짓점 검출)

  • Jung, Jae-young
    • Journal of Digital Contents Society
    • /
    • v.15 no.2
    • /
    • pp.209-217
    • /
    • 2014
  • It is very important to classify a lot of table-form documents into the same type of classes or to extract information filled in the template automatically. For these, it is necessary to accurately analyze table-form structure. This paper proposes an algorithm to extract corner points based on line edge segments and to classify the type of junction from table-form images. The algorithm preprocesses image through binarization, skew correction, deletion of isolated small area of black color because that they are probably generated by noises.. And then, it processes detections of edge block, line edges from a edge block, corner points. The extracted corner points are classified as 9 types of junction based on the combination of horizontal/vertical line edge segments in a block. The proposed method is applied to the several unconstraint document images such as tax form, transaction receipt, ordinary document containing tables, etc. The experimental results show that the performance of point detection is over 99%. Considering that almost corner points make a correspondence pair in the table, the information of type of corner and width of line may be useful to analyse the structure of table-form document.

An effective detection method for hiding data in compound-document files (복합문서 파일에 은닉된 데이터 탐지 기법에 대한 연구)

  • Kim, EunKwang;Jeon, SangJun;Han, JaeHyeok;Lee, MinWook;Lee, Sangjin
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.25 no.6
    • /
    • pp.1485-1494
    • /
    • 2015
  • Traditionally, data hiding has been done mainly in such a way that insert the data into the large-capacity multimedia files. However, the document files of the previous versions of Microsoft Office 2003 have been used as cover files as their structure are so similar to a File System that it is easy to hide data in them. If you open a compound-document file which has a secret message hidden in it with MS Office application, it is hard for users who don't know whether a secret message is hidden in the compound-document file to detect the secret message. This paper presents an analysis of Compound-File Binary Format features exploited in order to hide data and algorithms to detect the data hidden with these exploits. Studying methods used to hide data in unused area, unallocated area, reserved area and inserted streams led us to develop an algorithm to aid in the detection and examination of hidden data.

Digital Forensics of Microsoft Office 2007-2013 Documents to Prevent Covert Communication

  • Fu, Zhangjie;Sun, Xingming;Xi, Jie
    • Journal of Communications and Networks
    • /
    • v.17 no.5
    • /
    • pp.525-533
    • /
    • 2015
  • MS Office suit software is the most widely used electronic documents by a large number of users in the world, which has absolute predominance in office software market. MS Office 2007-2013 documents, which use new office open extensible markup language (OOXML) format, could be illegally used as cover mediums to transmit secret information by offenders, because they do not easily arouse others suspicion. This paper proposes nine forensic methods and an integrated forensic tool for OOXML format documents on the basis of researching the potential information hiding methods. The proposed forensic methods and tool cover three categories; document structure, document content, and document format. The aim is to prevent covert communication and provide security detection technology for electronic documents downloaded by users. The proposed methods can prevent the damage of secret information embedded by offenders. Extensive experiments based on real data set demonstrate the effectiveness of the proposed methods.

Design and Analysis of the Web Stegodata Detection Systems using the Intrusion Detection Systems (침입탐지 시스템을 이용한 웹 스테고데이터 검출 시스템 설계 및 분석)

  • Do, Kyoung-Hwa;Jun, Moon-Seog
    • The KIPS Transactions:PartC
    • /
    • v.11C no.1
    • /
    • pp.39-46
    • /
    • 2004
  • It has been happening to transfer not only the general information but also the valuable information through the universal Internet. So security accidents as the expose of secret data and document increase. But we don't have stable structure for transmitting important data. Accordingly, in this paper we intend to use network based Intrusion Detection System modules and detect the extrusion of important data through the network, and propose and design the method for investigating concealment data to protect important data and investigate the secret document against the terrorism. We analyze the method for investigating concealment data, especially we use existing steganalysis techniques, so we propose and design the module emphasizing on the method for investigating stego-data in E-mail of attach files or Web-data of JPG, WAVE etc. Besides, we analyze the outcome through the experiment of the proposed stego-data detection system.

A Study on Effective Internet Data Extraction through Layout Detection

  • Sun Bok-Keun;Han Kwang-Rok
    • International Journal of Contents
    • /
    • v.1 no.2
    • /
    • pp.5-9
    • /
    • 2005
  • Currently most Internet documents including data are made based on predefined templates, but templates are usually formed only for main data and are not helpful for information retrieval against indexes, advertisements, header data etc. Templates in such forms are not appropriate when Internet documents are used as data for information retrieval. In order to process Internet documents in various areas of information retrieval, it is necessary to detect additional information such as advertisements and page indexes. Thus this study proposes a method of detecting the layout of Web pages by identifying the characteristics and structure of block tags that affect the layout of Web pages and calculating distances between Web pages. This method is purposed to reduce the cost of Web document automatic processing and improve processing efficiency by providing information about the structure of Web pages using templates through applying the method to information retrieval such as data extraction.

  • PDF