• Title/Summary/Keyword: document structure


Detection of Malicious PDF based on Document Structure Features and Stream Objects

  • Kang, Ah Reum;Jeong, Young-Seob;Kim, Se Lyeong;Kim, Jonghyun;Woo, Jiyoung;Choi, Sunoh
    • Journal of the Korea Society of Computer and Information / v.23 no.11 / pp.85-93 / 2018
  • In recent years, an increasing number of attacks have distributed document-based malicious code by exploiting vulnerabilities in document files. Because document-type malware is not itself an executable file, it can easily bypass existing security programs, so research on a model to detect it is necessary. In this study, we extract main features from the document structure and from the JavaScript contained in the stream object. In addition, when JavaScript is inserted, we extract keywords that occur frequently in malicious code, such as function names, reserved words, and readable strings in the script. We then build a machine learning model that distinguishes normal documents from malicious ones. To make the model difficult to bypass, we aim for good performance with a black-box algorithm. The experiment analyzes a larger number of documents than previous studies. Experimental results show a 98.9% detection rate from three different types of algorithms. SVM, a black-box algorithm that makes obfuscation difficult, performs much better than in previous studies.
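The keyword-frequency features mentioned in this abstract could be computed along the following lines. This is only a minimal sketch: the keyword list and the function name `keyword_features` are hypothetical, since the abstract does not say which keywords or tokenization the authors used.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical keywords; the abstract does not list the ones actually used.
KEYWORDS = ["eval", "unescape", "fromCharCode"]


def keyword_features(script: str, keywords: List[str]) -> Dict[str, int]:
    """Count how often each keyword appears as an identifier in the script."""
    # Split the script into identifier-like tokens and tally them.
    tokens = Counter(re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", script))
    return {kw: tokens[kw] for kw in keywords}
```

The resulting counts could be combined with structural features of the document to form the input vector of a classifier such as an SVM.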

Document Clustering Method using PCA and Fuzzy Association (주성분 분석과 퍼지 연관을 이용한 문서군집 방법)

  • Park, Sun;An, Dong-Un
    • The KIPS Transactions:PartB / v.17B no.2 / pp.177-182 / 2010
  • This paper proposes a new document clustering method using PCA and fuzzy association. The proposed method represents the inherent structure of document clusters better because it selects the cluster labels and the terms representing each cluster using semantic features based on PCA. It also improves the quality of document clustering because the fuzzy association values distinguish dissimilar documents within clusters well. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.

A Study on the Performance of Structured Document Retrieval Using Node Information (노드정보를 이용한 문서검색의 성능에 관한 연구)

  • Yoon, So-Young
    • Journal of the Korean Society for Information Management / v.24 no.1 s.63 / pp.103-120 / 2007
  • A node is a semantic unit and a part of a structured document. Information retrieval from structured documents offers an opportunity to search for relevant information below the document level, making any element in a structured document a retrievable unit. Node-based document retrieval comprises several similarity calculation methods and the extended node retrieval method, which uses structure information. Retrieval performance is hardly influenced by the method used to determine document similarity. The extended node method outperformed the others overall.

A Method for Automatic Check of Omitted Design Item in Structural Calculation Document of Steel Box Bridges (강박스 교량을 대상으로 한 구조계산서의 누락된 설계항목 검토 자동화 방법론)

  • Park, Sang-Il;An, Hyun-Jung;Kim, Bong-Geun;Lee, Sang-Ho
    • Proceedings of the Computational Structural Engineering Institute Conference / 2007.04a / pp.813-818 / 2007
  • A method for automatically checking for omitted design items in the structural calculation documents of steel box bridges is proposed. Information processing for the proposed method is divided into two steps: automatic generation of the document structure in XML Schema Definition (XSD) format, and extraction of omitted design items using an XML Schema matching technique. An automatic omitted-element filter is developed on the basis of the proposed method, and the accuracy of the developed module is examined in a case study using existing structural calculation document samples.


XML Document Retrieval Models for Heterogeneous Data Set using Independent Regular paths (독립적인 질의 경로들을 사용하여 이질적인 문서들을 검색하는 XML 문서 검색 모델)

  • 유신재;민경섭;김형주
    • Journal of KIISE:Software and Applications / v.30 no.1_2 / pp.140-152 / 2003
  • An XML document has a structure that may be irregular, and it is difficult for end users to comprehend an irregular document structure exactly. For such XML documents, end users have difficulty using structured queries, so they formulate either unstructured queries or queries that carry only a little structure information. In this context, we propose new retrieval models that use the structure information for ranking and compensate for the difference between the user's query structure and the document structure. To ease querying, we assume independence among the query paths that represent structural constraints. Since this assumption reduces the expressive power of the query language, we also propose a model that overcomes this problem. As there had been no test collections for XML documents, we built a small test collection from the TIPSTER collection of TREC and ran experiments on it without structured queries. These experiments showed that our models improve average precision by about 67% over the conventional vector space model.

The XML Compression Algorithm Supporting Query Processing For Compressed Documents (압축된 문서에 대해 질의 처리를 지원하는 XML 압축 알고리즘)

  • 강영준;이석재;유재수
    • Proceedings of the Korea Contents Association Conference / 2003.11a / pp.195-203 / 2003
  • With the spread of the Internet, digitalization and the shift to knowledge-based information are in progress. In particular, numerous users create various works and use services on the web, and most of these works make use of XML. XML facilitates the reuse of documents because it separates content from style. It can also redefine the logical structure of a document to meet the developer's requirements. However, an XML document is much larger than a plain text document because it carries the document type and adds numerous tags to represent the structure of the document. To make use of the limited storage of devices such as palmtops and PDAs, it is necessary to compress and handle documents efficiently. Recently, compression techniques for efficiently handling and compressing XML documents have been under development to solve this problem, but the existing research does not support query processing on compressed documents. In this paper, we design and implement an XML compression algorithm that compresses an XML document and processes queries on the compressed XML document faster and more efficiently than previous techniques.


Design and Implementation of a XML Compression Algorithm Supporting Query Processing for Compressed Documents (압축된 문서에 대한 질의 처리를 지원하는 XML 압축 알고리즘의 설계 및 구현)

  • 이석재;강영준;유재수;조기형
    • The Journal of the Korea Contents Association / v.4 no.1 / pp.90-99 / 2004
  • With the spread of the Internet, digitalization and knowledge informatization are progressing rapidly. In particular, numerous users create various works and use services on the web, and most of these works make use of XML. XML facilitates the reuse of documents because it separates content from style. It can also redefine the logical structure of a document to meet the developer's requirements. However, an XML document is much larger than a plain text document because it carries the document type and adds numerous tags to represent the structure of the document. To make use of the limited storage of devices such as palmtops and PDAs, it is necessary to compress and handle documents efficiently. Recently, compression techniques for efficiently handling and compressing XML documents have been under development to solve this problem, but most existing research does not support query processing on compressed XML documents. In this paper, we design and implement an XML compression algorithm that compresses an XML document and processes queries on the compressed XML document faster and more efficiently than previous techniques.


A Ranking Technique of XML Documents using Path Similarity for Expanded Query Processing (확장된 질의 처리를 위해 경로간 의미적 유사도를 고려한 XML 문서 순위화 기법)

  • Kim, Hyun-Joo;Park, So-Mi;Park, Seog
    • Journal of KIISE:Databases / v.37 no.2 / pp.113-120 / 2010
  • XML is widely used for storing and processing data. XML is characterized by its structure, and users can query it with XPath when they need information from a data document. An XPath query can be processed only when the terms and structure of the document and the query match each other. Nowadays, however, many data documents are written with different terminology and structure, so users cannot know the exact form of the target data. In practice, a target data document is likely to contain the information the user is looking for, or something similar. Accordingly, a user query should still be processed when its term usage or structural characteristics differ slightly from those of the data document. To this end, we propose an XML document ranking method based on path similarity. The method measures the semantic similarity between a user query and a data document in three steps, using position, node, and relaxation factors.

A Study on PostScript-Converter for conversion XSL-FO into PostScript Format (XSL-FO 문서를 PostScript Format으로 변환하기 위한 PostScript-Converter에 관한 연구)

  • 유동석;김차종
    • Journal of the Korea Institute of Information and Communication Engineering / v.8 no.3 / pp.614-621 / 2004
  • At present, electronic documents are processed in WYSIWYG mode. For this purpose, a document is organized into a logical structure and a physical structure and is represented in a markup language. Since XML was announced, the application scope of electronic documents has extended from interchange to search. In terms of output quality, however, an XML document image in a browser is of lower quality than a general document image in desktop publishing, because the output function of a browser is not capable of high-quality printing. The W3C developed XSL-FO (XSL Formatting Objects) for style sheet formatting, and page description languages (PDLs) such as PostScript are already developed and widely used. In this paper, we design a PostScript-Converter that produces a high-quality document image by converting XSL-FO into PostScript format.

MS Office Malicious Document Detection Based on CNN (CNN 기반 MS Office 악성 문서 탐지)

  • Park, Hyun-su;Kang, Ah Reum
    • Journal of the Korea Institute of Information Security & Cryptology / v.32 no.2 / pp.439-446 / 2022
  • Document-type malicious code is being actively distributed through attachments on websites and in e-mails. Such code can bypass security programs relatively easily because no executable file is run directly, so it should be detected and blocked in advance. To detect document-type malicious code, we identified the document structure and selected keywords suspected of being malicious. We then created a dataset by converting the stream data in the document to ASCII code values. We located the malicious keywords in the document stream data and classified a stream as malicious by recognizing the information adjacent to those keywords. Detecting malicious code with the CNN model yielded accuracies of 0.97 and 0.92 at the stream level and the file level, respectively.
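The preprocessing described in this abstract (converting stream data to code values and looking at the data adjacent to suspicious keywords) might look roughly like the sketch below. It assumes each stream is available as raw bytes. The keyword list, the window radius, and the function names are hypothetical and are not taken from the paper.

```python
from typing import List

# Hypothetical keyword list; the abstract does not give the actual keywords.
SUSPICIOUS_KEYWORDS = [b"Shell", b"AutoOpen", b"CreateObject"]


def to_code_values(stream: bytes) -> List[int]:
    """Convert raw stream bytes to a list of integer code values (0-255)."""
    return list(stream)


def keyword_windows(stream: bytes, keywords: List[bytes], radius: int = 8) -> List[List[int]]:
    """Return the code values surrounding each keyword occurrence."""
    windows = []
    for kw in keywords:
        start = stream.find(kw)
        while start != -1:
            # Clamp the window to the stream boundaries.
            lo = max(0, start - radius)
            hi = min(len(stream), start + len(kw) + radius)
            windows.append(to_code_values(stream[lo:hi]))
            start = stream.find(kw, start + 1)
    return windows
```

Each window would then need to be padded or truncated to a fixed length before being passed to a CNN; that step is omitted here.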