http://dx.doi.org/10.13089/JKIISC.2015.25.3.595

A Study on Preprocessing Method for Effective Semantic-based Similarity Measures using Approximate Matching Algorithm  

Kang, Hari (Graduate School of Information Security, Korea University)
Jeong, Doowon (Graduate School of Information Security, Korea University)
Lee, Sangjin (Graduate School of Information Security, Korea University)
Abstract
One of the challenges of digital forensics is how to handle large amounts of data efficiently. Although various reliable approximate matching algorithms have been proposed to quickly identify similarities between digital objects, their practical effectiveness in identifying semantic similarity is low because of frequent false positives. To solve this problem, we propose adding a preprocessing step for the approximate matching target dataset, which increases matching accuracy while maintaining the reliability of the approximate matching algorithm. To verify its effectiveness, we ran experiments with sdhash on two datasets, one of eml files and one of hwp files, to identify semantic similarity.
Keywords
Approximate Matching; Semantic similarity; Digital forensics
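
The abstract does not spell out the preprocessing steps, so the following is only a minimal sketch of one plausible reading: bytes that every file of a format shares (here, eml headers) are stripped before matching, so that the score reflects the message content rather than the container. The helper names `extract_body` and `ngram_jaccard` and the sample messages are hypothetical, and the byte n-gram Jaccard index is a simple stand-in for a bytewise similarity score, not sdhash itself.

```python
from email import message_from_bytes
from email.policy import default


def extract_body(raw_eml: bytes) -> bytes:
    """Assumed preprocessing: keep only the text/plain body, drop all headers."""
    msg = message_from_bytes(raw_eml, policy=default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return b""
    return part.get_content().strip().encode("utf-8")


def ngram_jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Stand-in bytewise similarity (NOT sdhash): Jaccard index over byte n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two made-up messages with identical headers but unrelated content.
HEADERS = (
    b"Received: from mail.example.com by mx.example.com; Mon, 1 Jan 2001 00:00:00 +0000\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
)
mail_a = HEADERS + b"Subject: A\r\n\r\nQuarterly revenue figures are attached for review.\r\n"
mail_b = HEADERS + b"Subject: B\r\n\r\nLunch on Friday? The new place downtown looks good.\r\n"

# Score the raw files, then the preprocessed bodies.
raw_score = ngram_jaccard(mail_a, mail_b)
body_score = ngram_jaccard(extract_body(mail_a), extract_body(mail_b))
print(f"raw: {raw_score:.2f}  body only: {body_score:.2f}")
```

In this toy case the raw files score higher than the bodies alone because the shared header bytes dominate the comparison, which is the kind of semantically meaningless match the abstract describes as a false positive. A real pipeline would presumably feed the preprocessed output to sdhash rather than to this stand-in measure.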