http://dx.doi.org/10.9728/dcs.2018.19.7.1357

Similarity Analysis of Programs through Linear Regression of Code Distribution  

Lim, Hyun-il (Department of Computer Engineering, Kyungnam University)
Publication Information
Journal of Digital Contents Society / v.19, no.7, 2018, pp. 1357-1363
Abstract
As information technology advances, machine learning is being applied in a growing range of applications and fields. In this paper, we propose a software analysis method that applies linear regression to the code distribution of programs in order to analyse their similarity. Because the characteristics of a program can be expressed by the instructions it contains, the distribution of those instructions is used as the learning data. A learning procedure over this data then generates a linear regression model for software similarity analysis. The proposed method is evaluated on real-world Java applications. It is expected to serve as a basic technique for determining software similarity, and to be applicable to other software analysis techniques that build on machine learning.
Keywords
Code analysis; Code distribution; Linear regression; Machine learning; Similarity analysis;
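
The abstract does not state how instruction distributions are turned into regression features, so the following is only a minimal sketch under stated assumptions: each program is a hypothetical list of JVM instruction mnemonics, a pair of programs is reduced to a single feature (half the L1 distance between their instruction distributions), and a one-variable least-squares line maps that feature to a similarity label. The opcode vocabulary, the toy programs, the 1.0/0.0 labels, and all function names are illustrative and are not taken from the paper.

```python
from collections import Counter

def code_distribution(instructions, vocabulary):
    """Relative frequency of each opcode in `vocabulary` for one program."""
    counts = Counter(instructions)
    total = len(instructions)
    return [counts[op] / total if total else 0.0 for op in vocabulary]

def distribution_distance(p, q):
    """Half the L1 distance between two distributions (a value in [0, 1])."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

# Hypothetical opcode vocabulary and toy programs (assumptions for illustration).
vocabulary = ["iload", "istore", "iadd", "goto", "invokevirtual", "return"]
prog_a = ["iload", "iload", "iadd", "istore", "return"]
prog_a_variant = ["iload", "iload", "iadd", "istore", "goto", "return"]
prog_b = ["invokevirtual", "invokevirtual", "goto", "goto", "return"]
prog_b_variant = ["invokevirtual", "goto", "invokevirtual", "goto", "goto", "return"]

# Labelled pairs: 1.0 = treated as similar, 0.0 = treated as dissimilar.
training_pairs = [
    (prog_a, prog_a_variant, 1.0),
    (prog_b, prog_b_variant, 1.0),
    (prog_a, prog_b, 0.0),
    (prog_a_variant, prog_b_variant, 0.0),
]

def pair_feature(first, second):
    """Distance between the instruction distributions of two programs."""
    return distribution_distance(
        code_distribution(first, vocabulary),
        code_distribution(second, vocabulary),
    )

xs = [pair_feature(first, second) for first, second, _ in training_pairs]
ys = [label for _, _, label in training_pairs]
w, b = fit_line(xs, ys)

def predict_similarity(first, second):
    """Similarity score predicted by the fitted line for a pair of programs."""
    return w * pair_feature(first, second) + b

print("slope:", w, "intercept:", b)
print("a vs a_variant:", predict_similarity(prog_a, prog_a_variant))
print("a vs b:", predict_similarity(prog_a, prog_b))
```

With these toy labels the fitted slope is negative, so a larger distance between distributions yields a lower predicted similarity. A fuller model would presumably regress on many per-instruction features rather than one aggregate distance, which would call for a multivariate least-squares solver.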