[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.7472/jksii.2018.19.5.33

The Identification Framework for source code author using Authorship Analysis and CNN

Shin, Gun-Yoon (Department of Computer Engineering, Gachon University)
Kim, Dong-Wook (Department of Computer Engineering, Gachon University)
Hong, Sung-sam (Department of Computer Engineering, Gachon University)
Han, Myung-Mook (Department of Computer Engineering, Gachon University)

Publication Information

Journal of Internet Computing and Services, v.19, no.5, 2018, pp. 33-41

Abstract

As Internet technology has developed, a wide variety of programs are being created, and their code is being written by many different authors. In this environment, some authors pass off programs or code written by others as their own, reuse other writers' code indiscriminately, or fail to indicate exactly which code they have used. This makes it increasingly difficult to protect code. In this paper, we propose an author identification framework that uses Authorship Analysis theory and Natural Language Processing (NLP) based on a Convolutional Neural Network (CNN). We apply Authorship Analysis theory to extract features for author identification from source code, and combine them with features used in text mining to perform author identification with machine learning. In addition, we apply a CNN-based natural language processing method to source code to classify code authors. Identifying an author requires features that are specific to that author, and the CNN-based NLP method can be applied to a language with its own special structure, such as source code, to identify the author. The identification accuracy based on Authorship Analysis theory is 95.1%, and the identification accuracy with the CNN is 98%.
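The feature-combination idea in the abstract can be illustrated with a minimal sketch, assuming a handful of hypothetical style features and character 3-grams; the abstract does not list the paper's actual feature set. A nearest-profile cosine match stands in here for the machine learning classifier, and the CNN stage is not shown. All function names are illustrative.

```python
import math
from collections import Counter


def style_features(code: str) -> dict:
    """Hypothetical layout features; not the feature set used in the paper."""
    lines = code.splitlines() or [""]
    n = len(lines)
    return {
        "avg_line_len": sum(len(line) for line in lines) / n,
        "blank_ratio": sum(1 for line in lines if not line.strip()) / n,
        "comment_ratio": sum(
            1 for line in lines if line.strip().startswith(("#", "//"))
        ) / n,
        "tab_indent_ratio": sum(1 for line in lines if line.startswith("\t")) / n,
    }


def char_ngrams(code: str, n: int = 3) -> Counter:
    """Character n-gram counts, a common text-mining representation."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def featurize(code: str) -> dict:
    """Combine n-gram counts and style features into one sparse vector."""
    feats = {f"ng:{gram}": float(count) for gram, count in char_ngrams(code).items()}
    feats.update({f"st:{name}": value for name, value in style_features(code).items()})
    return feats


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(sample: str, profiles: dict) -> str:
    """Return the author whose profile vector is most similar to the sample."""
    feats = featurize(sample)
    return max(profiles, key=lambda author: cosine(feats, profiles[author]))


# Toy data: two invented authors with visibly different coding styles.
profiles = {
    "alice": featurize("def add(a, b):\n    # sum two values\n    return a + b\n"),
    "bob": featurize("int add(int a,int b){\n\treturn a+b;\n}\n"),
}
sample = "def mul(a, b):\n    # multiply two values\n    return a * b\n"
predicted = identify(sample, profiles)
```

The features are combined without scaling, which lets the average line length dominate the vector. A practical pipeline would normalize the features and train a proper classifier on many samples per author.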

Keywords

Author Identification; Authorship Analysis; Convolutional Neural Network; Machine Learning; Code Analysis;

