A Plagiarism Detection Technique for Source Codes Considering Data Structures

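  • Kihwa Lee (Suresoft Technologies) ;
  • Yeoneo Kim (Department of Electrical and Computer Engineering, Pusan National University) ;
  • Gyun Woo (Department of Electrical and Computer Engineering, Pusan National University; Administrative Management Department, LG Electronics Smart Control Center)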
  • Received : 2014.02.10
  • Accepted : 2014.04.29
  • Published : 2014.06.30

Abstract

Though plagiarism is illegal and should be avoided, it still occurs frequently. Source code plagiarism is especially common because the digital nature of code makes it easy to copy. A variety of studies on preventing code plagiarism have been reported. However, previous plagiarism detection techniques for source code do not consider data structures, even though a program consists of both data structures and algorithms. This paper proposes a plagiarism detection technique for source code that considers data structures. Specifically, the data structures of two programs are represented as sets of trees and compared with each other using the Hungarian method. To show the usefulness of the technique, an experiment was performed on 126 source codes submitted as homework in an object-oriented programming course. When both the data structures and the algorithms of the source codes are considered, the precision and the F-measure improve by 22.6% and 19.3%, respectively, compared with the case where only the algorithms are considered.

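The comparison step described in the abstract (two sets of trees matched with the Hungarian method) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tree encoding as `(label, children)` tuples, the Jaccard overlap of node labels used as the tree-to-tree similarity, the normalization by the larger set size, and the function names `hungarian`, `tree_similarity`, and `set_similarity`.

```python
from collections import Counter


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where entry i is the column assigned to row i.
    Uses the O(n^2 * m) potentials formulation of the Hungarian method.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (m + 1)          # column potentials (1-based)
    p = [0] * (m + 1)            # p[j] = row matched to column j, 0 = free
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: path is complete
                break
        while j0:                # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assign = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assign[p[j] - 1] = j - 1
    return assign


def labels(tree):
    """Multiset of node labels in a (label, children) tree."""
    label, children = tree
    counts = Counter([label])
    for child in children:
        counts += labels(child)
    return counts


def tree_similarity(a, b):
    """Assumed measure: Jaccard overlap of the two label multisets."""
    ca, cb = labels(a), labels(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())


def set_similarity(trees_a, trees_b):
    """Best one-to-one matching score, normalized by the larger set size."""
    if not trees_a or not trees_b:
        return 0.0
    if len(trees_a) > len(trees_b):
        trees_a, trees_b = trees_b, trees_a   # hungarian() needs rows <= columns
    cost = [[1.0 - tree_similarity(a, b) for b in trees_b] for a in trees_a]
    assign = hungarian(cost)
    matched = sum(1.0 - cost[i][j] for i, j in enumerate(assign))
    return matched / len(trees_b)


# Hypothetical data-structure trees: each class is a root with its field types.
program_a = [
    ("class", [("int", []), ("String", [])]),
    ("class", [("List", [("int", [])])]),
]
program_b = [
    ("class", [("List", [("int", [])])]),
    ("class", [("String", []), ("int", [])]),
    ("class", [("double", [])]),
]
print(round(set_similarity(program_a, program_b), 3))
```

Matching the sets with an optimal assignment, rather than comparing trees in declaration order, keeps the score stable when classes or fields are merely reordered, which is a common disguise in copied code. Unmatched trees in the larger set lower the score through the normalization.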
