Automated Detecting and Tracing for Plagiarized Programs using Gumbel Distribution Model

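  • Jeong-Hoon Ji (Department of Computer Engineering, Pusan National University) ;
  • Gyun Woo (Department of Computer Engineering, Pusan National University) ;
  • Hwan-Gue Cho (Department of Computer Engineering, Pusan National University)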
  • Published : 2009.12.31

Abstract

Research on detecting, preventing, and judging software plagiarism has become widespread as interest in protecting and authenticating software intellectual property has grown. Many previous studies focused on comparing all pairs of submitted programs using attribute counting, token patterns, program parse trees, and similarity-measuring algorithms. It is also important to provide a clear-cut model for distinguishing plagiarism from collaboration. This paper proposes a source code clustering algorithm based on a probability model of the extreme value distribution. First, we propose an asymmetric distance measure pdist($P_a$, $P_b$) that measures the similarity of two programs $P_a$ and $P_b$. We then construct the Plagiarism Direction Graph (PDG) for a given program set, using pdist($P_a$, $P_b$) as edge weights, and transform the PDG into a Gumbel Distance Graph (GDG) model, because we found that the distribution of pdist($P_a$, $P_b$) scores closely resembles the Gumbel distribution, a well-known extreme value distribution. Second, we define a new notion, pseudo-plagiarism, which is a kind of virtual plagiarism forced by a very strong functional requirement in the specification. We conducted experiments on 18 groups of programs (more than 700 source codes) collected from the ICPC (International Collegiate Programming Contest) and KOI (Korean Olympiad for Informatics) programming contests. The experiments showed that most plagiarized codes could be detected with high sensitivity and that our algorithm successfully separated real plagiarism from pseudo-plagiarism.
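The pipeline in the abstract (asymmetric pairwise scores, a directed graph over all program pairs, then a Gumbel model of the score distribution) can be illustrated with a minimal Python sketch. The abstract does not define pdist, so `toy_pdist` below is a hypothetical stand-in based on token overlap, not the authors' measure. The sketch also assumes that a higher score means more similar, that the Gumbel parameters are estimated by the method of moments, and that an edge is flagged when its upper-tail probability under the fitted Gumbel CDF $F(x) = \exp(-\exp(-(x-\mu)/\beta))$ falls below a threshold `alpha`. None of these choices is confirmed by the source.

```python
import math
from collections import Counter
from itertools import permutations

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def toy_pdist(tokens_a, tokens_b):
    """Hypothetical asymmetric score: share of a's tokens also found in b.

    This is an illustrative stand-in, not the paper's pdist.
    """
    if not tokens_a:
        return 0.0
    overlap = Counter(tokens_a) & Counter(tokens_b)  # multiset intersection
    return sum(overlap.values()) / len(tokens_a)


def build_pdg(programs):
    """Directed graph as {(a, b): score} over all ordered program pairs."""
    return {(a, b): toy_pdist(programs[a], programs[b])
            for a, b in permutations(programs, 2)}


def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel (max) location mu and scale beta."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(x, mu, beta):
    """Upper-tail probability P(X >= x) under a Gumbel(mu, beta) model."""
    return 1.0 - math.exp(-math.exp(-(x - mu) / beta))


def suspicious_edges(pdg, alpha=0.05):
    """Edges whose score is improbably high under the fitted Gumbel model."""
    scores = list(pdg.values())
    if len(scores) < 2:
        return []
    mu, beta = fit_gumbel(scores)
    if beta == 0.0:  # all scores identical: nothing stands out
        return []
    return sorted(e for e, s in pdg.items() if gumbel_tail(s, mu, beta) < alpha)
```

Calling `suspicious_edges(build_pdg(programs))` on a dictionary that maps program names to token lists returns the directed pairs whose scores lie far in the upper tail. Because the score is asymmetric, the pair (a, b) can be flagged while (b, a) is not, which is what gives the graph a direction.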
