Applying Genomic Sequence Alignment Methodology for Source Codes Plagiarism Detection

유전체 서열의 정렬 기법을 이용한 소스 코드 표절 검사

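  • 강은미 (Department of Computer Science, Pusan National University)
  • 황미녕 (Korea Institute of Science and Technology Information)
  • 조환규 (School of Computer Science and Engineering, Pusan National University)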
  • Published : 2003.06.01

Abstract

The syntactic and semantic characteristics of a computer program can be represented by the keyword sequence extracted from its source code. The similarities and differences between two programs can therefore be identified by comparing their keyword sequences. Methods for measuring the similarity of two sequences have already been studied intensively in bioinformatics, where they are used to analyze genetic sequences. In this paper, we propose a new method that uses sequence alignment techniques to measure the similarity of two programs and to detect partial plagiarism. To evaluate the proposed method, we tested it on the actual program code submitted by 70 students attending a Data Structure course in 2001. The experimental results show that the proposed method is more effective and powerful than the fingerprint method, which is the most commonly used technique for plagiarism detection.

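The structural and syntactic characteristics of a typical computer program can be expressed as the sequence of keywords extracted from its source code. Comparing the extracted keyword sequences therefore gives a good picture of the similarities and differences between two programs. Methods for measuring sequence similarity have been actively studied in bioinformatics, which deals with biological gene sequences. In this paper, we propose a new method that measures the similarity between two programs and detects partial plagiarism using sequence alignment. To evaluate the performance of the proposed method, we checked for plagiarism using programs submitted by students who attended a data structures course in 2001 as experimental data. The experimental results showed that the proposed technique is more effective than the fingerprint method, the most widely used approach to plagiarism detection.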
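
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which alignment algorithm, keyword set, or scoring scheme the authors use, so everything below is an assumption made for illustration: Smith-Waterman local alignment is chosen because local alignment is a natural fit for partial plagiarism, `KEYWORDS` is a hypothetical subset of C keywords, and the `match`, `mismatch`, and `gap` scores and the normalization in `similarity` are arbitrary choices, not values from the paper.

```python
import re

# Hypothetical keyword set (assumption); the paper's actual list is not given here.
KEYWORDS = {"if", "else", "for", "while", "return",
            "int", "char", "void", "struct", "break"}


def keyword_sequence(source):
    """Return the keywords of a source text in order of appearance."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    return [tok for tok in tokens if tok in KEYWORDS]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between sequences a and b.

    The scoring values are illustrative defaults, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(src_a, src_b, match=2):
    """Local alignment score divided by the best score the shorter sequence allows."""
    a, b = keyword_sequence(src_a), keyword_sequence(src_b)
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return local_alignment_score(a, b, match=match) / (match * shorter)


original = "int f(int n) { int s = 0; for (;;) { if (n) break; } return s; }"
renamed = "int g(int k) { int total = 0; for (;;) { if (k) break; } return total; }"
different = "void h(char c) { while (c) { c = 0; } }"

print(similarity(original, renamed))
print(similarity(original, different))
```

Because identifiers are discarded before alignment, renaming variables and functions leaves the keyword sequence unchanged, so the renamed copy scores as a full match in this sketch while the unrelated function shares no aligned keywords with the original.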
