A Similarity Measurement and Visualization Method for the Analysis of Program Code

Lee, Youngjoo; Lee, Jeongjin

doi:10.9717/kmms.2013.16.7.802

Journal of Korea Multimedia Society (한국멀티미디어학회논문지)

Volume 16, Issue 7 / Pages 802-809 / 2013 / 1229-7771 (pISSN) / 2384-0102 (eISSN)

Korea Multimedia Society (한국멀티미디어학회)

프로그램 코드 분석을 위한 유사도 측정 및 가시화 기법

Youngjoo Lee (Manufacturing Technology Research Institute, Samsung Electronics);
Jeongjin Lee (School of Computing, Soongsil University)

Received : 2013.04.10
Accepted : 2013.05.28
Published : 2013.07.31

https://doi.org/10.9717/kmms.2013.16.7.802

Abstract

In this paper, we propose a method for measuring the similarity between two program codes. When the specifiers and keywords defined by the programming language appear as continuous patterns in the code, the method counts the frequency and length of the continuous patterns that occur in both codes. We also propose a method for visualizing the result of this analysis using formal concept analysis. The proposed method takes into account the adjacency of specifiers and keywords, which previous similarity measures did not consider. Because it analyzes the specifier and keyword patterns within each function, it can detect plagiarism regardless of the order in which functions are called or executed. The similarity results are visualized as a formal concept analysis lattice, which helps users understand the relations between program codes. In experiments, the proposed method achieved a plagiarism detection success rate of 96%. The method could also be applied to the analysis of general documents, not only program code.
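The core idea of the abstract, scoring two codes by the frequency and length of the continuous keyword patterns they share, can be illustrated with a minimal Python sketch. This is not the authors' formula, which the abstract does not give. The keyword subset `KEYWORDS`, the minimum pattern length `min_len`, the length-times-frequency weighting, and the normalization are all assumptions made for illustration, and the sketch works on whole snippets rather than per function.

```python
import re
from collections import Counter

# Assumption: a small illustrative subset of C keywords and type specifiers.
KEYWORDS = {"int", "float", "char", "void", "if", "else", "for",
            "while", "return", "static", "const"}


def keyword_sequence(code):
    """Return the keywords/specifiers in the order they appear in the code."""
    return [tok for tok in re.findall(r"[A-Za-z_]\w*", code) if tok in KEYWORDS]


def pattern_counts(seq, min_len=2):
    """Count every contiguous run of at least min_len keywords."""
    counts = Counter()
    for n in range(min_len, len(seq) + 1):
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def pattern_weight(counts):
    """Hypothetical weighting: each pattern contributes length * frequency."""
    return sum(len(p) * c for p, c in counts.items())


def similarity(code_a, code_b, min_len=2):
    """Score in [0, 1] based on the continuous keyword patterns both codes share."""
    a = pattern_counts(keyword_sequence(code_a), min_len)
    b = pattern_counts(keyword_sequence(code_b), min_len)
    # A shared pattern counts as often as it occurs in both codes.
    shared = sum(len(p) * min(a[p], b[p]) for p in a.keys() & b.keys())
    total = max(pattern_weight(a), pattern_weight(b))
    return shared / total if total else 0.0
```

Because only the keyword sequence is compared, a copy that merely renames identifiers keeps the same patterns and scores 1.0 under this sketch, while two snippets with no common run of adjacent keywords score 0.0.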
