A Study on Text Pattern Analysis Applying Discrete Fourier Transform - Focusing on Sentence Plagiarism Detection -

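  • 이정송 (Jung-Song Lee; Division of Electronics and Information Engineering, Chonbuk National University)
  • 박순철 (Soon-Cheol Park; Division of Computer Engineering, Chonbuk National University)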
  • Received : 2017.02.03
  • Accepted : 2017.03.09
  • Published : 2017.04.30

Abstract

Pattern analysis is one of the most important techniques in signal processing, image processing, and text mining. The discrete Fourier transform (DFT) is generally used to analyze patterns in signals and images. Assuming that the DFT can also be applied to the analysis of text patterns, we apply it, for the first time to our knowledge, to sentence plagiarism detection, which checks whether text patterns of one document also occur in other documents. We convert texts to ASCII codes to turn them into signals and use cross-correlation to detect simple forms of plagiarism such as copy-and-paste and term relocation. To detect plagiarism that relies on synonyms, translation, or summarization, we use WordNet similarity. The experiments use the official data set (2013 corpus) provided by PAN, a well-known workshop on plagiarism detection. Our method ranked fourth among the eleven sentence plagiarism detection methods compared.
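The signalization and cross-correlation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`dft`, `to_signal`, `cross_correlation`, `best_match`) are hypothetical, the naive O(n^2) transform stands in for an FFT to keep the example dependency-free, and the per-window normalization is an assumption added here so that scores are comparable across offsets; the abstract does not say how scores are normalized.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(x):
            acc += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def to_signal(text):
    """Signalize text: one sample per character (its ASCII code for plain English)."""
    return [float(ord(c)) for c in text]


def cross_correlation(query, source):
    """Return r[lag] = sum_t query[t] * source[t + lag] for every full overlap.

    Computed in the frequency domain as IDFT(conj(DFT(query)) * DFT(source)).
    """
    n = len(query) + len(source) - 1  # zero-pad so circular wrap-around cannot occur
    q = list(query) + [0.0] * (n - len(query))
    s = list(source) + [0.0] * (n - len(source))
    spectrum = [a.conjugate() * b for a, b in zip(dft(q), dft(s))]
    r = dft(spectrum, inverse=True)
    return [r[lag].real for lag in range(len(source) - len(query) + 1)]


def best_match(query_text, source_text):
    """Find the offset in source_text whose window correlates best with query_text.

    Assumption (not from the abstract): scores are normalized per window,
    which makes them Pearson correlations in [-1, 1].
    """
    if not query_text or len(query_text) > len(source_text):
        raise ValueError("query must be non-empty and no longer than source")
    q = to_signal(query_text)
    s = to_signal(source_text)
    m = len(q)
    q_mean = sum(q) / m
    q_centered = [v - q_mean for v in q]
    q_norm = math.sqrt(sum(v * v for v in q_centered))
    best_lag, best_score = 0, -2.0
    for lag, value in enumerate(cross_correlation(q_centered, s)):
        window = s[lag:lag + m]
        w_mean = sum(window) / m
        w_norm = math.sqrt(sum((v - w_mean) ** 2 for v in window))
        # Constant windows or queries carry no pattern; score them as 0.
        score = value / (q_norm * w_norm) if q_norm > 0 and w_norm > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    print(best_match("brown fox", source))
```

For a verbatim copy, the reported offset should coincide with `source.find(query)` and the score should be close to 1. Paraphrased or synonym-substituted passages would score lower under this scheme, which is the gap the abstract assigns to WordNet similarity.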
