An Efficient Approach to Mining Maximal Contiguous Frequent Patterns from Large DNA Sequence Databases

  • Karim, Md. Rezaul (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University)
  • Rashid, Md. Mamunur (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University)
  • Jeong, Byeong-Soo (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University)
  • Choi, Ho-Jin (Department of Computer Science, Korea Advanced Institute of Science and Technology)
  • Received : 2012.01.27
  • Accepted : 2012.02.13
  • Published : 2012.03.31

Abstract

Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are well suited to expressing the function and structure of DNA sequences and can therefore capture the characteristics common to related sequences. Biologists are interested in finding frequent, ordered arrangements of motifs that are responsible for the similar expression of a group of genes. To reduce mining time and complexity, however, most existing sequence mining algorithms either focus on finding short DNA sequences or require sequence lengths to be specified in advance. The challenge is to find longer sequences without specifying their lengths beforehand. In this paper, we propose an efficient approach to mining maximal contiguous frequent patterns from large DNA sequence datasets. The experimental results show that the proposed approach is memory-efficient and mines maximal contiguous frequent patterns within a reasonable time.
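
To make the term concrete, the sketch below enumerates maximal contiguous frequent patterns by brute force on a toy input. It is an illustration only and not the approach proposed in the paper, which the abstract does not detail. It assumes the usual definitions: a contiguous pattern is a substring, its support is the number of sequences that contain it, and a frequent pattern is maximal if no longer frequent pattern contains it. The function names, the `min_support` parameter, and the sample sequences are hypothetical.

```python
from collections import defaultdict


def frequent_contiguous_patterns(sequences, min_support):
    """Return {substring: support} for substrings found in >= min_support sequences.

    Naive enumeration of every substring; assumed definitions, not the paper's method.
    """
    support = defaultdict(int)
    for seq in sequences:
        # Collect distinct substrings so each sequence counts at most once.
        seen = set()
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                seen.add(seq[i:j])
        for sub in seen:
            support[sub] += 1
    return {p: s for p, s in support.items() if s >= min_support}


def maximal_patterns(frequent):
    """Keep frequent patterns that are not contained in a longer frequent pattern."""
    return sorted(
        p for p in frequent
        if not any(p != q and p in q for q in frequent)
    )


# Hypothetical toy database of three short DNA sequences.
sequences = ["ACGTAC", "CGTACG", "TACGTA"]
frequent = frequent_contiguous_patterns(sequences, min_support=3)
maximal = maximal_patterns(frequent)
print(maximal)  # ['ACG', 'CGTA', 'TAC']
```

No pattern length is supplied here: the lengths of the reported patterns fall out of the support threshold. The enumeration is quadratic in sequence length per sequence, so it only serves to show what the output looks like, not how to obtain it efficiently on large databases.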
