Efficient Mining of Interesting Patterns in Large Biological Sequences

  • Rashid, Md. Mamunur (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University) ;
  • Karim, Md. Rezaul (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University) ;
  • Jeong, Byeong-Soo (Department of Computer Engineering, College of Electronics and Information, Kyung Hee University) ;
  • Choi, Ho-Jin (Department of Computer Science, Korea Advanced Institute of Science and Technology)
  • Received : 2012.01.27
  • Accepted : 2012.02.10
  • Published : 2012.03.31

Abstract

Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. Most existing approaches use the number of occurrences as the main measure for deciding whether a pattern is interesting. In computational biology, however, a pattern that is not frequent may still be very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interestingness measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.
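The contrast between raw frequency and "support exceeding prior expectation" can be illustrated with a minimal sketch. This is not the measure proposed in the paper, which the abstract does not define. It assumes a simple background model in which bases occur independently with the frequencies observed in the sequence itself, and the function names (`observed_count`, `expected_count`, `surprise_ratio`) are hypothetical.

```python
from collections import Counter
from math import prod


def observed_count(sequence: str, pattern: str) -> int:
    """Count occurrences of `pattern` as a contiguous substring (overlaps allowed)."""
    m = len(pattern)
    return sum(
        1 for i in range(len(sequence) - m + 1) if sequence[i:i + m] == pattern
    )


def expected_count(sequence: str, pattern: str) -> float:
    """Expected occurrences under an ASSUMED independent-bases background model.

    Each base is drawn independently with the frequency it has in `sequence`;
    this stands in for the "prior expectation" and is an illustrative choice.
    """
    n = len(sequence)
    positions = n - len(pattern) + 1
    if positions <= 0:
        return 0.0
    freq = Counter(sequence)
    probability = prod(freq[base] / n for base in pattern)
    return positions * probability


def surprise_ratio(sequence: str, pattern: str) -> float:
    """Ratio of observed to expected count; values well above 1 suggest surprise."""
    expected = expected_count(sequence, pattern)
    if expected == 0:
        return 0.0  # pattern cannot occur under the background model
    return observed_count(sequence, pattern) / expected


if __name__ == "__main__":
    dna = "ACGT" * 5
    for pat in ("ACGT", "AC", "AA"):
        print(pat, observed_count(dna, pat), round(expected_count(dna, pat), 4))
```

Under this assumed model, a long pattern with a modest absolute count can still have a high observed-to-expected ratio, which is the kind of pattern a pure frequency threshold would miss.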
