An Efficient DNA Sequence Compression using Small Sequence Pattern Matching

  • Murugan, A. (Dr. Ambedkar Government College)
  • Punitha, K. (Agurchand Manmull Jain College)
  • Received : 2021.08.05
  • Published : 2021.08.30

Abstract

Bioinformatics blends biology with information technology and applies statistical methods to problems in nutrition, medical research, and the study of the living environment. The steady advance of DNA sequencing technologies has produced voluminous genomic data, especially DNA sequences, which demands more storage and bandwidth. Managing, interpreting, and accurately preserving this volume of information is currently a major hurdle for bioinformatics, and compression is a promising way to address it. This paper proposes a storage-efficient compression methodology together with an algorithm for exact matching of small patterns. The DNA sequence representation is used to find matches of 2 to 6 bases against the remaining input sequence. In the first level, the DNA sequence is transformed into ASCII symbols. In the second level, the result is compressed with the LZ77 method, after which grid variables of size 3 are formed to hold 100 characters. In the third level, the compressed output is held in the grid variables. The proposed algorithm, S_Pattern DNA, achieves an average compression ratio of 93%, which is better than that of existing compression algorithms, on datasets from the UCI repository.
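The first two levels described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' S_Pattern DNA implementation: the fixed 3-base codebook, the function names (`build_codebook`, `encode_level1`, `decode_level1`, `compress`, `decompress`), and the use of `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for a plain LZ77 coder are all assumptions made for illustration. The input is assumed to contain only the bases A, C, G, and T. The grid-variable stage is not described in enough detail here to sketch, so it is omitted.

```python
import zlib
from itertools import product

BASES = "ACGT"


def build_codebook(k=3):
    """Map every k-base pattern to one printable ASCII symbol.

    Hypothetical table: for k=3 there are 64 patterns, assigned to
    the characters chr(33) .. chr(96), which excludes lowercase letters.
    """
    patterns = ["".join(p) for p in product(BASES, repeat=k)]
    return {pat: chr(33 + i) for i, pat in enumerate(patterns)}


def encode_level1(seq, codebook, k=3):
    """Level 1 (assumed form): replace each k-base block with its symbol.

    Leftover bases that do not fill a block are kept as lowercase
    letters, which cannot collide with the codebook symbols.
    """
    seq = seq.upper()
    cut = len(seq) - len(seq) % k
    symbols = [codebook[seq[i:i + k]] for i in range(0, cut, k)]
    return "".join(symbols) + seq[cut:].lower()


def decode_level1(text, codebook):
    """Invert encode_level1: expand symbols, restore the lowercase tail."""
    inverse = {sym: pat for pat, sym in codebook.items()}
    return "".join(inverse.get(ch, ch.upper()) for ch in text)


def compress(seq):
    """Level 1 followed by an LZ77-family coder (zlib used as a stand-in)."""
    codebook = build_codebook()
    return zlib.compress(encode_level1(seq, codebook).encode("ascii"), 9)


def decompress(blob):
    """Undo compress(): inflate, then expand symbols back to bases."""
    codebook = build_codebook()
    return decode_level1(zlib.decompress(blob).decode("ascii"), codebook)


if __name__ == "__main__":
    sample = "ACGTACGTTTGACA" * 50 + "A"
    packed = compress(sample)
    assert decompress(packed) == sample
    print(len(sample), "bases ->", len(packed), "bytes")
```

Keeping the leftover bases in lowercase is one simple way to make the first level reversible without a length header; a scheme with variable-length patterns of 2 to 6 bases, as the abstract describes, would need a different codebook and a way to mark pattern boundaries.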
