A Pattern Matching Extended Compression Algorithm for DNA Sequences

  • Murugan, A. (Dr. Ambedkar Government College) ;
  • Punitha, K. (Agurchand Manmull Jain College)
  • Received : 2021.08.05
  • Published : 2021.08.30

Abstract

DNA sequencing provides fundamental data for genomics, bioinformatics, biology and many other research areas. With rapid advances in DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding ever more storage and bandwidth. Managing, analyzing and, in particular, storing these large volumes of data has become a major scientific challenge for bioinformatics. Such volumes also require fast transmission, effective storage, superior functionality and quick access to any record. Data storage accounts for a considerable proportion of the total cost of producing and analyzing DNA sequences. In particular, the disk storage capacity used for DNA sequences needs to be tightly controlled, but standard compression techniques fail to compress these sequences effectively, and several specialized techniques have been introduced for this purpose. To overcome these challenges, lossless compression techniques have become necessary. This paper describes a new DNA compression mechanism, the pattern matching extended compression algorithm, which reads the input sequence as segments, finds matching patterns and stores them in a permanent or temporary table based on the number of bases. The remaining unmatched sequence is converted into binary form and grouped into blocks of seven bits, and these blocks are in turn converted into ASCII form. Finally, the proposed algorithm dynamically calculates the compression ratio. The results show that the pattern matching extended compression algorithm outperforms cutting-edge compressors and is efficient in terms of compression ratio regardless of the file size of the data.
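
The final stage described above (binary conversion, grouping into seven bits, conversion to ASCII, and the compression ratio) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit base mapping, the zero padding of the last group, the function names and the ratio definition (original characters divided by packed characters) are all assumptions made for illustration, and the pattern-table stage is omitted because the abstract does not specify it.

```python
# Assumed 2-bit code per base; the abstract does not state the actual mapping.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_7bit(sequence: str) -> str:
    """Encode bases as bits, then turn each 7-bit group into one ASCII character."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    bits += "0" * ((-len(bits)) % 7)  # assumed: pad the last group with zero bits
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))


def unpack_7bit(packed: str, n_bases: int) -> str:
    """Invert pack_7bit; n_bases is needed to discard the padding bits."""
    bits = "".join(format(ord(ch), "07b") for ch in packed)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * n_bases, 2))


def compression_ratio(original: str, packed: str) -> float:
    """Original length over packed length, in characters (one assumed definition)."""
    return len(original) / len(packed)


if __name__ == "__main__":
    seq = "ACGTACGTACGTAC"          # 14 bases -> 28 bits -> 4 ASCII characters
    packed = pack_7bit(seq)
    assert unpack_7bit(packed, len(seq)) == seq  # lossless round trip
    print(len(packed), compression_ratio(seq, packed))
```

Seven-bit groups keep every output character inside the 7-bit ASCII range (0 to 127), which is presumably why the paper groups by seven rather than eight. Under these assumptions the ratio approaches 3.5 for long sequences before any gain from pattern matching.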
