http://dx.doi.org/10.22937/IJCSNS.2021.21.8.26

A Pattern Matching Extended Compression Algorithm for DNA Sequences  

Murugan, A. (Dr. Ambedkar Government College)
Punitha, K. (Agurchand Manmull Jain College)
Publication Information
International Journal of Computer Science & Network Security, v.21, no.8, 2021, pp. 196-202
Abstract
DNA sequencing provides fundamental data in genomics, bioinformatics, biology and many other research areas. With the rapid evolution of DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding more storage and bandwidth. Managing, analyzing and especially storing these large amounts of data has become a major scientific challenge for bioinformatics. Such large volumes of data also require fast transmission, effective storage, superior functionality and quick access to any record. Data storage accounts for a considerable proportion of the total cost of generating and analyzing DNA sequences. In particular, the disk storage used by DNA sequences needs to be tightly controlled, but standard compression techniques fail to compress these sequences effectively. Several specialized techniques have been introduced for this purpose, and lossless compression techniques have become necessary to overcome these challenges. This paper describes a new DNA compression mechanism, the pattern matching extended compression algorithm, which reads the input sequence as segments, finds matching patterns, and stores them in a permanent or temporary table based on the number of bases. The remaining unmatched sequence is converted into binary form and grouped into blocks of seven bits, and these blocks are in turn converted into ASCII form. Finally, the proposed algorithm dynamically calculates the compression ratio. The results show that the pattern matching extended compression algorithm outperforms cutting-edge compressors and is efficient in terms of compression ratio regardless of the file size of the data.
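The abstract's final stage (unmatched bases converted to binary, grouped into seven bits, and written as ASCII) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit code assigned to each base, the zero padding of the last group, the function names, and the definition of the ratio as original length over packed length are all assumptions made for illustration, and the pattern table stage is not shown.

```python
# Illustrative sketch only. The 2-bit base codes, the zero padding, and the
# ratio definition are assumptions, not details taken from the paper.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(sequence: str) -> str:
    """Encode bases as 2-bit codes, then emit one ASCII character per 7 bits."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    bits += "0" * (-len(bits) % 7)  # pad the final group up to seven bits
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))


def unpack_bases(packed: str, base_count: int) -> str:
    """Invert pack_bases; base_count tells the decoder where padding begins."""
    bits = "".join(format(ord(ch), "07b") for ch in packed)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * base_count, 2))


def compression_ratio(original: str, packed: str) -> float:
    """Original length over packed length, counting one byte per character."""
    return len(original) / len(packed)


if __name__ == "__main__":
    seq = "ACGTACG"
    packed = pack_bases(seq)
    print(repr(packed), compression_ratio(seq, packed))
    print(unpack_bases(packed, len(seq)) == seq)
```

Each seven-bit group maps to a code point in the range 0 to 127, so some output characters are non-printable control characters. The decoder needs the original base count because the padding bits are otherwise indistinguishable from a trailing `A`.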
Keywords
DNA sequence; pattern matching; lossless compression; compression ratio