http://dx.doi.org/10.22937/IJCSNS.2021.21.8.26

A Pattern Matching Extended Compression Algorithm for DNA Sequences

Murugan, A. (Dr. Ambedkar Government College)
Punitha, K. (Agurchand Manmull Jain College)

Publication Information

International Journal of Computer Science & Network Security, v.21, no.8, 2021, pp. 196-202

Abstract

DNA sequencing provides fundamental data for genomics, bioinformatics, biology and many other research areas. With rapid advances in DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding ever more storage and bandwidth. Managing, analyzing and, in particular, storing these large amounts of data has become a major scientific challenge for bioinformatics. Such large volumes of data also require fast transmission, effective storage, superior functionality and quick access to any record. Data storage accounts for a considerable proportion of the total cost of generating and analyzing DNA sequences. In particular, the disk storage capacity used for DNA sequences needs to be tightly controlled, but standard compression techniques fail to compress these sequences effectively, and several specialized techniques have been introduced for this purpose. To overcome these challenges, lossless compression techniques have become necessary. This paper describes a new DNA compression mechanism, the pattern matching extended compression algorithm, which reads the input sequence as segments, finds matching patterns and stores them in a permanent or temporary table based on the number of bases. The remaining unmatched sequence is converted into binary form and grouped into blocks of seven bits, and these blocks are in turn converted into ASCII form. Finally, the proposed algorithm dynamically calculates the compression ratio. The results show that the pattern matching extended compression algorithm outperforms cutting-edge compressors and demonstrates its efficiency in terms of compression ratio regardless of the file size of the data.
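
The binary-to-ASCII step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit code per base (A=00, C=01, G=10, T=11), the function names, the zero padding of the last group, and the definition of compression ratio as compressed length over original length are all assumptions, since the abstract does not specify them. The pattern-matching table step is omitted because the abstract gives no detail on how segments are matched.

```python
from typing import Tuple

# Assumed 2-bit code per base; the abstract does not state the mapping.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_7bit(sequence: str) -> Tuple[str, int]:
    """Encode bases as bits, then turn each 7-bit group into one ASCII character.

    Returns the packed string and the number of bases, which is needed
    to discard the padding bits when decoding.
    """
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    padding = (-len(bits)) % 7  # zero bits needed to fill the last group
    bits += "0" * padding
    packed = "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))
    return packed, len(sequence)


def unpack_7bit(packed: str, n_bases: int) -> str:
    """Invert pack_7bit: expand each character to 7 bits and read 2 bits per base."""
    bits = "".join(format(ord(ch), "07b") for ch in packed)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * n_bases, 2))


def compression_ratio(original: str, packed: str) -> float:
    """Assumed definition: compressed length / original length, in characters."""
    return len(packed) / len(original)


if __name__ == "__main__":
    seq = "ACGTACGTACGTAC"
    packed, n = pack_7bit(seq)
    assert unpack_7bit(packed, n) == seq  # the round trip is lossless
    print(len(seq), len(packed), compression_ratio(seq, packed))
```

Under these assumptions each 7-bit character carries 3.5 bases, so the packing step alone cannot push the ratio below about 0.286; any further gain would have to come from the pattern table.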

Keywords

DNA sequence; pattern matching; lossless compression; compression ratio
