A Pattern Matching Extended Compression Algorithm for DNA Sequences

Murugan, A.; Punitha, K.

doi:10.22937/IJCSNS.2021.21.8.26

International Journal of Computer Science & Network Security

Volume 21, Issue 8 / Pages 196-202 / 2021 / 1738-7906 (pISSN)


Murugan, A. (Dr. Ambedkar Government College);
Punitha, K. (Agurchand Manmull Jain College)

Received: 2021.08.05
Published: 2021.08.30


Abstract

DNA sequencing provides fundamental data for genomics, bioinformatics, biology and many other research areas. With the rapid evolution of DNA sequencing technology, a massive amount of genomic data, mainly DNA sequences, is produced every day, demanding more storage and bandwidth. Managing, analyzing and, in particular, storing these large amounts of data has become a major scientific challenge for bioinformatics. Such large volumes of data also require fast transmission, effective storage, superior functionality and quick access to any record. Data storage accounts for a considerable proportion of the total cost of producing and analyzing DNA sequences. In particular, the disk storage consumed by DNA sequences needs to be tightly controlled, but standard compression techniques fail to compress these sequences effectively, and several specialized techniques have been introduced for this purpose. To overcome these challenges, lossless compression techniques have become necessary. This paper describes a new DNA compression mechanism, the pattern matching extended compression algorithm, which reads the input sequence as segments, finds matching patterns and stores them in a permanent or temporary table depending on the number of bases. The remaining unmatched sequence is converted into binary form, the binary digits are grouped into units of seven bits, and each group is in turn converted into an ASCII character. Finally, the proposed algorithm dynamically calculates the compression ratio. The results show that the pattern matching extended compression algorithm outperforms cutting-edge compressors and proves its efficiency in terms of compression ratio regardless of the file size of the data.
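The handling of the unmatched remainder described in the abstract (bases to binary, binary to seven-bit groups, groups to ASCII) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-bit code per base, the zero padding of the final group, the function names `pack_unmatched`, `unpack` and `bits_per_base`, and the bits-per-base definition of the compression ratio are all assumptions made for illustration, and the pattern tables are omitted because the abstract does not specify them.

```python
# Assumed 2-bit code per base; the abstract does not state the actual mapping.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_unmatched(seq: str) -> str:
    """Encode bases as bits, then turn each 7-bit group into one ASCII character."""
    bits = "".join(BASE_TO_BITS[base] for base in seq.upper())
    # Pad with zeros to a multiple of 7 (the padding scheme is an assumption).
    bits += "0" * (-len(bits) % 7)
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))


def unpack(packed: str, n_bases: int) -> str:
    """Invert pack_unmatched; n_bases is needed to discard the padding bits."""
    bits = "".join(format(ord(ch), "07b") for ch in packed)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, 2 * n_bases, 2))


def bits_per_base(n_bases: int, packed: str) -> float:
    """One common way to express a compression ratio (assumed definition)."""
    return 8 * len(packed) / n_bases


if __name__ == "__main__":
    sequence = "ACGTACGTACGTAC"
    packed = pack_unmatched(sequence)
    assert unpack(packed, len(sequence)) == sequence  # lossless round trip
    print(len(sequence), "bases ->", len(packed), "ASCII characters")
    print("bits per base:", bits_per_base(len(sequence), packed))
```

Because each seven-bit value lies between 0 and 127, every group maps to a single ASCII character, so under these assumptions the output needs at most one stored byte per 3.5 bases before any pattern matching is applied.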

