http://dx.doi.org/10.22937/IJCSNS.2021.21.8.37

An Efficient DNA Sequence Compression using Small Sequence Pattern Matching  

Murugan, A. (Dr. Ambedkar Government College)
Punitha, K. (Agurchand Manmull Jain College)
Publication Information
International Journal of Computer Science & Network Security / v.21, no.8, 2021, pp. 281-287
Abstract
Bioinformatics blends biology with information technology and applies statistical methods to problems in nutrition, medical research, and the study of the living environment. The continued growth of DNA sequencing technologies has produced voluminous genomic data, especially DNA sequences, which calls for more storage and bandwidth. Bioinformatics currently faces the major hurdle of managing, interpreting, and accurately preserving this large body of information. Compression is a promising way to address these issues, and a methodology is proposed here for storing the data efficiently. In addition, an algorithm is introduced for exact matching of small patterns. The DNA sequence representation is used to find matches of 2 to 6 bases against the remaining input sequence. In the first level, the DNA sequence is transformed into ASCII symbols. In the second level, the result is compressed with the LZ77 method, after which grid variables of size 3 are formed to hold 100 characters. In the third level of compression, the compressed output is held in the grid variables. The proposed algorithm, S_Pattern DNA, achieves an average compression ratio of 93%, which is better than the existing compression algorithms on datasets from the UCI repository.
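The small-pattern matching step and the first two compression levels named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' S_Pattern DNA implementation: the abstract does not give the base-to-ASCII mapping, so the sketch assumes 3 bases are packed into one printable ASCII character; the LZ77 window and match-length limits are arbitrary; and all function names are hypothetical. The grid-variable third level is left out because the abstract does not specify how it works.

```python
from collections import Counter

BASES = "ACGT"  # assumed alphabet; ambiguity codes such as N are not handled


def count_small_patterns(seq, k_min=2, k_max=6):
    """Count exact occurrences of every substring of k_min..k_max bases."""
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def to_ascii_symbols(seq):
    """Level 1 (assumed mapping): pack 3 bases into one printable ASCII char.

    Returns the symbol string and the original length, which is needed to
    drop the padding bases again when decoding.
    """
    padded = seq + "A" * (-len(seq) % 3)  # pad to a multiple of 3 bases
    symbols = []
    for i in range(0, len(padded), 3):
        value = 0
        for base in padded[i:i + 3]:
            value = value * 4 + BASES.index(base)  # 2 bits per base
        symbols.append(chr(48 + value))  # values 0..63 map to '0'..'o'
    return "".join(symbols), len(seq)


def from_ascii_symbols(symbols, length):
    """Invert to_ascii_symbols and strip the padding bases."""
    bases = []
    for ch in symbols:
        value = ord(ch) - 48
        bases.append(BASES[value // 16] + BASES[(value // 4) % 4] + BASES[value % 4])
    return "".join(bases)[:length]


def lz77_compress(text, window=255, max_len=15):
    """Level 2: textbook LZ77 emitting (offset, length, next_char) triples."""
    tokens = []
    i = 0
    while i < len(text):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a literal next_char always exists.
            while (length < max_len and i + length < len(text) - 1
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decompress(tokens):
    """Rebuild the text by copying back-references and appending literals."""
    out = []
    for offset, length, ch in tokens:
        for _ in range(length):
            out.append(out[-offset])  # char-by-char copy allows overlapping matches
        out.append(ch)
    return "".join(out)
```

A round trip would call `to_ascii_symbols(seq)`, pass the symbol string to `lz77_compress`, and reverse the two steps with `lz77_decompress` and `from_ascii_symbols`. The brute-force match search is quadratic in the window size and is chosen for readability rather than speed.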
Keywords
DNA sequences; ASCII symbol; binary codes; compression; pattern matching; data compression