http://dx.doi.org/10.9708/jksci.2020.25.08.001

K-mer Based RNA-seq Read Distribution Method For Accelerating De Novo Transcriptome Assembly  

Kwon, Hwijun (School of Computer Science and Engineering, Kyungpook National University)
Jung, Inuk (School of Computer Science and Engineering, Kyungpook National University)
Abstract
In this paper, we propose a gene family based RNA-seq read distribution method to reduce the overall computation time of transcriptome assembly. To measure the performance of our transcriptome sequence data distribution method, we tested four types of data sets from the Arabidopsis thaliana genome: Whole Unclassified Reads (WUR), Family-Classified Reads, Model-Classified Reads, and Randomly Classified Reads. When de novo transcript assembly was run on distributed nodes using the model-classified data, the generated gene contigs showed a 95% match with the contigs generated from WUR, and the execution time was reduced by a factor of 4.2 compared to a single-node environment using the same resources.
Keywords
Gene Family; De novo transcriptome assembly; Distribution; Acceleration; Classification model; K-mer;
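
The idea of distributing reads by gene family before assembly can be illustrated with a minimal sketch. This is not the authors' classification model, which the abstract does not describe. It assumes a much simpler rule: each read goes to the family whose reference sequences share the most k-mers with it. The function names, the toy sequences, and the choice of k = 4 are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_family_index(families, k):
    """Map each gene family name to the k-mers of its reference sequences."""
    index = {}
    for name, seqs in families.items():
        family_kmers = set()
        for s in seqs:
            family_kmers |= kmers(s, k)
        index[name] = family_kmers
    return index


def assign_read(read, index, k):
    """Return the family sharing the most k-mers with the read.

    Returns None when the read shares no k-mer with any family
    (an assumed fallback, not taken from the paper).
    """
    read_kmers = kmers(read, k)
    best_name, best_score = None, 0
    for name, family_kmers in index.items():
        score = len(read_kmers & family_kmers)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def distribute(reads, index, k):
    """Group reads into per-family bins; each bin could go to one node."""
    bins = defaultdict(list)
    for read in reads:
        bins[assign_read(read, index, k)].append(read)
    return dict(bins)


# Toy data, purely illustrative.
families = {
    "famA": ["ACGTACGTAC"],
    "famB": ["TTTTGGGGCC"],
}
index = build_family_index(families, k=4)
reads = ["CGTACG", "TTGGGG", "AAAAAA"]
bins = distribute(reads, index, k=4)
```

In this sketch each bin would be handed to a separate assembly node, and reads that match no family (the `None` bin) would need their own handling, for example a catch-all node.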