Big Data Analytics in RNA-sequencing

  • Sung-Hun WOO (Department of Biomedical Laboratory Science, College of Software and Digital Healthcare Convergence, Yonsei University)
  • Byung Chul JUNG (Department of Nutritional Sciences and Toxicology, University of California, Berkeley)
  • Received: 2023.10.31
  • Accepted: 2023.11.13
  • Published: 2023.12.31

Abstract

With the development and widespread adoption of next-generation sequencing, RNA sequencing (RNA-seq) has rapidly become the tool of first choice for global transcriptome profiling. Alongside advances in bioinformatic tools, several variants of RNA-seq have emerged. Without a general understanding of these variants and of the bioinformatic approaches applied to them, however, it is difficult to interpret the complex data and extract their biological meaning. This review therefore covers two main topics. First, two major variants of RNA-seq are described and compared with standard RNA-seq, which offers guidance on which method best suits a given research purpose. Second, the most widely used RNA-seq data analyses are discussed: (1) exploratory data analysis and (2) pathway enrichment analysis. The exploratory data analysis section introduces principal component analysis, heatmaps, and volcano plots, which reveal overall trends in a dataset. The pathway enrichment analysis section introduces three generations of pathway enrichment analysis and explains how each derives enriched pathways from an RNA-seq dataset.
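To make the idea of pathway enrichment concrete, the following is a minimal sketch of over-representation analysis, the approach commonly described as the first generation of pathway enrichment analysis. It is an illustration under stated assumptions, not the procedure used in the review: the function names (`ora_p_value`, `benjamini_hochberg`, `enrich`), the gene identifiers, and the pathway names are all hypothetical, and the data are a toy example. The test is a one-sided hypergeometric (Fisher's exact) test, followed by a Benjamini-Hochberg adjustment for multiple testing.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_universe: number of genes measured (background)
    n_pathway:  number of background genes annotated to the pathway
    n_selected: number of differentially expressed (DE) genes
    n_overlap:  number of DE genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(de_genes, pathways, universe):
    """Test every pathway for over-representation among the DE genes."""
    background = set(universe)
    de = set(de_genes) & background
    names, p_values = [], []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(de & members)
        names.append(name)
        p_values.append(
            ora_p_value(len(background), len(members), len(de), overlap)
        )
    q_values = benjamini_hochberg(p_values)
    # Most significant pathway first.
    return sorted(zip(names, p_values, q_values), key=lambda row: row[1])


# Toy data (hypothetical): 20 measured genes, 5 of them called DE.
universe = [f"gene{i}" for i in range(1, 21)]
de_genes = ["gene1", "gene2", "gene3", "gene4", "gene5"]
pathways = {
    "pathway_A": ["gene1", "gene2", "gene3", "gene4", "gene11"],
    "pathway_B": ["gene6", "gene12", "gene13", "gene14", "gene15"],
}

results = enrich(de_genes, pathways, universe)
for name, p, q in results:
    print(f"{name}\tp={p:.4g}\tadjusted p={q:.4g}")
```

In this toy case, four of the five DE genes fall in `pathway_A` and none fall in `pathway_B`, so only `pathway_A` receives a small p-value. Because the test treats genes as independent and uses only a DE/non-DE cutoff, it ignores expression magnitude and pathway structure, which is the kind of limitation that the later generations of methods aim to address.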
