• Title/Summary/Keyword: De novo assembly

Experimental Analysis of Recent Works on the Overlap Phase of De Novo Sequence Assembly

  • Lim, Jihyuk;Kim, Sun;Park, Kunsoo
    • Journal of KIISE / v.45 no.3 / pp.200-210 / 2018
  • Given a set of DNA read sequences, de novo sequence assembly reconstructs a target sequence without a reference sequence. For reconstruction, the assembly needs the overlap phase, which computes all overlaps between every pair of reads. Since the overlap phase is the most time-consuming part of the whole assembly, the performance of the assembly depends on that of the overlap phase. There have been extensive studies on the overlap phase in various fields. Among them, three state-of-the-art results for the overlap phase are Readjoiner, SOF, and the Lim-Park algorithm. Recently, the rapid development of sequencing technology has made it possible to produce large read datasets at a low cost, and many platforms for generating DNA read datasets have been developed. Since the platforms produce datasets with different statistical characteristics, a performance evaluation of the overlap phase should consider datasets with these characteristics. In this paper, we compare and analyze the performance of the three algorithms on various large datasets.
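
To make the overlap phase concrete, the following is a minimal sketch of a naive all-pairs suffix-prefix overlap computation. It is not the method of Readjoiner, SOF, or the Lim-Park algorithm; the function names, the `min_len` parameter, and the toy reads are assumptions made for illustration only.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    `min_len` is assumed to be at least 1.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the maximum overlap.
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_len=3):
    """Naive quadratic scan over every ordered pair of distinct reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    overlaps[(i, j)] = n
    return overlaps


# Hypothetical toy reads, not taken from any dataset in the paper.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(all_overlaps(reads))  # {(0, 1): 3, (1, 2): 3}
```

The quadratic pair scan is what makes this phase expensive on large read sets, which is why specialized overlap algorithms avoid comparing every pair directly.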

K-mer Based RNA-seq Read Distribution Method For Accelerating De Novo Transcriptome Assembly

  • Kwon, Hwijun;Jung, Inuk
    • Journal of the Korea Society of Computer and Information / v.25 no.8 / pp.1-8 / 2020
  • In this paper, we propose a gene-family-based RNA-seq read distribution method to reduce the overall computation time of transcriptome assembly. To measure the performance of our transcriptome sequence data distribution method, we tested four types of data sets of the Arabidopsis thaliana genome (Whole Unclassified Reads (WUR), Family-Classified Reads, Model-Classified Reads, and Randomly Classified Reads). When de novo transcript assembly was run on distributed nodes using the model-classified data, the generated gene contigs matched 95% of the contigs generated from WUR, and the execution time was reduced by a factor of 4.2 compared to a single-node environment using the same resources.

De novo gene set assembly of the transcriptome of diploid, oilseed-crop species Perilla citriodora

  • Kim, Ji-Eun;Choe, Junkyoung;Lee, Woo Kyung;Kim, Sangmi;Lee, Myoung Hee;Kim, Tae-Ho;Jo, Sung-Hwan;Lee, Jeong Hee
    • Journal of Plant Biotechnology / v.43 no.3 / pp.293-301 / 2016
  • High-quality gene sets are necessary for functional research of genes. Although Perilla is a commonly cultivated oil crop and vegetable crop in Southeast Asia, the quality of its available gene set is insufficient. To construct a high-quality Perilla gene set, we sequenced mRNAs extracted from different tissues of Perilla citriodora, the wild species (2n = 20) of Perilla. We compared the quality of assemblies produced by Velvet and Trinity, two well-known de novo assemblers, and improved the de novo assembly pipeline by optimizing k-mers and removing redundant sequences. We then selected representative transcripts for loci according to several criteria. The improved assembly yielded a total of 86,396 transcripts and 38,413 representative transcripts. We evaluated the assembled transcripts by comparing them to 638 homologous Arabidopsis genes involved in fatty acid and TAG biosynthesis pathways. High proportions of full-length genes and transcripts in the assembled transcripts matched known genes in other species, indicating that the P. citriodora gene set can be applied in future functional studies. Our study provides a reference P. citriodora gene set for further studies. It will serve as a valuable genetic resource for elucidating the molecular basis of various metabolic processes.

Survey of the Applications of NGS to Whole-Genome Sequencing and Expression Profiling

  • Lim, Jong-Sung;Choi, Beom-Soon;Lee, Jeong-Soo;Shin, Chan-Seok;Yang, Tae-Jin;Rhee, Jae-Sung;Lee, Jae-Seong;Choi, Ik-Young
    • Genomics & Informatics / v.10 no.1 / pp.1-8 / 2012
  • Recently, technologies for detecting DNA sequence variation and profiling gene expression have been widely used in genome biology and genetics. Their application to genome studies has advanced particularly with the introduction of next-generation DNA sequencers (NGS), namely the Roche/454 and Illumina/Solexa systems, along with bioinformatic analysis technologies for whole-genome de novo assembly, expression profiling, DNA variation discovery, and genotyping. Both massive whole-genome shotgun paired-end sequencing and mate paired-end sequencing data are important for constructing a de novo assembly of a novel genome. For effective de novo assembly, it is necessary to have DNA sequence information from multiplatform NGS with at least 2× and 30× depth of genome coverage using Roche/454 and Illumina/Solexa, respectively. Massive short-read data from the Illumina/Solexa system are sufficient to discover DNA variation, which reduces the cost of DNA sequencing. Whole-genome expression profile data are useful for approaching genome systems biology through quantification of expressed RNAs from a whole-genome transcriptome, depending on the tissue samples. Hybrid mRNA sequences from Roche/454 and Illumina/Solexa are more powerful for finding novel genes through de novo assembly in any whole-genome sequenced species. Coverage of 20× and 50× of the estimated transcriptome sequences using Roche/454 and Illumina/Solexa, respectively, is effective for creating novel expressed reference sequences. However, an average of only 30× coverage of a transcriptome with Illumina/Solexa short reads is enough to check expression quantification against reference expressed sequence tag sequences.
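
The coverage figures quoted in this abstract follow the usual depth relation, depth = (number of reads × read length) / genome size. The sketch below illustrates that arithmetic only; the function names and the example genome size and read length are hypothetical and are not taken from the survey.

```python
import math


def coverage_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example: a 100 Mb genome sequenced with 100 bp reads.
genome_size = 100_000_000
read_length = 100

n = reads_needed(30, read_length, genome_size)
print(n)                                            # 30000000
print(coverage_depth(n, read_length, genome_size))  # 30.0
```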

Draft genome of Semisulcospira libertina, a species of freshwater snail

  • Gim, Jeong-An;Baek, Kyung-Wan;Hah, Young-Sool;Choo, Ho Jin;Kim, Ji-Seok;Yoo, Jun-Il
    • Genomics & Informatics / v.19 no.3 / pp.32.1-32.10 / 2021
  • Semisulcospira libertina, a species of freshwater snail, is widespread in East Asia. It is important as a food source. Additionally, it is a vector of clonorchiasis, paragonimiasis, metagonimiasis, and other parasitic diseases. Although S. libertina has ecological, commercial, and clinical importance, its whole genome has not yet been reported. Here, we determined the genome of S. libertina through de novo assembly. We assembled the whole genome of S. libertina and determined its transcriptome for the first time using the Illumina NovaSeq 6000 platform. According to the k-mer analysis, the genome size of S. libertina was estimated to be 3.04 Gb. Using RepeatMasker, repeats were found to account for 53.68% of the genome assembly. The genome data of S. libertina reported in this study will be useful for the identification and conservation of S. libertina in East Asia.
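
The abstract does not say how the k-mer analysis was carried out. One common approach, shown here as a minimal sketch under that assumption, divides the total number of k-mers by the depth at the main peak of the k-mer multiplicity histogram. The function name, the `min_depth` cutoff, and the toy histogram are hypothetical.

```python
def estimate_genome_size(histogram: dict, min_depth: int = 2) -> float:
    """Estimate genome size from a k-mer multiplicity histogram.

    `histogram` maps multiplicity (how often a k-mer occurs in the reads)
    to the number of distinct k-mers with that multiplicity.
    K-mers below `min_depth` are dropped as likely sequencing errors
    (an assumption of this sketch, not a rule from the paper).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # Depth at the main peak approximates the average k-mer coverage.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Hypothetical toy histogram with an error spike at 1 and a peak at 20.
toy_histogram = {1: 5000, 19: 100, 20: 800, 21: 100}
print(estimate_genome_size(toy_histogram))  # 1000.0
```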

Workflow for Building a Draft Genome Assembly using Public-domain Tools: Toxocara canis as a Case Study

  • Won, JungIm;Kong, JinHwa;Huh, Sun;Yoon, JeeHee
    • KIISE Transactions on Computing Practices / v.20 no.9 / pp.513-518 / 2014
  • It has become possible for small-scale laboratories to interpret large-scale genomic DNA, thanks to the reduction in sequencing cost brought by next-generation sequencing (NGS). De novo assembly is a method that creates a putative original sequence by reconstructing reads without using a reference sequence. Although there have been various studies on de novo assembly, it is still difficult to obtain the desired results even when using the same assembly procedures and analysis tools suggested in the reported studies. This is mainly because there are no specific guidelines for the assembly procedures and little practical know-how on the use of such analysis tools. In this study, to resolve these problems, we introduce the steps for determining the whole genome of an unknown DNA via NGS technology and de novo assembly, while describing the pros and cons of the various analysis tools used in each step. We used 350 Mbp of Toxocara canis DNA as an application case for the detailed explanation of each step. We also extend our work to the prediction of protein-coding genes and their functions from the draft genome sequence by comparing its homology with reference sequences of other nematodes.

Birth of an 'Asian cool' reference genome: AK1

  • Kim, Changhoon
    • BMB Reports / v.49 no.12 / pp.653-654 / 2016
  • The human reference genome, maintained by the Genome Reference Consortium, is conceivably the most complete genome assembly ever, since its first construction. It has continually been improved by incorporating corrections made to the previous assemblies, thanks to various technological advances. Many currently-ongoing population sequencing projects have been based on this reference genome, heightening hopes of the development of useful medical applications of genomic information, thanks to the recent maturation of high-throughput sequencing technologies. However, just one reference genome does not fit all the populations across the globe, because of the large diversity in genomic structures and technical limitations inherent to short read sequencing methods. The recent success in de novo construction of the highly contiguous Asian diploid genome AK1, by combining single molecule technologies with routine sequencing data without resorting to traditional clone-by-clone sequencing and physical mapping, reveals the nature of genomic structure variation by detecting thousands of novel structural variations and by finally filling in some of the prior gaps which had persistently remained in the current human reference genome. Now it is expected that the AK1 genome, soon to be paired with more upcoming de novo assembled genomes, will provide a chance to explore what it is really like to use ancestry-specific reference genomes instead of hg19/hg38 for population genomics. This is a major step towards the furthering of genetically-based precision medicine.

De novo transcriptome sequencing and gene expression profiling with/without B-chromosome plants of Lilium amabile

  • Park, Doori;Kim, Jong-Hwa;Kim, Nam-Soo
    • Genomics & Informatics / v.17 no.3 / pp.27.1-27.9 / 2019
  • Supernumerary B chromosomes were found in Lilium amabile (2n = 2x = 24), an endemic Korean lily that grows in the wild throughout the Korean Peninsula. The extra B chromosomes do not affect the host-plant morphology; therefore, whole transcriptome analysis was performed in 0B and 1B plants to identify differentially expressed genes. A total of 154,810 transcripts were obtained from over 10 Gbp of data by de novo assembly. By mapping the raw reads to the de novo transcripts, we identified 7,852 differentially expressed genes (|log2FC| > 10), in which 4,059 and 3,794 were up- and down-regulated, respectively, in 1B plants compared to 0B plants. Functional enrichment analysis revealed that various differentially expressed genes were involved in cellular processes including the cell cycle, chromosome breakage and repair, and microtubule formation, all of which may be related to the occurrence and maintenance of B chromosomes. Our data provide insight into transcriptomic changes and evolution of plant B chromosomes and deliver an informative database for future study of B chromosome transcriptomes in the Korean lily.
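
As an illustration of the fold-change criterion quoted in this abstract, the sketch below classifies genes by the absolute value of their log2 fold change between two conditions. It is not the authors' pipeline: the pseudocount, the function names, and the toy expression values are assumptions, and a real analysis would also involve normalization and statistical testing.

```python
import math


def log2_fold_change(expr_0b: float, expr_1b: float, pseudocount: float = 1.0) -> float:
    """log2 of the 1B/0B expression ratio; the pseudocount avoids division by zero."""
    return math.log2((expr_1b + pseudocount) / (expr_0b + pseudocount))


def classify(expression: dict, threshold: float = 10.0) -> dict:
    """Label each gene 'up', 'down', or 'unchanged' in 1B relative to 0B."""
    labels = {}
    for gene, (expr_0b, expr_1b) in expression.items():
        lfc = log2_fold_change(expr_0b, expr_1b)
        if lfc > threshold:
            labels[gene] = "up"
        elif lfc < -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical (0B, 1B) expression values for three made-up genes.
toy = {"geneA": (0, 2047), "geneB": (100, 120), "geneC": (4095, 1)}
print(classify(toy))  # {'geneA': 'up', 'geneB': 'unchanged', 'geneC': 'down'}
```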

A Study on Transcriptome Analysis Using de novo RNA-sequencing to Compare Ginseng Roots Cultivated in Different Environments

  • Yang, Byung Wook
    • Proceedings of the Plant Resources Society of Korea Conference / 2018.04a / pp.5-5 / 2018
  • Ginseng (Panax ginseng C.A. Meyer), one of the most widely used medicinal plants in traditional oriental medicine, is used for the treatment of various diseases. It is classified according to its cultivation environment as field cultivated ginseng (FCG) or mountain cultivated ginseng (MCG). However, little is known about differences in gene expression in ginseng roots between FCG and MCG. To investigate the whole transcriptome landscape of ginseng, we employed high-throughput sequencing on the Illumina HiSeq 2500 system and generated a large amount of transcriptome sequence data from ginseng roots. Approximately 77 million and 87 million high-quality reads were produced in the FCG and MCG root transcriptome analyses, respectively, and we obtained 256,032 assembled unigenes with an average length of 1,171 bp by de novo assembly methods. Functional annotation of the unigenes was performed using sequence similarity comparisons against the following databases: the non-redundant nucleotide database, the InterPro domains database, the Gene Ontology Consortium database, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. A total of 4,207 unigenes were assigned to specific metabolic pathways, and all of the known enzymes involved in starch and sucrose metabolism pathways were also identified in the KEGG library. This study indicated that alpha-glucan phosphorylase 1, putative pectinesterase/pectinesterase inhibitor 17, beta-amylase, and alpha-glucan phosphorylase isozyme H might be important factors in the differences in starch and sucrose metabolism between FCG and MCG grown in different environments.

Draft Genome of Toxocara canis, a Pathogen Responsible for Visceral Larva Migrans

  • Kong, Jinhwa;Won, Jungim;Yoon, Jeehee;Lee, UnJoo;Kim, Jong-Il;Huh, Sun
    • Parasites, Hosts and Diseases / v.54 no.6 / pp.751-758 / 2016
  • This study aimed to construct a draft genome of the adult female worm Toxocara canis using next-generation sequencing (NGS) and de novo assembly, and to find new genes after annotation using functional genomics tools. Using an NGS machine, we produced DNA read data of T. canis. The de novo assembly of the read data was performed using SOAPdenovo. RNA read data were assembled using Trinity. Structural annotation, homology search, functional annotation, classification of protein domains, and KEGG pathway analysis were carried out. In addition, recently developed tools such as MAKER, PASA, Evidence Modeler, and Blast2GO were used. For the scaffold DNA obtained, the N50 was 108,950 bp and the overall length was 341,776,187 bp. The N50 of the transcriptome was 940 bp, and its length was 53,046,952 bp. The GC content of the entire genome was 39.3%. The total number of genes was 20,178, and the total number of protein sequences was 22,358. Of the 22,358 protein sequences, 4,992 were newly observed in T. canis. The following previously unknown proteins were found: E3 ubiquitin-protein ligase cbl-b and antigen T-cell receptor, zeta chain for T-cell and B-cell regulation; endoprotease bli-4 for cuticle metabolism; mucin 12Ea and polymorphic mucin variant C6/1/40r2.1 for mucin production; tropomodulin-family protein and ryanodine receptor calcium release channels for muscle movement. We were able to find new hypothetical polypeptide sequences unique to T. canis, and the findings of this study can serve as a basis for extending our biological understanding of T. canis.
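
For readers unfamiliar with the N50 statistic reported in this abstract, the following minimal sketch computes it from a list of scaffold lengths: the N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name and the toy lengths are hypothetical and are not taken from the T. canis assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical scaffold lengths summing to 300; 80 + 70 = 150 reaches half.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70
```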