• Title/Abstract/Keywords: genomic data

Search results: 625 items (processing time: 0.029 s)

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication / Vol. 16, No. 2 / pp.119-126 / 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.

A maximum likelihood approach to infer demographic models

  • Chung, Yujin
    • Communications for Statistical Applications and Methods / Vol. 27, No. 3 / pp.385-395 / 2020
  • We present a new maximum likelihood approach to estimate demographic history using genomic data sampled from two populations. A demographic model such as an isolation-with-migration (IM) model explains the genetic divergence of two populations that split from their common ancestral population. The standard probability model for an IM model contains a latent variable, the genealogy, that represents gene-specific evolutionary paths and links the genetic data to the IM model. Under an IM model, a genealogy consists of two kinds of evolutionary paths of genetic data: vertical inheritance paths (coalescent events) through generations and horizontal paths (migration events) between populations. The computational complexity of IM model inference is a major limitation in analyzing genomic data. We propose a fast maximum likelihood approach to estimate IM models from genomic data. The first step analyzes genomic data and maximizes the likelihood of a coalescent tree that contains the vertical paths of the genealogy. The second step analyzes the estimated coalescent trees and finds the parameter values of an IM model that maximize the distribution of the coalescent trees after accounting for possible migration events. We evaluate the performance of the new method by analyzing simulated data and genomic data from two subspecies of common chimpanzees in Africa.

Ethical Considerations in Genomic Cohort Study

  • 최은경;김옥주
    • Journal of Preventive Medicine and Public Health / Vol. 40, No. 2 / pp.122-129 / 2007
  • During the last decade, genomic cohort studies have been developed in many countries by linking health data with genetic data from stored samples. Genomic cohort studies are expected to find key genetic components that contribute to common diseases, thereby promising great advances in genome medicine. While many countries endeavor to build biobank systems, biobank-based genome research has raised important ethical concerns, including genetic privacy, confidentiality, discrimination, and informed consent. Informed consent for biobanks poses an important question: whether true informed consent is possible in population-based genomic cohort research, where the nature of future studies is unforeseeable when consent is obtained. Because genetic information is sensitive, protecting privacy and maintaining confidentiality become important topics. To minimize ethical problems and achieve scientific goals to the fullest extent, each country strives to build its population-based genomic cohort research project by organizing public consultation, seeking public and expert consensus on research, and providing safeguards to protect privacy and confidentiality.

Statistical Issues in Genomic Cohort Studies

  • 박소희
    • Journal of Preventive Medicine and Public Health / Vol. 40, No. 2 / pp.108-113 / 2007
  • When conducting large-scale cohort studies, numerous statistical issues arise across study design, data collection, data analysis, and interpretation. In genomic cohort studies, these statistical problems become more complicated and need to be dealt with carefully. Rapid technical advances in genomic studies produce enormous amounts of data to be analyzed, and traditional statistical methods are no longer sufficient to handle these data. In this paper, we review several important statistical issues that occur frequently in large-scale genomic cohort studies, including measurement error and its correction methods, cost-efficient design strategies for main cohort and validation studies, inflated Type I error, gene-gene and gene-environment interactions, and time-varying hazard ratios. It is very important to employ appropriate statistical methods in order to make the best use of valuable cohort data and produce valid and reliable study results.
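The inflated Type I error mentioned in this abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes m independent tests, each run at level alpha (a simplification, since genetic markers are usually correlated), and the function names are hypothetical.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent
    tests, each performed at significance level alpha (assumption:
    independence between tests)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level that keeps the family-wise error
    rate at or below alpha when m tests are performed."""
    return alpha / m


if __name__ == "__main__":
    # Show how quickly the family-wise error rate grows with the number of tests.
    for m in (1, 10, 100, 1000):
        print(m, familywise_error_rate(0.05, m), bonferroni_threshold(0.05, m))
```

Under these assumptions, the chance of at least one false positive grows rapidly with the number of tests, which is why a stricter per-test threshold is commonly applied in genome-wide analyses.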

Prediction of Genomic Relationship Matrices using Single Nucleotide Polymorphisms in Hanwoo

  • 이득환;조충일;김내수
    • Journal of Animal Science and Technology / Vol. 52, No. 5 / pp.357-366 / 2010
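  • Genome-wide single nucleotide polymorphisms in Hanwoo were surveyed with the Illumina BeadArray™ Bovine SNP50 assay. About 32,567 or more loci were polymorphic, while about 5,554 loci showed no polymorphism. This may be because the number of families in the surveyed data was very limited; it may also indicate that the Hanwoo breeding-stock population is small. Relationships among individuals based on pedigree records, which form the basis of genetic analysis, were compared with relationships based on genomic information. Genomic information has the advantage that the degree of relationship can be estimated somewhat more accurately than from pedigree records, and that relationship coefficients distorted by pedigree recording errors can be corrected. Exploiting these advantages is expected to greatly improve the accuracy of genetic evaluation using genomic information.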
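As an illustration of how a genomic relationship matrix can be computed from SNP genotypes, here is a minimal sketch. It assumes one commonly used construction (allele-frequency centering with scaling by 2Σp(1-p), often attributed to VanRaden), genotypes coded as 0/1/2 copies of one allele, and allele frequencies estimated from the sample itself. The abstract does not state which estimator the authors used, and the function name is hypothetical.

```python
from typing import List


def genomic_relationship_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Build G = Z Z' / (2 * sum_j p_j (1 - p_j)) from 0/1/2 genotype codes.

    genotypes[i][j] is the allele count of individual i at locus j.
    Assumption: allele frequencies p_j are estimated from this sample.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency at each locus.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Scaling factor; monomorphic loci (p = 0 or 1) contribute nothing here.
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all loci are monomorphic; relationships are undefined")
    # Center each genotype by twice the allele frequency.
    centered = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(zi, zk)) / scale for zk in centered]
        for zi in centered
    ]


if __name__ == "__main__":
    # Toy data: three individuals, three loci (values are illustrative only).
    toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
    for row in genomic_relationship_matrix(toy):
        print(row)
```

Under this construction, loci without polymorphism in the sample add zero to both the numerator and the scaling factor, so they do not affect the estimated relationships.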

Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

  • Linder, Henry;Zhang, Yuping
    • Communications for Statistical Applications and Methods / Vol. 26, No. 4 / pp.411-430 / 2019
  • Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancers may respond differently to pathway-targeted therapies. Thus, it is important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic and epigenomic data. However, missing data are often present in the -omic profiles of interest. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputation of missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. The advantages of the proposed statistical framework are demonstrated through simulations and real applications to breast cancer subtypes. We demonstrate superior power to identify pathway disturbances, compared with other imputation strategies. We also identify differential pathway activity across different breast tumor subtypes.

One Step Cloning of Defined DNA Fragments from Large Genomic Clones

  • Scholz, Christian;Doderlein, Gabriele;Simon, Horst H.
    • BMB Reports / Vol. 39, No. 4 / pp.464-467 / 2006
  • Recently, the nucleotide sequences of entire genomes became available. This information, combined with older sequencing data, discloses the exact chromosomal location of millions of nucleotide markers stored in the databases at NCBI, EMBO or DDBJ. Beyond resolving, with a stroke of a pen, the intron/exon structures of all described genes within these genomes, the sequencing data open up other interesting possibilities. For example, the genomic mapping of the end sequences of the human, murine and rat BAC libraries generated at The Institute for Genomic Research (TIGR) now reveals the entire encompassed sequence of the inserts for more than a million of these clones. Since these clones are individually stored, they are now an invaluable source for experiments that depend on genomic DNA. Isolation of smaller fragments from such clones with standard methods is a time-consuming process. We describe here a reliable one-step cloning technique to obtain a DNA fragment with a defined size and sequence from larger genomic clones in less than 48 hours using a standard vector with a multiple cloning site, and common restriction enzymes and equipment. The only prerequisites are the sequences of the ends of the insert and of the underlying genome.

BaSDAS: a web-based pooled CRISPR-Cas9 knockout screening data analysis system

  • Park, Young-Kyu;Yoon, Byoung-Ha;Park, Seung-Jin;Kim, Byung Kwon;Kim, Seon-Young
    • Genomics & Informatics / Vol. 18, No. 4 / pp.46.1-46.4 / 2020
  • We developed BaSDAS (Barcode-Seq Data Analysis System), a GUI-based system for analyzing pooled knockout screening data, so that researchers with limited bioinformatics skills can analyze such data easily and effectively. BaSDAS supports the analysis of various pooled screening libraries, including yeast, human, and mouse libraries, and provides many useful statistical and visualization functions through a user-friendly web interface. We expect that BaSDAS will be a useful tool for the analysis of genome-wide screening data and will support the development of novel drugs based on functional genomics information.

A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages

  • Park, Seung-Jin;Kim, Jong-Hwan;Yoon, Byung-Ha;Kim, Seon-Young
    • Genomics & Informatics / Vol. 15, No. 1 / pp.11-18 / 2017
  • Nowadays, huge volumes of chromatin immunoprecipitation-sequencing (ChIP-Seq) data are generated to increase the knowledge on DNA-protein interactions in the cell, and accordingly, many tools have been developed for ChIP-Seq analysis. Here, we provide an example of a streamlined workflow for ChIP-Seq data analysis composed of only four packages in Bioconductor: dada2, QuasR, mosaics, and ChIPseeker. 'dada2' performs trimming of the high-throughput sequencing data. 'QuasR' and 'mosaics' perform quality control and mapping of the input reads to the reference genome and peak calling, respectively. Finally, 'ChIPseeker' performs annotation and visualization of the called peaks. This workflow runs well independently of operating systems (e.g., Windows, Mac, or Linux) and processes the input fastq files into various results in one run. R code is available at github: https://github.com/ddhb/Workflow_of_Chipseq.git.

Non-Synteny Regions in the Human Genome

  • Lee, Ki-Chan;Kim, Sang-Soo
    • Genomics & Informatics / Vol. 8, No. 2 / pp.86-89 / 2010
  • Closely related species share large genomic segments called syntenic regions, where genomic elements such as genes are arranged co-linearly among the species. While synteny is an important criterion in establishing orthologous regions between species, non-syntenic regions may display species-specific features. As the first step in cataloging human- or primate-specific genomic elements, we surveyed human genomic regions that are not syntenic with any of the non-primate mammalian genomes sequenced so far. Based on the data compiled in Ensembl databases, we were able to identify 10 such regions located on eight different human chromosomes. Interestingly, most of these highly human- or primate-specific loci are concentrated in subtelomeric or pericentromeric regions. It has been reported that subtelomeric regions in human chromosomes are highly plastic and filled with recently shuffled genomic elements. Pericentromeric regions also show a great deal of segmental duplication. Such genomic rearrangements may have caused these large human- or primate-specific genome segments.