• Title/Summary/Keyword: cd-hit-est

Search Result 1, Processing Time 0.015 seconds

The Analysis of Genome Database Compaction based on Sequence Similarity (시퀀스 유사도에 기반한 유전체 데이터베이스 압축 및 영향 분석)

  • Kwon, Sunyoung;Lee, Byunghan;Park, Seunghyun;Jo, Jeonghee;Yoon, Sungroh
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.4
    • /
    • pp.250-255
    • /
    • 2017
  • Given the explosion of genomic data and expansion of applications such as precision medicine, the importance of efficient genome-database management continues to grow. Traditional compression techniques may be effective in reducing the size of a database, but a new challenge follows in terms of performing operations such as comparison and searches on the compressed database. Based on that many genome databases typically have numerous duplicated or similar sequences, and that the runtime of genome analyses is normally proportional to the number of sequences in a database, we propose a technique that can compress a genome database by eliminating similar entries from the database. Through our experiments, we show that we can remove approximately 84% of sequences with 1% similarity threshold, accelerating the downstream classification tasks by approximately 10 times. We also confirm that our compression method does not significantly affect the accuracy of taxonomy diversity assessments or classification.