• Title/Abstract/Keywords: Sequence database

A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases

  • Ahmed, Chowdhury Farhan;Tanbeer, Syed Khairuzzaman;Jeong, Byeong-Soo
    • ETRI Journal
    • /
    • v.32 no.5
    • /
    • pp.676-686
    • /
    • 2010
  • Mining sequential patterns is an important research issue in data mining and knowledge discovery with broad applications. However, existing sequential pattern mining approaches consider only binary frequency values of items in sequences and assign equal importance/significance values to all distinct items. Therefore, they cannot represent many real-world scenarios accurately. In this paper, we propose a novel framework for mining high-utility sequential patterns, which extracts information that is more applicable to real life from sequence databases with non-binary frequency values of items in sequences and different importance/significance values for distinct items. Moreover, we propose two new algorithms for mining high-utility sequential patterns: UtilityLevel, which uses a level-wise candidate generation approach, and UtilitySpan, which uses a pattern growth approach. Extensive performance analyses show that our algorithms are very efficient and scalable for mining high-utility sequential patterns.
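
The utility notion in the abstract above (non-binary quantities combined with per-item importance values) can be illustrated with a minimal Python sketch. This is not UtilityLevel or UtilitySpan; it only assumes that utility is quantity times an item weight, and the names `sequence_utility`, `item_utilities`, `weights`, and the toy database are hypothetical.

```python
from typing import Dict, List

# Assumption: a sequence is an ordered list of itemsets, and each itemset
# maps an item to its (non-binary) quantity in that itemset.
Sequence = List[Dict[str, int]]


def sequence_utility(seq: Sequence, weights: Dict[str, float]) -> float:
    """Sum of quantity * weight over every item occurrence in one sequence."""
    return sum(qty * weights[item] for itemset in seq for item, qty in itemset.items())


def item_utilities(db: List[Sequence], weights: Dict[str, float]) -> Dict[str, float]:
    """Total utility of each single item over the whole sequence database."""
    totals: Dict[str, float] = {}
    for seq in db:
        for itemset in seq:
            for item, qty in itemset.items():
                totals[item] = totals.get(item, 0) + qty * weights[item]
    return totals


# Hypothetical importance values and a two-sequence toy database.
weights = {"a": 2, "b": 5, "c": 1}
db = [
    [{"a": 1, "b": 2}, {"c": 3}],
    [{"a": 4}, {"b": 1, "c": 1}],
]

# Keep only the items whose total utility reaches a minimum threshold.
min_utility = 10
high_utility = {i: u for i, u in item_utilities(db, weights).items() if u >= min_utility}
```

A frequency-only miner would treat the items `a`, `b`, and `c` identically here because each appears in both sequences; the weighted totals separate them.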

The Analysis of Genome Database Compaction based on Sequence Similarity

  • Kwon, Sunyoung;Lee, Byunghan;Park, Seunghyun;Jo, Jeonghee;Yoon, Sungroh
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.4
    • /
    • pp.250-255
    • /
    • 2017
  • Given the explosion of genomic data and the expansion of applications such as precision medicine, the importance of efficient genome-database management continues to grow. Traditional compression techniques may be effective in reducing the size of a database, but they make it harder to perform operations such as comparison and search on the compressed database. Based on the observations that genome databases typically contain numerous duplicated or similar sequences and that the runtime of genome analyses is normally proportional to the number of sequences in a database, we propose a technique that compresses a genome database by eliminating similar entries. Through our experiments, we show that we can remove approximately 84% of sequences with a 1% similarity threshold, accelerating the downstream classification tasks by approximately 10 times. We also confirm that our compression method does not significantly affect the accuracy of taxonomic diversity assessments or classification.
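
The idea of compacting a database by eliminating similar entries can be sketched as a greedy filter. This is only an illustration under stated assumptions: the abstract does not say which similarity measure is used, so k-mer Jaccard similarity stands in for it, and `kmers`, `jaccard`, `compact`, and `min_similarity` are hypothetical names.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compact(seqs: List[str], min_similarity: float = 0.9, k: int = 3) -> List[str]:
    """Keep a sequence only if it is not too similar to one already kept."""
    kept = []  # list of (sequence, k-mer set) representatives
    for seq in seqs:
        ks = kmers(seq, k)
        if all(jaccard(ks, other) < min_similarity for _, other in kept):
            kept.append((seq, ks))
    return [seq for seq, _ in kept]


# Toy database: an exact duplicate, a near duplicate, and an unrelated entry.
genomes = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
representatives = compact(genomes, min_similarity=0.75)
```

Each kept representative is compared against every later candidate, so the cost grows with the number of representatives; a real pipeline would need an index or clustering tool to scale.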

Theoretical Peptide Mass Distribution in the Non-Redundant Protein Database of the NCBI

  • Lim Da-Jeong;Oh Hee-Seok;Kim Hee-Bal
    • Genomics & Informatics
    • /
    • v.4 no.2
    • /
    • pp.65-70
    • /
    • 2006
  • Peptide mass mapping is the matching of experimentally generated peptide masses with the predicted masses of digested proteins contained in a database. The peptide mass fingerprinting technique identifies proteins by matching their constituent fragment masses to the theoretical peptide masses generated from a protein database. Thus, it is important to know the theoretical mass distribution of the database. However, few studies have reported the peptide mass distribution of a database. We analyzed the peptide mass distribution of the non-redundant protein sequence database of the NCBI after digestion with 15 different types of enzymes. To characterize the peptide mass distribution for the different digestion enzymes, a power-law distribution (Zipf's law) was applied to the distribution. After simulated digestion of the protein database, a rank-frequency plot of the peptide fragments was used to derive a Zipf's law curve for all enzymes. As a result, our data appear to fit Zipf's law with statistically significant parameter values.
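
The two steps described above, simulated digestion followed by a rank-frequency fit, can be sketched in Python. This is a simplified illustration, not the authors' procedure: only trypsin is modeled (cleavage after K or R unless the next residue is P), the fit is an ordinary least-squares line in log-log space, and `tryptic_digest`, `peptide_counts`, and `zipf_exponent` are hypothetical names.

```python
import math
from collections import Counter
from typing import Iterable, List


def tryptic_digest(protein: str) -> List[str]:
    """Cleave after K or R, except when the following residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_counts(proteins: Iterable[str]) -> Counter:
    """How often each peptide occurs across a digested database."""
    return Counter(p for protein in proteins for p in tryptic_digest(protein))


def zipf_exponent(frequencies: Iterable[float]) -> float:
    """Negated least-squares slope of log(frequency) against log(rank).

    Requires at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

For a real database, `zipf_exponent(peptide_counts(proteins).values())` would give the fitted exponent; an exponent near 1 corresponds to the classic Zipf curve.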

Nucleotide sequence analysis of the 5S ribosomal RNA gene of the mushroom Tricholoma matsutake

  • Hwang, Seon-Kap;Kim, Jong-Guk
    • Journal of Microbiology
    • /
    • v.33 no.2
    • /
    • pp.136-141
    • /
    • 1995
  • From a cluster of structural rRNA genes which has previously been cloned (Hwang and Kim, in submission; J. Microbiol. Biotechnol.), a 1.0-kb Eco RI fragment of DNA which shows significant homology to the 25S and 5S rRNAs of Tricholoma matsutake was used for sequence analysis. The nucleotide sequence was determined bidirectionally using a deletion series of the DNA fragment. By comparing the resultant 1016-base sequence with sequences in the database, both the 3' end of the 25S rRNA gene and the 5S rRNA gene were identified. The 5S rRNA gene is 118 bp in length and is located 158 bp downstream of the 3' end of the 25S rRNA gene. IGS1 and IGS2 (partial) sequences are also contained in the fragment. Multiple alignment of the 5S rRNA sequences was carried out with 5S rRNA sequences from some members of the subdivision Basidiomycotina obtained from the database. Phylogenetic analysis, with a distance matrix established by Kimura's 2-parameter method and a phylogenetic tree built by the UPGMA method, suggested that T. matsutake is closely related to Efibulobasidium albescens. The secondary structure of the 5S rRNA was also hypothesized to show a topology similar to that of its generally accepted eukaryotic counterpart.
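
Kimura's 2-parameter distance mentioned above has a closed form: with P the proportion of transitions and Q the proportion of transversions between two aligned sequences, d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q)). A minimal Python sketch follows; the function name `k2p_distance` and the handling of gaps (skipped) are assumptions of this example, not details taken from the paper.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned nucleotide sequences.

    Positions with gaps or ambiguous symbols are skipped; U is treated as T.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    compared = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Computing this distance for every pair of aligned 5S rRNA sequences yields the distance matrix that a clustering method such as UPGMA takes as input.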

Protein Sequence Search based on N-gram Indexing

  • Hwang, Mi-Nyeong;Kim, Jin-Suk
    • Bioinformatics and Biosystems
    • /
    • v.1 no.1
    • /
    • pp.46-50
    • /
    • 2006
  • With advances in experimental techniques in molecular biology, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method that retrieves similar sequences quickly and precisely. This method regards a protein sequence as text written in a language of 20 amino acid codes and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. After such tokens are indexed for all the sequences in the database, sequences can be searched with information retrieval algorithms. Using this new method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system which provides overall analysis results such as similar sequences with significant homologies, predicted subcellular locations of the query sequence, and major keywords extracted from annotations of similar sequences. We show experimentally that the N-gram indexing approach reduces retrieval time significantly and that it is as accurate as the currently popular search tool BLAST.
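
The indexing scheme described above, fixed-length N-gram tokens stored in an index and queried with information retrieval techniques, can be sketched as an inverted index. This is a minimal illustration and not the ProSeS implementation; the scoring (number of shared distinct N-grams) and the names `ngrams`, `build_index`, and `search` are assumptions of this example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def ngrams(seq: str, n: int = 3) -> List[str]:
    """All overlapping fixed-length tokens of a sequence string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(db: Dict[str, str], n: int = 3) -> Dict[str, Set[str]]:
    """Inverted index: N-gram token -> IDs of the sequences containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for seq_id, seq in db.items():
        for token in ngrams(seq, n):
            index[token].add(seq_id)
    return index


def search(query: str, index: Dict[str, Set[str]], n: int = 3,
           top_k: int = 5) -> List[Tuple[str, int]]:
    """Rank sequences by the number of distinct query N-grams they share."""
    hits: Counter = Counter()
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            hits[seq_id] += 1
    return hits.most_common(top_k)


# Hypothetical three-entry protein database.
proteins = {"P1": "MKTAYIAKQR", "P2": "MKTAYLLKQR", "P3": "GGGGSGGGGS"}
index = build_index(proteins)
results = search("KTAYIAK", index)
```

Only sequences that share at least one token with the query are touched, which is where the retrieval-time saving over a full pairwise scan comes from.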

PC-Based Hybrid Grid Computing for Huge Biological Data Processing

  • Cho, Wan-Sup;Kim, Tae-Kyung;Na, Jong-Hwa
    • Journal of the Korean Data and Information Science Society
    • /
    • v.17 no.2
    • /
    • pp.569-579
    • /
    • 2006
  • Recently, the amount of genome sequence data has been increasing rapidly owing to advanced computational techniques and experimental tools in biology. Sequence comparison is a very useful operation for predicting the functions of genes or proteins. However, comparing long sequences takes too much time, and many studies have addressed fast sequence comparison. In this paper, we propose a hybrid grid system based on the LanLinux system to improve the performance of sequence comparisons. Compared with conventional approaches, the hybrid grid is easy to construct, maintain, and manage because there is no need to install software on every node. As a real experiment, we constructed an orthologous database for 89 prokaryotes in just one week on the hybrid grid; the same task requires 33 weeks on a single computer.

Physical Database Design for DFT-Based Multidimensional Indexes in Time-Series Databases

  • Kim, Sang-Wook;Kim, Jin-Ho;Han, Byung-Il
    • Journal of Korea Multimedia Society
    • /
    • v.7 no.11
    • /
    • pp.1505-1514
    • /
    • 2004
  • Sequence matching in time-series databases is an operation that finds the data sequences whose changing patterns are similar to that of a query sequence. Typically, sequence matching employs a multi-dimensional index for efficient processing. To alleviate the dimensionality curse of the multi-dimensional index in high-dimensional cases, previous methods for sequence matching apply the Discrete Fourier Transform (DFT) to data sequences and take only the first two or three DFT coefficients as organizing attributes of the multi-dimensional index. This paper first points out the problems of such simple methods that take the first two or three coefficients, and proposes a novel solution for constructing the optimal multi-dimensional index. The proposed method analyzes the characteristics of a target database and, based on the analysis, identifies the organizing attributes with the best discrimination power. It also determines the optimal number of organizing attributes for efficient sequence matching by using a cost model. To show the effectiveness of the proposed method, we perform a series of experiments. The results show that the proposed method significantly outperforms the previous ones.
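
The feature extraction that the abstract builds on, keeping a few DFT coefficients as index attributes, can be sketched in Python. With an orthonormal DFT, the distance computed on any subset of coefficients never exceeds the true Euclidean distance (Parseval's theorem), so filtering on the index produces no false dismissals. The sketch takes the coefficient positions as a parameter; how the paper selects them and its cost model are not given in the abstract, so `dft_features`, `feature_distance`, and `indices` are hypothetical names.

```python
import cmath
import math
from typing import Iterable, List, Sequence


def dft_features(series: Sequence[float], indices: Iterable[int]) -> List[complex]:
    """Selected coefficients of the orthonormal DFT (scaled by 1/sqrt(n))."""
    n = len(series)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(series))
        for f in indices
    ]


def feature_distance(fa: Sequence[complex], fb: Sequence[complex]) -> float:
    """Euclidean distance in the reduced (complex) feature space."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """True Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two toy series; the conventional choice keeps the first few coefficients.
s1 = [1, 2, 3, 4, 5, 6, 7, 8]
s2 = [2, 2, 4, 4, 6, 6, 8, 9]
first_three = range(3)
lower_bound = feature_distance(dft_features(s1, first_three),
                               dft_features(s2, first_three))
```

Because the lower-bound property holds for any subset of coefficients, choosing different `indices` changes only how tight the bound is, that is, how many candidates survive the index filter.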

Trend and Technology of Gene and Genome Research

  • 이진성;김기환;서동상;강석우;황재삼
    • Journal of Sericultural and Entomological Science
    • /
    • v.42 no.2
    • /
    • pp.126-141
    • /
    • 2000
  • A major step towards understanding the genetic basis of an organism is the complete sequence determination of all genes in the target genome. The nucleotide sequence encoded in the genome contains the information that specifies the amino acid sequence of every protein and functional RNA molecule. In principle, it will be possible to identify every protein responsible for the structure and function of the body of the target organism. The pattern of expression in different cell types will specify where and when each protein is used. The amino acid sequence of the protein encoded by each gene will be derived from the conceptual translation of the nucleotide sequence. Comparison of these sequences with those of known proteins, whose sequences are stored in databases, will suggest an approximate function for many proteins. This mini review describes the development of new sequencing methods and the optimization of sequencing strategies for whole-genome, cDNA, and genomic analysis.

Effective Biological Sequence Alignment Method using Divide Approach

  • Choi, Hae-Won;Kim, Sang-Jin;Pi, Su-Young
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.17 no.6
    • /
    • pp.41-50
    • /
    • 2012
  • This paper presents a new sequence alignment method using a divide approach, which decomposes the alignment problem into several sub-alignments around exactly matching subsequences. In the proposed method, exact matching subsequences are bounded on the generalized suffix tree of the two sequences, such as protein domain length more than 7 and less than 7. Experimental results show that protein sequence pairs chosen from the PFAM database can be aligned using this method. The method also reduces running time by about 15%, and reduces space, relative to the conventional dynamic programming approach. The sequences were classified with 94% accuracy.
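
The decomposition step described above can be sketched as a recursive split around exact matches. This is an illustration under stated assumptions, not the authors' method: the standard-library `difflib.SequenceMatcher` stands in for the generalized suffix tree, and `decompose` and `min_anchor` are hypothetical names. The pieces tagged `"align"` are the sub-problems that would be passed to a conventional dynamic programming aligner.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Piece = Tuple[str, str, str]  # (kind, part of a, part of b)


def decompose(a: str, b: str, min_anchor: int = 4) -> List[Piece]:
    """Split an alignment problem around exact matches of length >= min_anchor.

    Returns pieces in left-to-right order. "anchor" pieces are identical in
    both sequences; "align" pieces still need a real alignment.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_anchor:
        # No usable anchor: the whole region is one small sub-alignment.
        return [("align", a, b)] if (a or b) else []
    left = decompose(a[:m.a], b[:m.b], min_anchor)
    right = decompose(a[m.a + m.size:], b[m.b + m.size:], min_anchor)
    anchor = ("anchor", a[m.a:m.a + m.size], b[m.b:m.b + m.size])
    return left + [anchor] + right


# Two toy protein sequences that differ at two positions.
seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
seq_b = "MKTAYLAKQRQISFVKAHFSRQ"
pieces = decompose(seq_a, seq_b)
```

Dynamic programming cost grows with the product of the two lengths, so replacing one large table with several small ones around the anchors is where savings in time and space would come from; the trade-off is that a forced anchor can exclude the globally optimal alignment.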