• Title/Summary/Keyword: sequence databases

Concentric Circle-Based Image Signature for Near-Duplicate Detection in Large Databases

  • Cho, A-Young;Yang, Won-Keun;Oh, Weon-Geun;Jeong, Dong-Seok
    • ETRI Journal / v.32 no.6 / pp.871-880 / 2010
  • Many applications dealing with image management need a technique for removing duplicate images or for grouping related (near-duplicate) images in a database. This paper proposes a concentric circle-based image signature which makes it possible to detect near-duplicates rapidly and accurately. An image is partitioned by radius and angle levels from the center of the image. Feature values are calculated using the average or variation between the partitioned sub-regions. The feature values distributed in sequence are formed into an image signature by hash generation. The hashing facilitates storage space reduction and fast matching. The performance was evaluated through discriminability and robustness tests. Using these tests, the particularity among the different images and the invariability among the modified images are verified, respectively. In addition, we also measured the discriminability and robustness by the distribution analysis of the hashed bits. The proposed method is robust to various modifications, as shown by its average detection rate of 98.99%. The experimental results showed that the proposed method is suitable for near-duplicate detection in large databases.

Video Index Generation and Search using Trie Structure (Trie 구조를 이용한 비디오 인덱스 생성 및 검색)

  • 현기호;김정엽;박상현
    • Journal of KIISE:Software and Applications / v.30 no.7_8 / pp.610-617 / 2003
  • Similarity matching in video databases is of growing importance in many new applications such as video clustering and digital video libraries. To provide efficient access to relevant data in large databases, there have been many research efforts in video indexing with diverse spatial and temporal features. However, most previous work relied on sequential matching methods or memory-based inverted file techniques, making it unsuitable for large video databases. To resolve this problem, this paper proposes an effective and scalable indexing technique that uses a trie, originally proposed for string matching, as the index structure. To build the index, we convert each frame into a symbol sequence using a window order heuristic and build a disk-resident trie from the set of symbol sequences. For query processing, we perform a depth-first search on the trie and execute a temporal segmentation. To verify the superiority of our approach, we perform several experiments with real and synthetic data sets. The results reveal that our approach consistently outperforms the sequential scan method, and that the performance gain is maintained even with large video databases.
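
The trie lookup described in this abstract can be illustrated with a minimal in-memory sketch. It assumes frames have already been converted to symbol sequences (the window order heuristic is not reproduced here), and it is not disk-resident. The names `TrieNode`, `insert`, and `search_prefix` are hypothetical and do not come from the paper.

```python
class TrieNode:
    """One trie node: child links keyed by symbol, plus ids of sequences ending here."""
    def __init__(self):
        self.children = {}
        self.ids = []


def insert(root, symbols, seq_id):
    """Walk (and create) one node per symbol, then record seq_id at the last node."""
    node = root
    for symbol in symbols:
        node = node.children.setdefault(symbol, TrieNode())
    node.ids.append(seq_id)


def search_prefix(root, prefix):
    """Return the ids of all stored sequences that start with `prefix`."""
    node = root
    for symbol in prefix:
        node = node.children.get(symbol)
        if node is None:
            return []
    # Depth-first traversal of the subtree below the prefix node.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.ids)
        stack.extend(current.children.values())
    return sorted(found)


root = TrieNode()
insert(root, "abc", 0)
insert(root, "abd", 1)
insert(root, "b", 2)
matches = search_prefix(root, "ab")
```

Sequences that share a prefix share the same path, so a query touches only the matching subtree instead of scanning every stored sequence.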

n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure (n-gram/2L: 공간 및 시간 효율적인 2단계 n-gram 역색인 구조)

  • Kim Min-Soo;Whang Kyu-Young;Lee Jae-Gil;Lee Min-Jae
    • Journal of KIISE:Databases / v.33 no.1 / pp.12-31 / 2006
  • The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval and in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size grows; 2) the query processing time increases only very slightly as the query length grows. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with the n-gram inverted index.
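
The two-step construction in this abstract can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes consecutive subsequences overlap by n-1 characters so that no n-gram is lost at a boundary (an assumption not stated in the abstract), and the names `build_two_level_index` and `positions` are hypothetical.

```python
from collections import defaultdict


def build_two_level_index(docs, m, n):
    """Build a (back, front) pair of inverted indexes.

    back:  subsequence -> [(doc_id, start offset in document)]
    front: n-gram      -> [(subsequence, offset inside subsequence)]
    """
    if not 1 <= n <= m:
        raise ValueError("require 1 <= n <= m")
    step = m - (n - 1)  # assumed overlap of n-1 characters between subsequences
    back = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Step 1: extract subsequences of length m (the last one may be shorter).
        for start in range(0, len(text) - n + 1, step):
            back[text[start:start + m]].append((doc_id, start))
    front = defaultdict(list)
    for sub in back:
        # Step 2: extract n-grams once per distinct subsequence.
        for off in range(len(sub) - n + 1):
            front[sub[off:off + n]].append((sub, off))
    return back, front


def positions(back, front, gram):
    """Resolve an n-gram to (doc_id, offset) pairs by joining the two levels."""
    return sorted(
        (doc_id, start + off)
        for sub, off in front.get(gram, [])
        for doc_id, start in back[sub]
    )


back, front = build_two_level_index(["ABCDABCD", "BCDA"], m=4, n=2)
bc_hits = positions(back, front, "BC")
```

Offsets of an n-gram inside a subsequence are stored once per distinct subsequence rather than once per document occurrence, which is the redundancy the abstract says the index removes.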

Expressed Sequence Tag Analysis of the Erythrocytic Stage of Plasmodium berghei

  • Seok, Ji-Woong;Lee, Yong-Seok;Moon, Eun-Kyung;Lee, Jung-Yub;Jha, Bijay Kumar;Kong, Hyun-Hee;Chung, Dong-Il;Hong, Yeon-Chul
    • Parasites, Hosts and Diseases / v.49 no.3 / pp.221-228 / 2011
  • Rodent malaria parasites, such as Plasmodium berghei, are practical and useful model organisms for human malaria research because of their similarity to human malaria parasites in structure, physiology, and life cycle. Exploiting the available genetic sequence information, we constructed a cDNA library from the erythrocytic stages of P. berghei and analyzed the expressed sequence tags (ESTs). A total of 10,040 ESTs were generated and assembled into 2,462 clusters. These EST clusters were compared against public protein databases, and 48 putative new transcripts, most of which were hypothetical proteins with unknown function, were identified. Genes encoding ribosomal or membrane proteins and purine nucleotide phosphorylases were the most abundant clusters in P. berghei. Protein domain analyses and Gene Ontology functional categorization revealed translation/protein folding, metabolism, protein degradation, and multiple families of variant antigens to be the most prevalent categories. The ESTs collected here and their bioinformatic analysis will be useful resources for identifying drug targets and vaccine candidates and for validating gene predictions of P. berghei.

Efficient Time-Series Subsequence Matching Using MBR-Safe Property of Piecewise Aggregation Approximation (부분 집계 근사법의 MBR-안전 성질을 이용한 효율적인 시계열 서브시퀀스 매칭)

  • Moon, Yang-Sae
    • Journal of KIISE:Databases / v.34 no.6 / pp.503-517 / 2007
  • In this paper, we address the MBR-safe property of Piecewise Aggregation Approximation (PAA) and propose an efficient subsequence matching method based on the MBR-safe PAA. A transformation is said to be MBR-safe if the low-dimensional MBR to which it transforms a high-dimensional MBR contains every low-dimensional sequence to which it transforms a high-dimensional sequence inside that MBR. Using an MBR-safe transformation, we can reduce the number of lower-dimensional transformations required in similar sequence matching, since it transforms a high-dimensional MBR directly into a low-dimensional MBR. Furthermore, PAA is known as an excellent lower-dimensional transformation since its computation is very simple and its performance is superior to that of other transformations. Thus, to integrate these advantages of PAA and MBR-safeness, we first formally confirm the MBR-safe property of PAA and then improve subsequence matching performance using the MBR-safe PAA. The contributions of the paper can be summarized as follows. First, we propose a PAA-based MBR-safe transformation, called mbrPAA, and formally prove the MBR-safeness of mbrPAA. Second, we propose an mbrPAA-based subsequence matching method and formally prove its correctness. Third, we present the notion of the entry reuse property and, using this property, propose an efficient method of constructing high-dimensional MBRs in subsequence matching. Fourth, we show the superiority of mbrPAA through extensive experiments. Experimental results show that, compared with the previous approach, our mbrPAA is 24.2 times faster in low-dimensional MBR construction and improves subsequence matching performance by up to 65.9%.
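
The MBR-safe property defined in this abstract can be checked on a small example. The sketch below assumes equal-width PAA segments and assumes the MBR is transformed by applying PAA to its low and high corner vectors; the abstract does not spell out how mbrPAA is defined, so this is only one plausible reading. The names `paa`, `bounding_box`, and `contains` are hypothetical.

```python
def paa(sequence, segments):
    """Piecewise Aggregation Approximation: the mean of each equal-width segment."""
    n = len(sequence)
    if segments <= 0 or n % segments != 0:
        raise ValueError("sequence length must be a positive multiple of segments")
    width = n // segments
    return [sum(sequence[i * width:(i + 1) * width]) / width for i in range(segments)]


def bounding_box(sequences):
    """Minimum bounding rectangle: per-dimension minimum and maximum."""
    lows = [min(column) for column in zip(*sequences)]
    highs = [max(column) for column in zip(*sequences)]
    return lows, highs


def contains(box, point):
    """True if the point lies inside the box in every dimension."""
    lows, highs = box
    return all(lo <= x <= hi for lo, x, hi in zip(lows, point, highs))


sequences = [[1, 2, 3, 4], [2, 1, 5, 3], [0, 4, 2, 6]]
lows, highs = bounding_box(sequences)

# Transform the high-dimensional MBR directly: two PAA calls instead of one per sequence.
low_dim_box = (paa(lows, 2), paa(highs, 2))

# MBR-safety check: every individually transformed sequence falls inside that box.
is_safe = all(contains(low_dim_box, paa(s, 2)) for s in sequences)
```

Averaging preserves element-wise ordering, so a sequence bounded by `lows` and `highs` stays bounded after PAA. That is why the box can be transformed without transforming each sequence inside it.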

An Effective Similarity Search Technique supporting Time Warping in Sequence Databases (시퀀스 데이타베이스에서 타임 워핑을 지원하는 효과적인 유사 검색 기법)

  • Kim, Sang-Wook;Park, Sang-Hyun
    • Journal of KIISE:Databases / v.28 no.4 / pp.643-654 / 2001
  • This paper discusses effective processing of similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods fail to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangular inequality. They have to scan the entire database and thus suffer from serious performance degradation in large databases. Another method that uses a suffix tree also shows poor performance due to the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to improve search performance in large databases without false dismissal. To attain this goal, we devise a new distance function $D_{tw-lb}$ that consistently underestimates the time warping distance and also satisfies the triangular inequality. $D_{tw-lb}$ uses a 4-tuple feature vector extracted from each sequence and is invariant to time warping. For efficient processing, we employ a multi-dimensional index that uses $D_{tw-lb}$ as its distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we perform extensive experiments. The results reveal that our method achieves a speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data.
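
The idea of a cheap distance that never exceeds the time warping distance can be illustrated with a small sketch. The abstract does not say which four features make up the feature vector. The code below assumes, purely for illustration, that they are the first, last, greatest, and smallest elements, and it uses a sum-of-absolute-differences time warping distance. The names `time_warping_distance`, `features`, and `lower_bound` are hypothetical.

```python
def time_warping_distance(s, q):
    """Dynamic-programming time warping distance with |x - y| as the element cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(q)
    for x in s:
        cur = [inf]
        for j, y in enumerate(q, start=1):
            # Extend the cheapest of the three allowed warping steps.
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def features(seq):
    """Assumed 4-tuple: (first, last, greatest, smallest). Requires a non-empty sequence."""
    return (seq[0], seq[-1], max(seq), min(seq))


def lower_bound(s, q):
    """Largest absolute difference between the two feature vectors."""
    return max(abs(a - b) for a, b in zip(features(s), features(q)))


s, q = [1, 3, 4], [2, 3, 6]
exact = time_warping_distance(s, q)
bound = lower_bound(s, q)
```

Under these assumptions the bound holds because any warping path must align the two first elements and the two last elements, and must align each sequence's extreme value with some element of the other sequence. Candidates whose bound already exceeds the query tolerance can therefore be discarded without computing the full distance, and no true match is dismissed.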

Structure function relationships amongst the purple acid phosphatase family of binuclear metal-containing enzymes

  • Hamilton, Susan
    • Proceedings of the Korean Society for Bioinformatics Conference / 2003.10a / pp.5-5 / 2003
  • The purple acid phosphatases comprise a family of binuclear metal-containing enzymes. The metal centre contains one ferric ion and one divalent metal ion. Spectroscopic studies of the monomeric, ~36 kDa mammalian purple acid phosphatases reveal the presence of an Fe(III)Fe(II) centre in which the metals are weakly antiferromagnetically coupled, whereas the dimeric, ~110 kDa plant enzymes contain either Fe(III)Zn(II) or Fe(III)Mn(II). The three-dimensional structures of the red kidney bean and pig enzymes show very similar arrangements of the metal ligands but some significant differences beyond the immediate vicinity of the metals. In addition to the catalytic domain, the plant enzyme contains a second domain of unknown function. A search of sequence databases was undertaken using a sequence pattern which includes the conserved metal-binding residues in the plant and animal enzymes. The search revealed the presence in plants of a 'mammalian-type' low molecular weight purple acid phosphatase, a high molecular weight form in some fungi, and a homologue in some bacteria. The catalytic mechanism of the enzyme has been investigated with a view to understanding the marked difference in specificity between the Fe-Mn sweet potato enzyme, which exhibits highly efficient catalysis towards both activated and unactivated phosphate esters, and other PAPs, which hydrolyse only activated esters. Comparison of the active site structures of the enzymes reveals some interesting differences between them which may account for the difference in specificity. The implications for understanding the physiological functions of the enzymes will be discussed.

One Step Cloning of Defined DNA Fragments from Large Genomic Clones

  • Scholz, Christian;Doderlein, Gabriele;Simon, Horst H.
    • BMB Reports / v.39 no.4 / pp.464-467 / 2006
  • Recently, the nucleotide sequences of entire genomes became available. This information, combined with older sequencing data, discloses the exact chromosomal location of millions of nucleotide markers stored in the databases at NCBI, EMBL or DDBJ. Besides resolving, at a stroke, the intron/exon structures of all described genes within these genomes, the sequencing data open up other interesting possibilities. For example, the genomic mapping of the end sequences of the human, murine and rat BAC libraries generated at The Institute for Genomic Research (TIGR) now reveals the entire encompassed sequence of the inserts for more than a million of these clones. Since these clones are individually stored, they are now an invaluable source for experiments which depend on genomic DNA. Isolation of smaller fragments from such clones with standard methods is a time-consuming process. We describe here a reliable one-step cloning technique to obtain a DNA fragment with a defined size and sequence from larger genomic clones in less than 48 hours using a standard vector with a multiple cloning site, and common restriction enzymes and equipment. The only prerequisites are the sequences of the ends of the insert and of the underlying genome.

Analysis of Expressed Sequence Tags from the Wood-Decaying Fungus Fomitopsis palustris and Identification of Potential Genes Involved in the Decay Process

  • Karim, Nurul;Shibuya, Hajime;Kikuchi, Taisei
    • Journal of Microbiology and Biotechnology / v.21 no.4 / pp.347-358 / 2011
  • Fomitopsis palustris, a brown-rot basidiomycete, causes the most destructive type of decay in wooden structures. In spite of its great economic importance, very little information is available at the molecular level regarding its complex decay process. To address this, we generated over 3,000 expressed sequence tags (ESTs) from a cDNA library constructed from F. palustris. Clustering of 3,095 high-quality ESTs resulted in a set of 1,403 putative unigenes comprising 485 contigs and 918 singlets. Homology searches based on BlastX analysis revealed that 78% of the F. palustris unigenes had a significant match to proteins deposited in the nonredundant databases. A subset of F. palustris unigenes showed similarity to the carbohydrate-active enzymes (CAZymes), including a range of glycosyl hydrolase (GH) family proteins. Some of these CAZyme-encoding genes were previously undescribed for F. palustris but are predicted to have potential roles in the biodegradation of wood. Among them, we identified and characterized a gene (FpCel45A) encoding the GH family 45 endoglucanase. Moreover, we provided a functional classification of 473 (34%) of the F. palustris unigenes using the Gene Ontology hierarchy. The annotated EST data sets and related analysis may be useful in providing an initial insight into the genetic background of F. palustris.

Transcriptome analysis of the short-term photosynthetic sea slug Placida dendritica

  • Han, Ji Hee;Klochkova, Tatyana A.;Han, Jong Won;Shim, Junbo;Kim, Gwang Hoon
    • ALGAE / v.30 no.4 / pp.303-312 / 2015
  • The intimate physical interaction between food algae and sacoglossan sea slugs is a pertinent system to test the theory that “you are what you eat.” Some sacoglossan mollusks ingest and maintain chloroplasts that they acquire from the algae for photosynthesis. The maintenance of photosynthesis in these sea slugs has often been explained by extensive horizontal gene transfer (HGT) from the food algae to the animal nucleus. Two large-scale expressed sequence tag databases of the green alga Bryopsis plumosa and the sea slug Placida dendritica were established using 454 pyrosequencing. Comparison of the transcriptomes showed no possible case of putative HGT, except an actin gene from P. dendritica, designated as PdActin04, which showed 98.9% identity in DNA sequence with the complementary gene from B. plumosa, BpActin03. Highly conserved homologues of this actin gene were found in related green algae, but not in other photosynthetic sea slugs. Phylogenetic analysis showed incongruence between the gene and known organismal phylogenies of the two species. Our data suggest that HGT is not the primary reason underlying the maintenance of short-term kleptoplastidy in Placida dendritica.