한국생물정보학회:학술대회논문집 (Proceedings of the Korean Society for Bioinformatics Conference)
한국생명정보학회 (Korean Society for Bioinformatics)
- Other
Proceedings of the 2nd Annual Conference of the Korean Society for Bioinformatics and Systems Biology, 2003
-
It is clear that computers will play a key role in the biology of the future. Even now, it is virtually impossible to keep track of the key proteins, their names and associated gene names, physical constants (e.g. binding constants, reaction constants, etc.), and known physical and genetic interactions without computational assistance. In this sense, computers act as an auxiliary brain, allowing one to keep track of thousands of complex molecules and their interactions. With the advent of gene expression array technology, many experiments are simply impossible without this computer assistance. In the future, as we seek to integrate the reductionist description of life provided by genomic sequencing into complex and sophisticated models of living systems, computers will play an increasingly important role in both analyzing data and generating experimentally testable hypotheses. The future of bioinformatics is thus being driven by potent technological and scientific forces. On the technological side, new experimental technologies such as microarrays, protein arrays, high-throughput expression and three-dimensional structure determination provide rapidly increasing amounts of detailed experimental information on a genomic scale. On the computational side, faster computers, ubiquitous computing systems, and high-speed networks provide a rapidly changing environment of potentially immense power. The challenges we face are enormous: How do we create stable data resources when both the science and computational technology change rapidly? How do we integrate and synthesize information from many disparate subdisciplines, each with its own vocabulary and viewpoint? How do we 'liberate' the scientific literature so that it can be incorporated into electronic resources? How do we take advantage of advances in computing and networking to build the international infrastructure needed to support a complete understanding of biological systems?
The seeds of the solutions to these problems exist, at least partially, today. These solutions emphasize ubiquitous high-speed computation; database interoperation, federation, and integration; and the development of research networks that capture scientific knowledge rather than just the ABCs of genomic sequence. I will discuss a number of these solutions, with examples from existing resources, as well as areas where solutions do not currently exist, with a view to defining what bioinformatics and biology will look like in the future.
-
With biomedical literature expanding so rapidly, there is an urgent need to discover and organize knowledge extracted from texts. Although factual databases contain crucial information, the overwhelming amount of new knowledge remains in textual form (e.g. MEDLINE). In addition, new terms are constantly coined for the relationships linking new genes, drugs, proteins, etc. As the biomedical literature expands, more systems are applying a variety of methods to automate the process of knowledge acquisition and management. In my talk, I focus on GENIA, a project of our group at the University of Tokyo whose objective is to construct a system for extracting protein-protein interaction information from MEDLINE abstracts. The talk covers: (1) techniques we use for named entity recognition, namely (1-a) SOHMM (self-organized HMM), (1-b) a maximum entropy model, and (1-c) a lexicon-based recognizer; (2) treatment of term variants and acronym finders; (3) event extraction using a full parser; and (4) linguistic resources for text mining (the GENIA corpus), including (4-a) semantic tags, (4-b) structural annotations, (4-c) co-reference tags, and (4-d) the GENIA ontology. I will also talk about a possible extension of our work that links the findings of molecular biology with clinical findings, and claim that text-based or concept-based biology would be a viable alternative to systems biology, which tends to emphasize the role of simulation models in bioinformatics.
-
A large number of in vitro experiments on the inhibition of kinases and proteases are reported in the literature and compiled in the ProLINT database. Using this wealth of knowledge, we have carried out an analysis of the ligand specificity of these two classes of proteins. Each of the proteases and kinases included in the database has been assigned a consensus ligand fragment signature, based on the available information about its interaction with different ligands. A set of 43 fragments efficiently represents every ligand. We have then organized the consensus fragment signatures for every protein in the form of a cluster-tree diagram. Such trees are also constructed from other sequence, structure and physical considerations. Cluster-to-cluster comparison between these analyses provides valuable information about ligand-specific interactions and similarities between proteins.
-
The purple acid phosphatases (PAPs) comprise a family of binuclear metal-containing enzymes. The metal centre contains one ferric ion and one divalent metal ion. Spectroscopic studies of the monomeric, ${\sim}$36 kDa mammalian purple acid phosphatases reveal the presence of an Fe(III)Fe(II) centre in which the metals are weakly antiferromagnetically coupled, whereas the dimeric, ${\sim}$110 kDa plant enzymes contain either Fe(III)Zn(II) or Fe(III)Mn(II). The three-dimensional structures of the red kidney bean and pig enzymes show very similar arrangements of the metal ligands but some significant differences beyond the immediate vicinity of the metals. In addition to the catalytic domain, the plant enzyme contains a second domain of unknown function. A search of sequence databases was undertaken using a sequence pattern which includes the conserved metal-binding residues in the plant and animal enzymes. The search revealed the presence in plants of a 'mammalian-type' low molecular weight purple acid phosphatase, a high molecular weight form in some fungi, and a homologue in some bacteria. The catalytic mechanism of the enzyme has been investigated with a view to understanding the marked difference in specificity between the Fe-Mn sweet potato enzyme, which exhibits highly efficient catalysis towards both activated and unactivated phosphate esters, and other PAPs, which hydrolyse only activated esters. Comparison of the active site structures of the enzymes reveals some interesting differences between them which may account for the difference in specificity. The implications for understanding the physiological functions of the enzymes will be discussed.
-
In this paper, we propose a probabilistic framework to predict the interaction probability of proteins. The notions of domain combination and domain combination pair are newly introduced, and the prediction model in the framework takes the domain combination pair as the basic unit of protein interaction in order to overcome the limitations of conventional prediction systems based on domain pairs. The framework consists of a prediction preparation stage and a prediction service stage. In the prediction preparation stage, two appearance probability matrices are constructed; they hold information on the appearance frequencies of domain combination pairs in the interacting and non-interacting sets of protein pairs. Based on the appearance probability matrices, a probability equation is devised. The equation maps a protein pair to a real number in the range of 0 to 1. Two distributions, one for the interacting and one for the non-interacting set of protein pairs, are obtained using the equation. In the prediction service stage, the interaction probability of a protein pair is predicted using the distributions and the equation. The validity of the prediction model is evaluated for the interacting set of protein pairs in yeast and an artificially generated non-interacting set of protein pairs. When 80% of the interacting protein pairs in the DIP database are used as the learning set of interacting protein pairs, a sensitivity of 86% and a specificity of 56% are achieved within our framework.
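The preparation stage described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the abstract does not define a domain combination, so here it is assumed to be any non-empty subset of a protein's domains, and a domain combination pair is one combination from each protein. The function names and the toy domain names are hypothetical, and the probability equation itself is not reproduced because the abstract does not give it.

```python
from collections import Counter
from itertools import combinations, product


def domain_combinations(domains):
    """All non-empty subsets of a protein's domain set, as sorted tuples (assumed definition)."""
    items = sorted(set(domains))
    return [c for r in range(1, len(items) + 1) for c in combinations(items, r)]


def combination_pairs(domains_a, domains_b):
    """Unordered pairs of domain combinations, one taken from each protein."""
    return {
        tuple(sorted((ca, cb)))
        for ca, cb in product(domain_combinations(domains_a),
                              domain_combinations(domains_b))
    }


def appearance_frequencies(protein_pairs):
    """Relative frequency of each domain combination pair over a set of protein pairs."""
    counts = Counter()
    for doms_a, doms_b in protein_pairs:
        counts.update(combination_pairs(doms_a, doms_b))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Toy interacting set; the domain names are made up for illustration.
interacting = [(["SH3"], ["PRM"]), (["SH3", "SH2"], ["PRM"])]
freq = appearance_frequencies(interacting)
```

Running the same function on a non-interacting set would give the second table; how the two tables are combined into a score between 0 and 1 is left open here.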
-
Determining the binding sites in protein-nucleic acid complexes is essential to the complete understanding of protein-nucleic acid interactions and to the development of new drugs. We have developed a set of algorithms for analyzing protein-nucleic acid interactions and for predicting potential binding sites in protein-nucleic acid complexes. The algorithms were used to analyze the hydrogen-bonding interactions in protein-RNA and protein-DNA complexes. The analysis, performed at both the atomic and residue levels, revealed several interesting interaction patterns and differences between the two types of nucleic acids. The interaction patterns were used for predicting potential binding sites in new protein-RNA complexes.
-
Large-scale protein interaction maps provide a new, global perspective from which to analyse protein function. PSIMAP, the Protein Structural Interactome Map, is a database of all the structurally observed interactions between superfamilies of protein domains with known three-dimensional structure in the PDB. PSIMAP incorporates both functional and evolutionary information into a single network. It makes it possible to estimate the age of protein domains in terms of taxonomic diversity, interaction and function. One consequence is the ability to predict the most important protein domain structure in evolution. We present a global analysis of PSIMAP using several distinct network measures relating to centrality, interactivity, fault-tolerance, and taxonomic diversity. We found the following results:
${\bullet}$ Centrality: we show that the center and barycenter of PSIMAP do not coincide, and that the superfamilies forming the barycenter relate to very general functions, while those constituting the center relate to enzymatic activity.
${\bullet}$ Interactivity: we identify the P-loop and immunoglobulin superfamilies as the most highly interactive. We successfully use connectivity and cluster index, which characterise the connectivity of a superfamily's neighbourhood, to discover superfamilies of complex I and II. This is particularly significant as the structure of complex I is not yet solved.
${\bullet}$ Taxonomic diversity: we found that highly interactive superfamilies are in general taxonomically very diverse and are thus amongst the oldest. This led to the prediction of the oldest and most important protein domain in the evolution of life.
${\bullet}$ Fault-tolerance: we found that the network is very robust: for the majority of superfamilies, removal from the network will not break it up.
Overall, we can single out the P-loop containing nucleotide triphosphate hydrolases superfamily, as it is the most highly connected and has the highest taxonomic diversity. In addition, this superfamily has the highest interaction rank, is the barycenter of the network (it has the shortest average path to every other superfamily in the network), and is an articulation vertex, whose removal will disconnect the network. More generally, we conclude that the graph-theoretic and taxonomic analysis of PSIMAP is an important step towards the understanding of protein function and could be an important tool for tracing the evolution of life at the molecular level.
-
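Extracting events from life-science documents can greatly help researchers in the field. Previous studies mainly defined patterns for event verbs and then extracted events only through those predefined patterns. However, defining every pattern by hand is too costly, so a method for automatically extracting or expanding patterns is needed. Moreover, learning requires a substantial training corpus, which is also not sufficiently available. In this paper, we learn from a small number of correct events generated by initial patterns and then attempt pattern expansion and event extraction with a co-training method that uses co-occurrence information and pattern information. The experiments confirmed that the pattern information of event verbs is useful, and newly showed that co-occurrence information between entities within a candidate event and grammatical-relation information are also very important. In experiments on 162 event verbs in the GENIA corpus, we obtained a precision of 88.02% and a recall of 79.25%.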
-
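If cancer growth could be predicted accurately, appropriate treatment and prescriptions could be given to the many people suffering from cancer. Much research has been carried out to predict cancer growth, and the approaches fall broadly into top-down and bottom-up design methods. The top-down approach makes it easy to grasp the overall flow but hard to account for local characteristics, whereas the bottom-up approach makes it easy to account for local characteristics but hard to grasp the overall flow. In this paper, we use a hybrid of the two approaches so that the locally irregular growth of a cancer and its metastasis to other tissues can be observed at the same time. We also found that the simulated cancer model resembles actual clinical appearances.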
-
Lee, Wan-Seon; Jeon, Ki-Seon; Um, Chan-Hwi; Hwang, Seung-Young; Jung, Jin-Wook; Kim, Seung-Jun; Kang, Kyung-Sun; Park, Joon-Suk; Hwang, Jae-Woong; Kang, Jong-Soo; Lee, Gyoung-Jae; Chon, Kum-Jin; Kim, Yang-Suk
Toxicogenomics is now emerging as one of the most important applications of genomics, because toxicity testing based on gene expression profiles is expected to be more precise and efficient than the current histopathological approach in the pre-clinical phase. One of the challenges in toxicogenomics is the construction of an intelligent database management system that can deal with very heterogeneous and complex data from many different experimental and information sources. Here we present a new toxicogenomics database developed as a part of the 'Toxicogenomics for Efficient Safety Test (TEST) project'. The TEST database focuses especially on the connectivity of heterogeneous data and on an intelligent query system that enables users to draw insight from the complex data sets. The database deals with four kinds of information: compound information, histopathological information, gene expression information, and annotation information. Currently, the TEST database has toxicogenomics information for 12 molecules in 4 efficacy classes: anti-cancer, antibiotic, hypotension, and gastric ulcer. Users can easily access all kinds of detailed information about these compounds and, at the same time, check the confidence of the retrieved information by browsing the quality of the experimental data and the toxicity grade of each gene generated by our toxicology annotation system. The intelligent query system is designed for multiple comparisons of experimental data, because comparing experimental data according to histopathological toxicity, compound, efficacy, and individual variation is crucial for finding common genetic characteristics. The presented system can be a good information source for the study of toxicological mechanisms at the genome-wide level and can also be utilized for the design of a toxicity test chip.
-
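Proteins, which are being identified through genetic research, each have their own functional characteristics and interact by influencing one another. The functional characteristic of a protein is the function it performs in an organism, and protein names are given so that they accurately reflect these functions. For proteins named according to their functional characteristics, the words that make up the protein name are also likely to carry functional characteristics similar to those of the protein. This follows from the importance of words in text-based research. In this paper, we classify the words that make up protein names by the functional characteristics of the proteins and, from the resulting function distribution, attempt to predict and judge the function of a protein in reverse.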
-
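This paper proposes a method that automatically recognizes protein names in large volumes of biomedical documents, automatically identifies the properties of each protein from the documents, and links them to an existing ontology. Because ontology terms appear in documents in various forms, we automatically convert them into logical representations, and we extract and analyze the sentences that describe protein properties and compare them with the logical representations of the ontology terms. For recognizing protein properties in documents, we propose using natural language processing techniques such as abbreviation handling and anaphora resolution.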
-
Electronically available biological literature has accumulated exponentially over time, so research on automatically acquiring knowledge from this enormous body of data with text mining technology is becoming more and more active. However, most previous research is technology-oriented and not well focused on a practical extraction target, which results in low performance and makes the systems inconvenient for biological researchers to actually use. In this paper, we propose a more biology-oriented, target-domain-specific text mining system, the POSTECH bio-text mining system (POSBIOTM), for signal transduction pathway extraction, especially for the G protein-coupled receptor (GPCR) pathway. To reflect more domain knowledge, we specify the concrete target for pathway extraction and define a minimal pathway domain ontology. Under this conceptual model, POSBIOTM extracts the interactions and entities of pathways from full-text biological articles using a machine-learning-oriented extraction method and visualizes the pathways using the JDesigner module provided in the Systems Biology Workbench (SBW) [14].
-
In this paper, we propose solutions to two of the main difficulties in named entity recognition in the biomedical domain: the large number of spelling variants and the lack of annotated corpora for training. To address spelling variants, we propose using edit distance as a feature for an SVM. To address the lack of corpora, we propose using virtual examples to expand the annotated corpus automatically. Using virtual examples, the annotated corpus can be extended in a fast, efficient and easy way. The experimental results show that introducing edit distance produces some improvement in protein name recognition performance, and that the model trained on the corpus expanded with virtual examples outperforms the model trained on the original corpus. With the proposed methods, we achieve an F-measure of 75.80 (71.89% precision, 80.15% recall) in protein name recognition on the GENIA corpus (ver. 3.0).
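The two quantities named above, edit distance and F-measure, can be sketched in Python as follows. This is a minimal illustration, not the authors' feature set: the abstract does not say how the distance is encoded as an SVM feature, so `min_lexicon_distance` and the toy lexicon are hypothetical, and unit-cost Levenshtein distance is assumed.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def min_lexicon_distance(token, lexicon):
    """Hypothetical feature: smallest edit distance from token to any lexicon entry."""
    return min(edit_distance(token, entry) for entry in lexicon)


def f_measure(precision, recall):
    """Balanced F-measure, the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy lexicon, made up for illustration.
lexicon = ["NF-kappa B", "interleukin-2"]
print(min_lexicon_distance("NF-kappaB", lexicon))
print(round(f_measure(71.89, 80.15), 2))
```

With the precision and recall reported above, `f_measure` gives approximately 75.80, which agrees with the stated F-measure.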
-
In this paper, we propose using Hotelling's $T^2$ statistic to detect a set of differentially expressed (DE) genes in colorectal cancer, based on gene expression levels in tumor tissues compared with those in normal tissues, and to evaluate its predictivity, which lets us rank genes for the development of biomarkers for population screening of colorectal cancer. We compared the prediction rates obtained with the DE genes selected by Hotelling's $T^2$ statistic and by the univariate $t$ statistic, using two prediction methods: regularized discriminant analysis and a support vector machine. The results show that the prediction rate based on $T^2$ is better than that based on the univariate $t$. This implies that it may not be sufficient to look at each gene in isolation, and that evaluating combinations of genes reveals interesting information that would not be discovered otherwise.
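For reference, the standard two-sample form of Hotelling's $T^2$ is sketched below for a set of $p$ genes measured in $n_1$ tumor and $n_2$ normal samples. The notation ($\bar{x}_1$, $\bar{x}_2$ for the group mean vectors, $S_1$, $S_2$ for the group covariance matrices, $S$ for the pooled covariance) is introduced here for illustration, and the abstract does not say whether the authors used this unpaired form or a paired variant computed on tumor-minus-normal differences.

$$
S = \frac{(n_1 - 1)\,S_1 + (n_2 - 1)\,S_2}{n_1 + n_2 - 2}, \qquad
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} (\bar{x}_1 - \bar{x}_2)
$$

For $p = 1$ this reduces to the square of the pooled-variance two-sample $t$ statistic. For $p > 1$ the off-diagonal entries of $S$ let the statistic account for correlation between genes, which a gene-by-gene $t$ test cannot do.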
-
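In this paper, we developed an integrated system that predicts the presence of an ER signal sequence and the alpha-helical transmembrane regions of a protein. Unlike existing systems, it performs these two predictions within a single integrated system, which improves prediction accuracy. We also implemented a web server (http://dblab.sejong.ac.kr/pass/index.html) so that the system can be used over the Internet.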
-
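Methods for comparing protein structures vary according to the technique used to represent the structures. Common protein structure alignment methods fall broadly into two groups: those that represent a structure at the level of atoms or residues and search for matching parts between the two represented structures, and those that represent a structure by its secondary structure elements and align the two represented structures. These comparison methods not only require a great deal of time to measure structural similarity but must also compare against ever more proteins as the data stored in the PDB grow. A method is therefore needed that can efficiently find similar protein substructures in large protein structure databases. In this paper, to perform protein structure comparison faster and more effectively, we extend PSAML, an existing structure representation based on protein secondary structure, to generate a Topology String that carries the spatial information of the secondary structures, and we describe a method that uses it to filter highly similar protein structures from a large protein structure database. The Topology String describes each secondary structure element as a single character and represents the protein structure on the basis of amino acid order and topological (spatial) information, so it can be applied effectively to find highly similar protein structures quickly before a secondary-structure-based comparison is performed.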
-
Heo, Mu-Young; Kim, Suhk-Mann; Cheon, Moo-Kyung; Chung, Kwang-Hoon; Moon, Eun-Joung; Chang, Ik-Soo
-
This paper describes a genetic algorithm for predicting RNA structures that contain various types of pseudoknots. Pseudoknotted RNA structures are much more difficult to predict by computational methods than RNA secondary structures, as they are more complex and the analysis is time-consuming. We developed an efficient genetic algorithm to predict RNA folding structures containing any type of pseudoknot, as well as a novel initial population method to decrease computational complexity and increase the accuracy of the results. We also used an interaction filter to decrease the size of the possible stem lists for long RNA sequences. We predicted RNA structures using a number of different termination conditions and compared the validity of the results and the times required for the analyses. The algorithm proved able to efficiently predict RNA structures containing various types of pseudoknots in long nucleotide sequences.
-
The prediction of protein secondary structure is an important bioinformatics tool and an essential component of the template-based protein tertiary structure prediction process. It is known that predicted secondary structure information improves both fold recognition performance and alignment accuracy. In this paper, we describe several novel ideas that may improve prediction accuracy. The main idea is motivated by the observation that a protein's structural information, especially when combined with evolutionary information, significantly improves the accuracy of the predicted tertiary structure. From a non-redundant set of protein structures, we derive 'potential' parameters for protein secondary structure prediction that contain the structural information of proteins, following a procedure similar to the one used to derive the directional information table of the GOR method. These potential parameters are combined with the frequency matrices obtained by running PSI-BLAST to construct the feature vectors used to train support vector machines (SVMs) as secondary structure classifiers. Moreover, the problem of huge model file size, one of the known shortcomings of SVMs, is partially overcome by reducing the size of the training data, filtering out redundancy not only at the protein level but also at the feature vector level. A preliminary result, measured by the average three-state prediction accuracy, is encouraging.
-
Park, Chan-Ho; Cho, Sung-Bae; Shin, Ji-Hye; Kim, Sang-Cheol; Seo, Min-Young; Yang, Sang-Hwa; Rha, Sun-Young; Chung, Hyun-Cheol
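Microarray data can be used for the early detection of cancer and the prediction of prognosis. However, such analysis requires an RNA sample of 40 ${\mu}g$ or more, and with actual clinical specimens it is difficult to obtain the required amount. RNA amplification methods, which take a small RNA sample and obtain the required amount through PCR amplification, are therefore being tried, and to use them in microarray experiments the similarity before and after amplification must be guaranteed. In this paper, we present an entropy-based method as a new way to compare the similarity of amplified RNA and total RNA. We also measured entropy values under various conditions and systematically analyzed how entropy values can be used for cell lines and tissues.
-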
A microarray information system is a complex system for managing, analyzing and interpreting microarray gene expression data. Establishing a well-defined development process is essential for understanding the complexity and organization of the system. We performed object-oriented analysis using the Unified Modeling Language (UML) to specify, visualize and document a microarray information system. The object-oriented analysis consists of three major steps: (i) use case modeling to describe various functionalities from the user's perspective, (ii) dynamic modeling to illustrate behavioral aspects of the system, and (iii) object modeling to represent structural aspects of the system. As a result of our modeling activities, we provide UML diagrams showing various views of the microarray information system. We believe that object-oriented analysis ensures effective documentation and communication of information system requirements. Another useful feature of the object-oriented technique is its structural continuity with the standard microarray data model MAGE-OM (Microarray Gene Expression Object Model). The proposed modeling efforts can be applied to the integration of biomedical information systems.
-
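As expression experiments using microarrays increase exponentially, the demand for techniques that process microarray images automatically is growing, and research on them is increasing as well. Automatic processing of microarray images requires quality measures that identify the spot pattern of each image and gauge the degree of automation. In this paper, we define a quality measure for spot patterns that helps evaluate the automation of microarray image analysis, and a quality control measure that can predict how well each experiment was carried out. To assess the quality of microarray experiments and images, we use statistics on the blocks and spots within an image, and we define quality measures for gauging the accuracy of the spots' expression values. To compute these quality measures, we use the maximal regular point set and a metagrid.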
-
In this paper, we propose an integrated Bayesian network framework to reconstruct genetic regulatory networks from genome expression data. The proposed model overcomes the dimensionality problem of multivariate analysis by building coherent sub-networks from confined gene clusters and combining these networks via intermediary points. The Gene Shaving algorithm is used to cluster genes that share a common function or co-regulation. The retrieved clusters are extended with other related genes using prior biological knowledge such as Gene Ontology, pathway, and protein-protein interaction information. With these extended gene lists, the system builds genetic sub-networks using Bayesian networks with the MDL score and the Sparse Candidate algorithm. Functional modules of genes are thus identified not only from the microarray data itself but also from well-established biological knowledge. This integrated approach can improve the reliability of a network, in that false relations due to lack of data can be reduced. Another advantage is the decreased computational complexity resulting from the constrained gene sets. To evaluate the proposed system, it is applied to S. cerevisiae cell cycle data [1]. The analysis of the results presents new hypotheses about novel genetic interactions as well as typical relationships known from previous research [2].
-
Gene expression data are quantitative measurements of the expression levels and ratios of numerous genes in different situations, based on the results of microarray image analysis. The process of drawing meaningful information related to genomic diseases and various biological activities from gene expression data is known as gene expression data analysis. In this paper, we present a hierarchical clustering method for gene expression data, based on the self-organizing map, that allows the clustering result to be analyzed more efficiently. With the proposed method, we could eliminate the uncertainty of cluster boundaries, which is an inherent disadvantage of the self-organizing map, and use the visualization function of hierarchical clustering. We could also process massive data using the fast processing speed of the self-organizing map and interpret its clustering result in a more efficient and user-friendly way. To verify the efficiency of the proposed algorithm, we performed tests with the following three data sets: an animal feature data set, a yeast gene expression data set and a leukemia gene expression data set. The results demonstrated the feasibility and utility of the proposed clustering algorithm.
-
The development of promoter recognition systems is an interesting problem in computational biology. In this paper, we introduce an intelligent system for promoter recognition with multiple decision models using artificial neural networks. We trained these models with 1871 human promoter sequences and 5230 exon and intron sequences. Our system is found to perform better than other promoter finding systems in sensitivity and specificity measures. We have tested our system with a chromosome 22 dataset.
-
With the availability of complete genome sequences such as those of the human, mouse and fugu and of chimpanzee chromosome 22, comparative analysis of large genomes across species at varying evolutionary distances is considered a powerful approach for identifying coding and functional non-coding sequences. Here we describe a fast and efficient global alignment method designed especially for large genomic regions of more than a megabase. Our approach first identifies all regions of similarity as HSPs (high-scoring segment pairs) using local alignments, and then builds large syntenic regions by extending anchors at the HSP regions in both species and using a global conservation map. Using this alignment approach, we examined rearrangement loci in human chromosome 21 and chimpanzee chromosome 22. We extracted 30 Mb of syntenic sequence between human chromosome 21 and chimpanzee chromosome 22, and then identified genomic rearrangements (deletions and insertions ranging in size from 0.3 to 200 kb). We examined all insertion/deletion (indel) events in excess of 300 bp within the chimpanzee chromosome 22 and human chromosome 21 alignments in order to identify new insertions that occurred over the last 7 million years of evolution. Finally, we discuss evolutionary features found through comparative analysis of the Ka/Ks (non-synonymous/synonymous substitution) ratio in 119 orthologous genes of chromosome 21 and 53 MHC class I genes in the human and chimpanzee genomes.
-
Applications of finding occurrences of a pattern that contains gaps include information retrieval, data mining, and computational biology. As biological sequences may contain errors, it is important to find not only the exact occurrences of a pattern but also approximate ones. In this paper we present an $O(mnk_{max}/w)$ time algorithm for the approximate gapped pattern matching problem, where $m$ is the length of the text, $n$ is the length of the pattern, $w$ is the word size of the target machine, and $k_{max}$ is the greatest error bound for subpatterns.
-
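Gene structure prediction, which predicts the gene regions in a given nucleotide sequence, is one of the important steps of a genome project and strongly affects the project as a whole. Because eukaryotic genomes have a more complex structure than prokaryotic genomes, a wider variety of gene structure prediction models has been proposed for eukaryotes than for prokaryotes. Our group developed the EGSP (Eukaryotic Gene Structure Prediction) program, taking the duration hidden Markov model as its basic form. Among the eukaryotic gene structure prediction algorithms developed to date, GenScan is reported to be the most sophisticated, so to analyze the results of EGSP we compared its predictions against those of GenScan, GeneID, and Morgan under several criteria. Although EGSP has a sophisticated prediction model, the parameters of its component modules lack refinement, so improving the model and tuning each module should yield better results.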
-
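An operon is a group of multiple adjacent genes, usually found in microorganisms, that is transcribed as a unit from a common promoter as if it were a single gene. Because the genes that make up an operon tend to be functionally similar or to lie on the same metabolic pathway, they are significant, and predicting the genes that make up operons is an important part of microbial genome analysis. Previous approaches to operon prediction include methods that use known operon features such as intergenic distance or the average number of genes per operon, methods based on microarray expression experiments, methods based on conserved gene clusters across whole genomes, and methods based on metabolic pathways. In this paper, we built pre-processing models for operon prediction using COG function distance, intergenic distance, codon usage, and a combination of COG function distance and intergenic distance. When the pre-processing models were applied to the whole genome of E. coli, about 90% of the known operons were covered. The pre-processing models presented in this paper can therefore serve as a good tool for subsequent operon prediction.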
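Of the features listed above, intergenic distance is the simplest to illustrate. The Python sketch below groups adjacent genes on the same strand into candidate operons when the gap between them is small. It is an assumption-laden illustration rather than the authors' pre-processing model: the `Gene` record, the `max_gap` threshold of 50 bp, and the toy coordinates are all hypothetical, and COG function distance and codon usage are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int   # first base of the gene
    end: int     # last base of the gene
    strand: str  # "+" or "-"


def candidate_operons(genes, max_gap=50):
    """Group neighbouring same-strand genes whose intergenic gap is at most max_gap bp."""
    groups = []
    for gene in sorted(genes, key=lambda g: g.start):
        last = groups[-1][-1] if groups else None
        same_unit = (
            last is not None
            and last.strand == gene.strand
            and gene.start - last.end <= max_gap  # overlapping genes give a negative gap
        )
        if same_unit:
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return groups


# Toy genes with made-up coordinates.
genes = [
    Gene("a", 1, 100, "+"),
    Gene("b", 120, 300, "+"),
    Gene("c", 900, 1200, "+"),
    Gene("d", 1210, 1500, "-"),
]
for group in candidate_operons(genes):
    print([g.name for g in group])
```

A permissive threshold favours coverage of known operons at the cost of extra candidates, which fits the role of a pre-processing step that a later predictor refines.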
-
Jung, Ho-Youl; Heo, Jee-Yeon; Cho, Hye-Yeung; Ryu, Gil-Mi; Lee, Ju-Young; Koh, In-Song; Kimm, Ku-Chan; Oh, Berm-Seok
This paper presents a novel method that identifies an individual's haplotypes from given genotypes. Because of the limitations of conventional single-locus analysis, haplotypes have gained increasing attention in the mapping of complex-disease genes. Conventionally, there are two approaches to resolving an individual's haplotypes. One is molecular haplotyping, which has many potential limitations in cost and convenience. The other is in-silico haplotyping, which phases the haplotypes from diploid genotyped populations and is a cost-effective, high-throughput method. In-silico haplotyping is divided into two sub-categories: statistical and computational methods. The former computes the frequencies of the common haplotypes and then resolves the individual's haplotypes. The latter directly resolves the individual's haplotypes using the perfect phylogeny model first proposed by Dan Gusfield [7]. Our method combines the two approaches in order to improve both the accuracy and the running time. The individuals' haplotypes are resolved by considering the MLE (Maximum Likelihood Estimation) in the process of computing the frequencies of the common haplotypes.
-
Research on the basic technology for constructing the human haplotype map is one of the active areas in post-genomic SNP research. Identifying the haplotype block structure from haplotype data is a key step in the haplotype map project. Several algorithms have been proposed for block identification, including a greedy algorithm and a dynamic programming based algorithm. This paper analyzes the block partitioning methods of several algorithms proposed in recent years. HapBlock and HaploBlockFinder are the programs used in our experiment.
-
Summary: The analysis of human genetic variation is one of the key issues for understanding differences in drug response among individuals, and many programs have been developed for this purpose. However, currently available public programs have many limitations, such as high time complexity when analyzing large numbers of alleles or SNPs, difficult installation, data import, and usage, and low-quality visual output. Here we present SNPAnalyzer, a workbench for SNP analysis. SNPAnalyzer consists of three main modules: 1) Hardy-Weinberg Equilibrium, 2) Haplotype Estimation, and 3) Linkage Disequilibrium. Each module offers several widely used algorithms for extensive analysis and can handle large numbers of alleles and SNPs in a simple format. Analysis results are displayed in user-friendly formats such as tables, graphs, and maps. SNPAnalyzer is developed in C and C++, and users can easily access it through a web interface. Availability: SNPAnalyzer is freely available at http://www.istech.info/istech/board/login_form.jsp -
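Two of the analyses named in the SNPAnalyzer abstract above, the Hardy-Weinberg equilibrium test and linkage disequilibrium, have standard textbook forms that can be sketched briefly. The code below is not SNPAnalyzer's implementation (which the abstract says is written in C and C++) and does not reproduce its algorithms. The function names and inputs are assumptions, and the LD function assumes phased haplotype counts are already available.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 d.o.f.).

    Arguments are observed genotype counts for a biallelic SNP.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise LD from phased haplotype counts. Returns (D, D', r^2).

    D' keeps the sign of D; monomorphic SNPs give D' = r^2 = 0 by convention here.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Toy counts (assumed): a sample with no heterozygotes, and two perfectly linked SNPs.
chi2, p_value = hwe_chi_square(50, 0, 50)
d, d_prime, r2 = linkage_disequilibrium(50, 0, 0, 50)
```

For small samples or rare alleles an exact test is usually preferred over the chi-square approximation. When only unphased genotypes are available, haplotype frequencies must first be estimated, which is the role of the Haplotype Estimation module.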
There are currently about 6000 bacterial species with validly published names, but scientists assume that these may represent less than 1% of the bacterial species present on Earth. Microbial resources are among the most important bioresources in the bioindustry and have high economic value. To find the missing species, studies of the metagenome, metabolome, and proteome of microbes have recently started in developed countries. We have constructed an information system that integrates and manages information on microbial genome resources to support efficient research on microbial genome applications, and we have named it the 'Bio-Meta Information System (Bio-MIS)'. Bio-MIS consists of an integrated microbial genome resources database, a microbial genome resources input system, an integrated microbial genome resources search engine, an on-line microbial resources distribution system, and Internet-based portal service and management. In the future, we will add connections to public databases and implement useful bioinformatics software for analyzing microbial genome resources. The web site is accessible at http://biomis.probionic.com
-
Metabolic engineering has become a new paradigm for the more efficient production of desired bioproducts. It can be defined as the directed modification of cellular metabolism and properties through the introduction, deletion, and modification of metabolic pathways using recombinant DNA and other molecular biological tools. During the last decade, metabolic flux analysis (MFA) has become an essential tool for metabolic engineering. In MFA, intracellular metabolic fluxes are quantified by combining measured extracellular metabolite concentrations with the stoichiometry of intracellular reactions and mass balances. The usefulness and functionality of MFA are demonstrated by applying it to metabolic pathways in E. coli. First, a large-scale in silico E. coli model was constructed, and then the effects of carbon sources on intracellular flux distributions and succinic acid production were investigated on the basis of the uptake and secretion rates of the relevant metabolites. The results indicated that succinic acid yields increased in the order gluconate, glucose, and sorbitol. Acetic acid and lactic acid were produced as major products when gluconate and glucose were used as carbon sources. Among the three carbon sources, sorbitol, the most reduced substrate, gave the most efficient succinic acid production.
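The mass-balance principle behind MFA can be shown on a toy network. At steady state the stoichiometric matrix S and flux vector v satisfy S v = 0, so once enough exchange fluxes are measured, the remaining fluxes follow from a linear solve. The sketch below is not the large-scale E. coli model from the abstract. The four-reaction network, the measured values, and the function names are all hypothetical, and the code assumes the system is exactly determined (as many balances as unknown fluxes).

```python
def solve_linear(a, b):
    """Solve a square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular or underdetermined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def estimate_fluxes(stoich, measured):
    """Fill in unmeasured fluxes from steady-state balances S v = 0.

    stoich: rows are internal metabolites, columns are reactions.
    measured: dict mapping reaction index to its measured flux.
    """
    n_rxn = len(stoich[0])
    unknown = [j for j in range(n_rxn) if j not in measured]
    # Move measured terms to the right-hand side: S_u v_u = -S_m v_m.
    a = [[row[j] for j in unknown] for row in stoich]
    b = [-sum(row[j] * v for j, v in measured.items()) for row in stoich]
    solved = dict(zip(unknown, solve_linear(a, b)))
    return [measured[j] if j in measured else solved[j] for j in range(n_rxn)]


# Hypothetical network: v0: -> A (uptake), v1: A -> B, v2: A -> byproduct, v3: B -> product.
stoich = [
    [1, -1, -1, 0],  # balance on A
    [0, 1, 0, -1],   # balance on B
]
measured = {0: 10.0, 2: 4.0}  # assumed uptake and byproduct secretion rates
fluxes = estimate_fluxes(stoich, measured)
```

Genome-scale models usually have far more reactions than balances, so the system is underdetermined and is typically handled by adding an objective and solving a linear program rather than by a direct solve as in this sketch.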
-
Lee, Hak-Joo;Song, Ji-Young;Lee, Keun-Jun;Park, Sung-Yong;Jung, Sung-Won;Yang, Ji-Hoon;Nang, Jong-Ho
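Advances in networking have exposed several problems in existing software architectures. To cope with this environment, we propose a bio-inspired, network-based adaptive and survivable system that mimics the structure of an ecosystem. Among the many properties of ecosystems, we aim to model adaptability, scalability, and survivability. The system consists of several layers that embody these properties: an Ecogent layer, whose Ecogents act as independent entities; an Ecogent platform layer, which supports Ecogent activity; and a platform collaboration layer for efficient use of the network. This paper examines the specific functions and composition of the system and its areas of application. -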
Most currently known molecular structures were determined by X-ray crystallography or Nuclear Magnetic Resonance (NMR). These methods generate a large amount of structure data, even for small molecules, consisting mainly of three-dimensional atomic coordinates. Such data are useful for analyzing molecular structure, but higher-level structural elements are also needed for a complete understanding of structure, and especially for structure prediction. Computational approaches exist for identifying secondary structural elements in proteins from atomic coordinates. However, similar methods have not been developed for RNA, due in part to the very small amount of structure data available so far, and extracting the structural elements of RNA requires substantial manual work. Since the number of three-dimensional RNA structures is increasing, a more systematic and automated method is needed. We have developed a set of algorithms for recognizing secondary and tertiary structural elements in RNA molecules and in the protein-RNA structures in the Protein Data Bank (PDB). The present work represents the first attempt at extracting RNA structure elements from atomic coordinates in structure dat