• Title/Summary/Keyword: Biomedical Text Mining

OryzaGP: rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre; Do, Huy; Wang, Yue
    • Genomics & Informatics / v.17 no.2 / pp.17.1-17.3 / 2019
  • Text mining has become an important research method in biology, originally aimed at extracting biological entities, such as genes, proteins, and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies on text mining and application development for plant molecular biology data have been performed, especially for rice, resulting in a lack of datasets available for named-entity recognition tasks in this species. Because benchmarks for rice are rare, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches for automatically extracting information on gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts extracted from scientific papers focusing on the rice species, downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using this dataset, to facilitate open comparison and evaluation of different approaches to the task.
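
The OryzaGP pipeline itself is not reproduced here, but as a rough illustration of the collection step the abstract describes, the sketch below pulls rice titles and abstracts from PubMed through the NCBI E-utilities and wraps them as PubAnnotation-style documents with an empty annotation list. The query string and retmax value are placeholders, not the authors' settings.

```python
"""Minimal sketch (not the authors' pipeline): collect rice titles/abstracts
from PubMed via NCBI E-utilities and store them as PubAnnotation-style
JSON documents ready for manual or automatic NER annotation."""
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query, retmax=20):
    # ESearch returns the PubMed IDs matching the query.
    url = (f"{EUTILS}/esearch.fcgi?db=pubmed&retmode=json"
           f"&retmax={retmax}&term={urllib.parse.quote(query)}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["esearchresult"]["idlist"]

def fetch_abstracts(pmids):
    # EFetch returns article XML; we keep only title + abstract text.
    url = f"{EUTILS}/efetch.fcgi?db=pubmed&retmode=xml&id={','.join(pmids)}"
    with urllib.request.urlopen(url) as resp:
        root = ET.parse(resp).getroot()
    for art in root.iter("PubmedArticle"):
        pmid = art.findtext(".//PMID")
        title = art.findtext(".//ArticleTitle") or ""
        abstract = " ".join(t.text or "" for t in art.iter("AbstractText"))
        # PubAnnotation documents pair raw text with "denotations" that
        # annotators later fill with entity spans.
        yield {"sourcedb": "PubMed", "sourceid": pmid,
               "text": f"{title} {abstract}".strip(), "denotations": []}

if __name__ == "__main__":
    ids = search_pubmed("Oryza sativa[Title/Abstract] AND gene", retmax=5)
    for doc in fetch_abstracts(ids):
        print(json.dumps(doc)[:120], "...")
```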

COVID-19 recommender system based on an annotated multilingual corpus

  • Barros, Marcia; Ruas, Pedro; Sousa, Diana; Bangash, Ali Haider; Couto, Francisco M.
    • Genomics & Informatics / v.19 no.3 / pp.24.1-24.7 / 2021
  • Tracking the most recent advances in coronavirus disease 2019 (COVID-19) research is essential, given the disease's novelty and its impact on society. However, with the pace of publication speeding up, researchers and clinicians require automatic approaches to keep up with the incoming information regarding this disease. A solution to this problem requires the development of text mining pipelines, whose efficiency strongly depends on the availability of curated corpora. However, COVID-19-related corpora are scarce, even more so for languages other than English. This project's main contribution was the annotation of a multilingual parallel corpus and the generation of recommendation datasets (EN-PT and EN-ES) covering relevant entities, their relations, and recommendations, providing the community with a resource to improve text mining research on the COVID-19-related literature. This work was developed during the 7th Biomedical Linked Annotation Hackathon (BLAH7).
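
The abstract does not specify how the recommender works; as a minimal sketch of entity-based recommendation over an annotated corpus, the snippet below ranks documents by Jaccard overlap between their entity sets. All document IDs and entity labels are invented for illustration.

```python
"""Minimal sketch (not the paper's system): recommend related documents by
Jaccard similarity between their sets of annotated entities."""

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0.0 when both are empty.
    return len(a & b) / len(a | b) if (a or b) else 0.0

def recommend(target: set, corpus: dict, k: int = 3):
    # Rank every document by entity overlap with the target entity set.
    scored = sorted(((jaccard(target, ents), doc_id)
                     for doc_id, ents in corpus.items()),
                    reverse=True)
    return scored[:k]

corpus = {  # hypothetical annotated documents: id -> entity set
    "doc1": {"SARS-CoV-2", "ACE2", "spike protein"},
    "doc2": {"SARS-CoV-2", "remdesivir"},
    "doc3": {"influenza", "oseltamivir"},
}
print(recommend({"ACE2", "SARS-CoV-2"}, corpus))
```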

Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong; Seo, Hwa-Jeong; Park, Yu-Rang; Kim, Ji-Hun; Bien, Sang Jay; Kim, Ju-Han
    • Genomics & Informatics / v.9 no.1 / pp.19-27 / 2011
  • The Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and it has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation, so it is hard to know whether, and in what way, preprocessing has been applied to a dataset. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis, and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata onto MIAME elements. We also discriminated non-preprocessed raw datasets from processed ones by using a two-step classification method. Most of the procedures were developed as semi-automated algorithms incorporating text mining techniques. We localized 2,967 Platforms, 4,867 Series, and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema, and developed a comprehensive query interface for data extraction. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.
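
The abstract names a two-step classification of raw versus processed samples without detailing it. The sketch below is a plausible stand-in, not the paper's method: step 1 looks for processing keywords in the data-table header, and step 2 falls back to the value distribution (raw intensities tend to be large and non-negative, while log-ratios cluster around zero).

```python
"""Illustrative sketch only; the paper's actual two-step classifier is not
reproduced here. The keyword list and thresholds are assumptions."""
import statistics

PROCESSED_HINTS = ("log", "ratio", "normalized", "zscore")  # assumed keywords

def classify_sample(column_labels, values):
    # Step 1: textual evidence in the data-table header.
    labels = " ".join(column_labels).lower()
    if any(hint in labels for hint in PROCESSED_HINTS):
        return "processed"
    # Step 2: distributional evidence from the values themselves.
    if min(values) < 0 or abs(statistics.mean(values)) < 1.0:
        return "processed"  # signed / zero-centered values suggest log-ratios
    return "raw"

print(classify_sample(["ID_REF", "VALUE"], [532.1, 1204.8, 87.3]))   # raw
print(classify_sample(["ID_REF", "LOG_RATIO"], [-0.4, 0.1, 0.8]))    # processed
```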

Association Analysis of Reactive Oxygen Species-Hypertension Genes Discovered by Literature Mining

  • Lim, Ji Eun; Hong, Kyung-Won; Jin, Hyun-Seok; Oh, Bermseok
    • Genomics & Informatics / v.10 no.4 / pp.244-248 / 2012
  • Oxidative stress, which results in excessive production of reactive oxygen species (ROS), is one of the fundamental mechanisms in the development of hypertension. In the vascular system, ROS play physiological and pathophysiological roles in vascular remodeling and endothelial dysfunction. In this study, ROS-hypertension-related genes were collected with biological literature-mining tools, such as SciMiner and gene2pubmed, in order to identify genes that could cause hypertension through ROS. Single nucleotide polymorphisms (SNPs) located within these gene regions were then tested statistically for association with hypertension in 6,419 Korean individuals, and pathway enrichment analysis was performed on the associated genes. The 2,945 SNPs of 237 ROS-hypertension genes were analyzed, and 68 genes were significantly associated with hypertension (p < 0.05). The most significant SNP was rs2889611 within MAPK8 (p = 2.70 × 10⁻⁵; odds ratio, 0.82; confidence interval, 0.75 to 0.90). This study demonstrates that a text mining approach combined with association analysis may be useful for identifying candidate genes that cause hypertension through ROS or oxidative stress.
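
As a worked illustration of the kind of single-SNP association test summarized above, the sketch below computes an allelic odds ratio and chi-square p-value from a 2×2 allele-count table. The counts are invented, not the study's data, and scipy is assumed to be available.

```python
"""Minimal sketch of a single-SNP case/control association test
(allelic chi-square with odds ratio) on hypothetical allele counts."""
from scipy.stats import chi2_contingency

def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    # 2x2 table of allele counts: rows = case/control, cols = alt/ref.
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    # chi2_contingency applies Yates' continuity correction for 2x2 tables.
    chi2, p, _, _ = chi2_contingency(table)
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return odds_ratio, p

# Hypothetical allele counts for one SNP:
odds, p = allelic_association(480, 1520, 610, 1390)
print(f"OR = {odds:.2f}, p = {p:.2e}")
```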

Currents in Integrative Biochip Informatics

  • Kim, Ju-Han
    • Proceedings of the Korean Society for Bioinformatics Conference / 2001.10a / pp.1-9 / 2001
  • A flood of large-scale genomic and postgenomic data means that many of the challenges in biomedical research are now challenges in computational sciences and information technology. The informatics revolutions in both clinical informatics and bioinformatics will change the current paradigm of biomedical sciences and the practice of clinical medicine, including diagnostics, therapeutics, and prognostics. Postgenome informatics, powered by high-throughput technologies and genomic-scale databases, is likely to transform our biomedical understanding forever, much the same way that biochemistry did a generation ago. In this talk, I will describe how these technologies will impact biomedical research and clinical care, emphasizing recent advances in biochip-based functional genomics. Basic data preprocessing with normalization and filtering, primary pattern analysis, and machine learning algorithms will be presented. Issues of integrated biochip informatics technologies, including multivariate data projection, gene-metabolic pathway mapping, automated biomolecular annotation, text mining of factual and literature databases, and integrated management of biomolecular databases, will be discussed. Each step will be given with real examples from ongoing research activities in the context of clinical relevance. Issues of linking molecular genotype and clinical phenotype information will be discussed.
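
The talk mentions normalization of biochip data without naming a method; quantile normalization is one common choice, sketched here on a toy genes-by-arrays matrix. Ties are handled crudely, which is acceptable for illustration.

```python
"""Hedged illustration: one common normalization method for expression
matrices, not necessarily the one used in the talk."""
import numpy as np

def quantile_normalize(x: np.ndarray) -> np.ndarray:
    # Rank each value within its array (column), average the sorted values
    # across arrays rank-by-rank, then give each value the mean of its rank.
    ranks = x.argsort(axis=0).argsort(axis=0)        # per-column ranks
    mean_by_rank = np.sort(x, axis=0).mean(axis=1)   # reference distribution
    return mean_by_rank[ranks]

expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])
print(quantile_normalize(expr))
```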

Bioinformatics and Genomic Medicine (생명정보학과 유전체의학)

  • Kim, Ju-Han
    • Journal of Preventive Medicine and Public Health / v.35 no.2 / pp.83-91 / 2002
  • Bioinformatics is a rapidly emerging field of biomedical research. A flood of large-scale genomic and postgenomic data means that many of the challenges in biomedical research are now challenges in computational sciences. Clinical informatics has long developed methodologies to improve biomedical research and clinical care by integrating experimental and clinical information systems. The informatics revolutions in both bioinformatics and clinical informatics will eventually change the current practice of medicine, including diagnostics, therapeutics, and prognostics. Postgenome informatics, powered by high-throughput technologies and genomic-scale databases, is likely to transform our biomedical understanding forever, much the same way that biochemistry did a generation ago. The paper describes how these technologies will impact biomedical research and clinical care, emphasizing recent advances in biochip-based functional genomics and proteomics. Basic data preprocessing with normalization, primary pattern analysis, and machine learning algorithms will be presented. The use of integrated biochip informatics technologies, text mining of factual and literature databases, and integrated management of biomolecular databases will be discussed. Each step will be given with real examples in the context of clinical relevance. Issues of linking molecular genotype and clinical phenotype information will be discussed.

Effective text visualization for biomedical information (생물 의료 정보의 효과적인 텍스트 시각화)

  • Kim, Tak-Eun; Park, Jong-C.
    • Proceedings of the HCI Society of Korea Conference (한국HCI학회 학술대회논문집) / 2007.02a / pp.399-405 / 2007
  • The amount of information in the biomedical field is growing very rapidly. Many studies have applied text mining techniques to extract useful information from this vast body of text. However, even the extracted information is so voluminous, and remains in textual form, that it is difficult to grasp intuitively. An information visualization system is therefore essential for understanding such information more intuitively. Although much recent research has addressed information visualization, even the visualized information can be overwhelming, so methods are needed to filter it down to what the user requires, and the visualization system should also support knowledge discovery. In this paper, focusing on text visualization of biomedical information, we propose an effective representation method for biomedical information and an intuitive interface for knowledge discovery.

Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data at BLAH5

  • Ferre, Arnaud; Ba, Mouhamadou; Bossy, Robert
    • Genomics & Informatics / v.17 no.2 / pp.20.1-20.5 / 2019
  • Entity normalization, or entity linking in the general domain, is an information extraction task that aims to annotate/bind multiple words/expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms that captures knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance, but they require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well to very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we assess different methods to reduce the dimensionality of the ontology representation. We also propose calibrating parameters to make the predictions more accurate, and a specific method to address the problem of out-of-vocabulary words.
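
As a minimal sketch of the core idea attributed to CONTES above, a supervised linear map from term-embedding space to an ontology-concept space followed by a nearest-concept lookup, the snippet below uses toy random vectors; it is not the authors' implementation.

```python
"""Minimal sketch of supervised entity normalization: learn a linear map
from term embeddings to ontology-concept vectors, then normalize new
terms to their nearest concept. All vectors here are toy data."""
import numpy as np

rng = np.random.default_rng(0)
dim_term, dim_conc = 8, 4

# Toy training data: term embeddings paired with their gold concept vectors.
concepts = {"C1": rng.normal(size=dim_conc), "C2": rng.normal(size=dim_conc)}
train_terms = rng.normal(size=(10, dim_term))
train_gold = np.array([concepts["C1" if i < 5 else "C2"] for i in range(10)])

# Fit W minimizing ||terms @ W - gold||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(train_terms, train_gold, rcond=None)

def normalize_entity(term_vec):
    # Project the term into concept space, then pick the closest concept
    # by cosine similarity.
    proj = term_vec @ W
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(concepts, key=lambda c: cos(proj, concepts[c]))

print(normalize_entity(train_terms[0]))  # -> "C1" on the training data
```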

Towards cross-platform interoperability for machine-assisted text annotation

  • de Castilho, Richard Eckart; Ide, Nancy; Kim, Jin-Dong; Klie, Jan-Christoph; Suderman, Keith
    • Genomics & Informatics / v.17 no.2 / pp.19.1-19.10 / 2019
  • In this paper, we investigate cross-platform interoperability for natural language processing (NLP) and, in particular, annotation of textual resources, with an eye toward identifying the design elements of annotation models and processes that are particularly problematic for, or amenable to, enabling seamless communication across different platforms. The study is conducted in the context of a specific annotation methodology, namely machine-assisted interactive annotation (also known as human-in-the-loop annotation). This methodology requires the ability to freely combine resources from different document repositories, access a wide array of NLP tools that automatically annotate corpora for various linguistic phenomena, and use a sophisticated annotation editor that enables interactive manual annotation coupled with on-the-fly machine learning. We consider three independently developed platforms, each of which utilizes a different model for representing annotations over text, and each of which performs a different role in the process.
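
None of the three platforms' code appears in the abstract; as a small illustration of the kind of format conversion that cross-platform interoperability requires, the sketch below turns a PubAnnotation-style JSON document into brat-like standoff lines. The example text and annotation are invented.

```python
"""Illustrative sketch (not code from any of the three platforms): convert
a PubAnnotation-style JSON document into brat-like standoff annotation."""
import json

pubann = json.loads("""{
  "text": "MAPK8 is associated with hypertension.",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Gene"}
  ]
}""")

def to_standoff(doc):
    # brat standoff line per annotated span: "T1\\tGene 0 5\\tMAPK8".
    for d in doc["denotations"]:
        b, e = d["span"]["begin"], d["span"]["end"]
        yield f'{d["id"]}\t{d["obj"]} {b} {e}\t{doc["text"][b:e]}'

print("\n".join(to_standoff(pubann)))
```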