• Title/Abstract/Keywords: biological dataset

126 search results (processing time: 0.02 s)

OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Liu, Yusha;Yao, Xinzhi;Xia, Jingbo
    • Genomics & Informatics / Vol. 19, No. 3 / pp. 27.1-27.4 / 2021
  • Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pretrained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
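
The dictionary-based annotation step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the dictionary entries, identifiers, and the `annotate` helper are hypothetical placeholders, and simple word-boundary string matching is assumed.

```python
import re

# Hypothetical toy dictionary: surface form -> database identifier.
# These names and IDs are illustrative placeholders, not OryzaGP entries.
GENE_DICT = {
    "GeneA": "DB:0001",
    "Protein B": "DB:0002",
}


def annotate(text, dictionary):
    """Return (start, end, mention, identifier) for every dictionary hit."""
    spans = []
    # Try longer names first so multi-word names win over shorter ones.
    for name in sorted(dictionary, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            # Skip matches that overlap an already accepted span.
            if any(m.start() < e and s < m.end() for s, e, _, _ in spans):
                continue
            spans.append((m.start(), m.end(), m.group(0), dictionary[name]))
    return sorted(spans)


abstract = "GeneA interacts with Protein B in rice."
hits = annotate(abstract, GENE_DICT)
```

Each hit carries character offsets plus an identifier, so the same structure covers both recognition (the span) and normalisation (the linked ID).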

Development of Korean Medicine Data Center (KDC) Teaching Dataset to Enhance Utilization of KDC

  • 백영화;이시우
    • Journal of Sasang Constitutional Medicine / Vol. 29, No. 3 / pp. 242-247 / 2017
  • Objective: The Korean Medicine Data Center (KDC) has established large-scale biological and clinical data based on Korean medicine to demonstrate and validate its theory. The aim of this study was to develop a KDC teaching dataset and user guideline to improve utilization of the KDC. Method: The KDC teaching dataset was selected using stratified random sampling according to Sasang constitution (SC). The dataset included 72 variables for 500 sample subjects. The user guideline described how to conduct eight statistical analysis methods using the teaching dataset. Results: The KDC teaching dataset was sampled from 200 (40%) Taeeumin, 125 (25%) Soeumin, and 175 (35%) Soyangin. It consisted of a questionnaire (basic, habit, disease, symptom), physical exam (body measurement, blood pressure), blood exam, and experts' SC diagnosis. The user guideline provided step-by-step instructions for performing several statistical analyses with the KDC teaching dataset. Conclusion: We hope that our results will contribute to enhancing KDC utilization and understanding.
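
Stratified random sampling with fixed per-stratum quotas, as described in this abstract, can be sketched as below. The population is synthetic and its size is an assumption; only the quotas (200, 125, 175) come from the abstract. The function name, field names, and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, quotas, seed=0):
    """Draw quotas[stratum] records at random, without replacement, per stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[rec[key]].append(rec)
    sample = []
    for stratum, n in quotas.items():
        sample.extend(rng.sample(by_stratum[stratum], n))
    return sample


# Synthetic population (sizes are assumptions, not KDC figures).
labels = ["Taeeumin"] * 400 + ["Soeumin"] * 250 + ["Soyangin"] * 350
population = [{"id": i, "sc": sc} for i, sc in enumerate(labels)]

# Quotas reported in the abstract: 200 / 125 / 175 out of 500 subjects.
quotas = {"Taeeumin": 200, "Soeumin": 125, "Soyangin": 175}
teaching = stratified_sample(population, "sc", quotas)
```

Sampling within each stratum separately guarantees the 40% / 25% / 35% split, which a single simple random draw of 500 subjects would only approximate.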

A Divide-Conquer U-Net Based High-Quality Ultrasound Image Reconstruction Using Paired Dataset

  • 유민하;안치영
    • Korean Society of Medical and Biological Engineering: Journal of Biomedical Engineering Research / Vol. 45, No. 3 / pp. 118-127 / 2024
  • Deep learning methods for enhancing medical image quality commonly use unpaired datasets because acquiring paired datasets with commercial imaging systems is impractical. In this paper, we propose a supervised learning method to enhance the quality of ultrasound images. The U-net model incorporates a divide-and-conquer approach that divides an image into four parts and processes each part, to overcome data shortage and shorten the learning time. The proposed model is trained using a paired dataset consisting of 828 pairs of low-quality and high-quality images with a resolution of 512x512 pixels, obtained by varying the number of channels for the same subject. Of the 828 pairs, 684 are used as the training dataset and the remaining 144 as the test dataset. In the test results, the average Mean Squared Error (MSE) was reduced from 87.6884 in the low-quality images to 45.5108 in the restored images. Additionally, the average Peak Signal-to-Noise Ratio (PSNR) improved from 28.7550 to 31.8063, and the average Structural Similarity Index (SSIM) increased from 0.4755 to 0.8511, demonstrating significant enhancements in image quality.
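
Two ideas in this abstract lend themselves to a short sketch: splitting an image into four parts, and the relation between MSE and PSNR. The code assumes equal quadrants and 8-bit images (peak value 255); neither detail is stated in the abstract, and the function names are hypothetical.

```python
import math


def split_into_quadrants(image):
    """Split a 2D list (even height and width) into four equal quadrants."""
    half_h, half_w = len(image) // 2, len(image[0]) // 2
    top, bottom = image[:half_h], image[half_h:]
    return [
        [row[:half_w] for row in top],     # top-left
        [row[half_w:] for row in top],     # top-right
        [row[:half_w] for row in bottom],  # bottom-left
        [row[half_w:] for row in bottom],  # bottom-right
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def psnr(mse_value, peak=255.0):
    """PSNR in dB for a given MSE; the peak of 255 assumes 8-bit images."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse_value)


image = [[r * 4 + c for c in range(4)] for r in range(4)]
quadrants = split_into_quadrants(image)
```

The PSNR of an average MSE generally differs from the average of per-image PSNR values, so applying `psnr` to the reported mean MSE will not reproduce the reported mean PSNR exactly.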

OryzaGP: rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Do, Huy;Wang, Yue
    • Genomics & Informatics / Vol. 17, No. 2 / pp. 17.1-17.3 / 2019
  • Text mining has become an important research method in biology. Its original purpose is to extract biological entities, such as genes, proteins and phenotypic traits, in order to extend knowledge from scientific papers. However, few thorough studies on text mining and application development for plant molecular biology data have been performed, especially for rice, resulting in a lack of datasets available to solve named-entity recognition tasks for this species. Since few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extract information from gene/protein entities, we built a new dataset for rice as a benchmark. This dataset is composed of a set of titles and abstracts extracted from scientific papers focusing on the rice species and downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task of rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using the dataset, to facilitate an open comparison and evaluation of different approaches to the task.

The Study for Type of Mask Wearing Dataset for Deep learning and Detection Model

  • 황호성;김동현;김호철
    • Korean Society of Medical and Biological Engineering: Journal of Biomedical Engineering Research / Vol. 43, No. 3 / pp. 131-135 / 2022
  • Due to COVID-19, wearing a mask correctly is important for preventing COVID-19 and other respiratory tract infections. Meanwhile, deep learning technology for image processing has advanced. The purpose of this study is to create a dataset of mask-wearing types for deep learning models and to select a deep learning model that detects whether a mask is worn correctly. The image dataset consists of 2,296 images acquired using a web crawler. Deep learning classification models provided by TensorFlow are used to validate the dataset, and YOLO object detection models are compared to select the detection model. This paper validates the mask-wearing type dataset and proposes YOLOv5 as an effective model for detecting the type of mask wearing. The experimental results show that a reliable dataset was acquired and that the YOLOv5 model effectively recognizes the type of mask wearing.

Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method

  • Lee, Sungyoung;Park, Taesung
    • Genomics & Informatics / Vol. 16, No. 4 / pp. 39.1-39.3 / 2018
  • The rapid increase in genetic dataset volume has demanded extensive adoption of biological knowledge to reduce computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we have introduced a novel statistical method, PHARAOH, that enables pathway-based association studies of large-scale genetic datasets. However, researcher-level application of the PHARAOH method has been limited by a lack of support for generally used file formats and the absence of the quality control options that are essential to practical analysis. To overcome these limitations, we introduce our integration of the PHARAOH method into our recently developed all-in-one workbench. The proposed new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality control measures and filters based on those measures. We expect that the updated PHARAOH will make pathway-level analysis of large-scale genetic datasets more accessible to researchers.

Development of wound segmentation deep learning algorithm

  • 강현영;허연우;전재준;정승원;김지예;박성빈
    • Korean Society of Medical and Biological Engineering: Journal of Biomedical Engineering Research / Vol. 45, No. 2 / pp. 90-94 / 2024
  • Diagnosing wounds presents a significant challenge in clinical settings due to its complexity and the subjective assessments by clinicians. Wound deep learning algorithms assess wounds quantitatively, overcoming these challenges. However, a limitation of existing research is its reliance on specific datasets. To address this limitation, we created a comprehensive dataset by combining open datasets with a self-produced dataset to enhance clinical applicability. In the annotation process, machine learning based on Gradient Vector Flow (GVF) was used to improve objectivity and efficiency over time. The deep learning model was a U-net equipped with residual blocks. Significant improvements were observed when the input images were cropped to contain only the wound region of interest (ROI), as opposed to the original-sized images. As a result, the Dice score increased from 0.80 using the original dataset to 0.89 using the wound ROI crop dataset. This study highlights the need for diverse research using comprehensive datasets. In future studies, we aim to further enhance and diversify our dataset to encompass different environments and ethnicities.
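
The Dice score reported in this abstract measures overlap between a predicted mask and a reference mask. A minimal sketch follows, assuming binary masks given as nested lists of 0/1; the function name and the convention for two empty masks are assumptions, not details from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    intersection = sum(
        p & t for row_p, row_t in zip(pred, target) for p, t in zip(row_p, row_t)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # one shared pixel, three foreground pixels in total
```

Because the score depends only on foreground pixels, cropping to the wound ROI changes the foreground-to-background balance the model sees without changing how the metric itself is computed.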

Supervised Model for Identifying Differentially Expressed Genes in DNA Microarray Gene Expression Dataset Using Biological Pathway Information

  • Chung, Tae Su;Kim, Keewon;Kim, Ju Han
    • Genomics & Informatics / Vol. 3, No. 1 / pp. 30-34 / 2005
  • Microarray technology makes it possible to measure the expression of tens of thousands of genes simultaneously under various experimental conditions. Identifying differentially expressed genes (DEGs) in each experimental condition is one of the most common first steps in microarray gene expression data analysis. Reasonable choices of thresholds for determining DEGs, with suitable statistical significance, are needed for the next-step analysis. We present a supervised model for identifying DEGs using pathway information based on the global connectivity structure. Pathway information can be regarded as a collection of biological knowledge, so we try to determine the optimal threshold such that the resulting connectivity structure is the most compatible with the existing pathway information. The significant feature of our model is that it uses established knowledge as a reference to guide the analysis of microarray datasets. In most previous work, only intrinsic information in the microarray is used for identifying DEGs. We hope that our proposed method can contribute to constructing biologically meaningful structures from microarray datasets.

Identification of Combined Biomarker for Predicting Alzheimer's Disease Using Machine Learning

  • Ki-Yeol Kim
    • Korean Journal of Biological Psychiatry / Vol. 30, No. 1 / pp. 24-30 / 2023
  • Objectives: Alzheimer's disease (AD) is the most common form of dementia in older adults, damaging the brain and resulting in impaired memory, thinking, and behavior. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. The aim of our study was to identify differentially expressed genes associated with AD and combined biomarkers among them to improve AD risk prediction accuracy. Methods: Machine learning methods were used to compare the performance of the identified combined biomarkers. In this study, three publicly available gene expression datasets from the hippocampal brain region were used. Results: We detected 31 significant common genes from two different microarray datasets using the limma package. Some of them belonged to 11 biological pathways. Combined biomarkers were identified in two microarray datasets and were evaluated in a different dataset. The performance of the predictive models using the combined biomarkers was superior to that of models using a single gene. When two genes were combined, the most predictive gene set in the evaluation dataset was ATR and PRKCB when linear discriminant analysis was applied. Conclusions: Combined biomarkers showed good performance in predicting the risk of AD. The constructed predictive nomogram using combined biomarkers could easily be used by clinicians to identify high-risk individuals so that more efficient trials could be designed to reduce the incidence of AD.

LD-based tagSNP Selection System for Large-scale Haplotype and Genotype Datasets

  • Kim, Sang-Jun;Yeo, Sang-Soo;Kim, Sung-Kwon
    • Korean Society for Bioinformatics: Conference Proceedings / Korean Society for Bioinformatics and Systems Biology, 2004, The 3rd Annual Conference for The Korean Society for Bioinformatics Association of Asian Societies for Bioinformatics 2004 Symposium / pp. 279-285 / 2004
  • In disease association studies, the tagSNP selection problem is important in terms of time and cost. We developed a new tagSNP selection system that also has facilities for haplotype reconstruction and missing data processing. In our system, we improved biological meaning by using LD coefficients as well as a dynamic programming method. Our system is capable of processing large-scale datasets, such as all the SNPs on a chromosome. We have tested our system with various datasets, including those from Daly et al., Patil et al., the HapMap Project, and artificial datasets.
