• Title/Abstract/Keywords: Large Dataset

Development of Korean Medicine Data Center(KDC) Teaching Dataset to Enhance Utilization of KDC (한의임상정보은행 활용도 제고를 위한 교육용 데이터 개발)

  • Baek, Younghwa;Lee, Siwoo
    • Journal of Sasang Constitutional Medicine / v.29 no.3 / pp.242-247 / 2017
  • Objective: The Korean medicine Data Center (KDC) has established large-scale biological and clinical data based on Korean medicine to demonstrate and validate its theory. The aim of this study was to develop a KDC teaching dataset and user guideline to improve utilization of the KDC. Methods: The KDC teaching dataset was selected using stratified random sampling according to Sasang constitution (SC). The dataset included 72 variables for 500 sample subjects. The user guideline described how to conduct eight statistical analysis methods using the teaching dataset. Results: The KDC teaching dataset was sampled from 200 (40%) Taeeumin, 125 (25%) Soeumin, and 175 (35%) Soyangin subjects. It consisted of questionnaire data (basic, habit, disease, symptom), physical exam data (body measurement, blood pressure), blood exam data, and experts' SC diagnoses. The user guideline provided step-by-step instructions for performing several statistical analyses with the KDC teaching dataset. Conclusion: We hope that our results will contribute to enhancing KDC utilization and understanding.
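
The stratified random sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the synthetic population, the field name `constitution`, and the helper `stratified_sample` are hypothetical. Only the stratum sizes (200, 125, 175) come from the abstract.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, quotas, seed=0):
    """Draw a fixed number of records from each stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for label, n in quotas.items():
        if len(strata[label]) < n:
            raise ValueError(f"stratum {label!r} has fewer than {n} records")
        sample.extend(rng.sample(strata[label], n))
    return sample


# Hypothetical population: 3000 subjects with evenly assigned constitution types.
TYPES = ("Taeeumin", "Soeumin", "Soyangin")
population = [{"id": i, "constitution": TYPES[i % 3]} for i in range(3000)]

# Stratum sizes reported in the abstract (500 subjects in total).
quotas = {"Taeeumin": 200, "Soeumin": 125, "Soyangin": 175}
teaching_set = stratified_sample(population, "constitution", quotas)
```

Fixing the seed keeps the sample reproducible, which matters for a teaching dataset that is meant to give every user the same rows.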

LD-based tagSNP Selection System for Large-scale Haplotype and Genotype Datasets (대용량의 Haplotype과 Genotype데이터에 대한 LD기반의 tagSNP 선택 시스템)

  • Kim, Sang-Jun;Yeo, Sang-Soo;Kim, Sung-Kwon
    • Proceedings of the Korean Society for Bioinformatics Conference / 2004.11a / pp.279-285 / 2004
  • In disease association studies, the tagSNP selection problem is important from the standpoint of time and cost. We developed a new tagSNP selection system that also provides facilities for haplotype reconstruction and missing data processing. In our system, we improved biological meaningfulness by using LD coefficients together with a dynamic programming method. Our system is also capable of processing large-scale datasets, such as all the SNPs on a chromosome. We have tested our system with various datasets, including those from Daly et al., Patil et al., and the HapMap Project, as well as artificial datasets.
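
The abstract does not say which LD coefficient the system uses. As one common choice, the sketch below computes the pairwise r² between two biallelic SNPs from phased haplotypes. The function name `ld_r2`, the 0/1 allele encoding, and the toy haplotypes are assumptions made for illustration only.

```python
def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between SNP i and SNP j.

    haplotypes: list of equal-length sequences of 0/1 alleles (phased).
    Returns 0.0 when either SNP is monomorphic, where r^2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    return d * d / denom


# Toy data: SNP 0 and SNP 1 always co-occur; SNP 2 is independent of SNP 0.
haps = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
r2_linked = ld_r2(haps, 0, 1)
r2_unlinked = ld_r2(haps, 0, 2)
```

A SNP pair with r² close to 1 is nearly redundant, so one of the two can act as a tag for the other. That is the intuition behind LD-based tagSNP selection.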

Multi-faceted Image Dataset Construction Method Based on Rotational Images. (회전 영상 기반 다면 영상 데이터셋 구축 방법)

  • Kim, Ji-Seong;Heo, Gyeongyong;Jang, Si-Woong
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.75-77 / 2021
  • Finding objects in an image with deep learning requires an image dataset for training, and raising the object recognition rate requires a large amount of training images. Building a large dataset is difficult for individuals because it is expensive. This paper introduces a method for more easily constructing an image dataset that covers several sides of an object by photographing the object as it rotates. The proposed method constructs a dataset by placing an object on a rotating plate, photographing it, and dividing and synthesizing the captured images as needed.
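
One way to picture the "dividing" step is to pick evenly spaced frames from a recording of one full rotation. The sketch below assumes a constant rotation speed and a clip that covers exactly 360 degrees. The helper `view_frame_indices` and its parameters are hypothetical and are not taken from the paper.

```python
def view_frame_indices(num_frames, num_views):
    """Indices of frames spaced evenly over one full rotation.

    Assumes the clip spans exactly one turn at constant speed, so frame k
    corresponds to an angle of 360 * k / num_frames degrees.
    """
    if not 0 < num_views <= num_frames:
        raise ValueError("num_views must be between 1 and num_frames")
    return [(k * num_frames) // num_views for k in range(num_views)]


# Hypothetical clip: 360 frames for one turn, 8 views of the object.
indices = view_frame_indices(360, 8)
angles = [360 * idx / 360 for idx in indices]
```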

Korean Lip-Reading: Data Construction and Sentence-Level Lip-Reading (한국어 립리딩: 데이터 구축 및 문장수준 립리딩)

  • Sunyoung Cho;Soosung Yoon
    • Journal of the Korea Institute of Military Science and Technology / v.27 no.2 / pp.167-176 / 2024
  • Lip-reading is the task of inferring a speaker's utterance from silent video by learning lip movements. It is very challenging because of ambiguities inherent in lip movement, such as different characters that produce the same lip appearance. Recent advances in deep learning models such as the Transformer and the Temporal Convolutional Network have improved lip-reading performance. However, most previous work deals with English lip-reading, which is difficult to apply directly to Korean, and there is no large-scale Korean lip-reading dataset. In this paper, we introduce the first large-scale Korean lip-reading dataset, with more than 120k utterances collected from TV broadcasts including news, documentaries, and dramas. We also present a preprocessing method that uniformly extracts a facial region of interest, and we propose a Transformer-based model that operates on grapheme units for sentence-level Korean lip-reading. We demonstrate that our dataset and model are appropriate for Korean lip-reading through dataset statistics and experimental results.
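
To illustrate what a grapheme-unit target sequence might look like for Korean, the sketch below splits precomposed Hangul syllables into their constituent jamo using the standard Unicode syllable arithmetic. Treating jamo as the paper's grapheme unit is an assumption, and `to_jamo` is a hypothetical helper, not the authors' tokenizer.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition scheme.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables


def to_jamo(text):
    """Split each Hangul syllable into leading, vowel, and optional trailing jamo."""
    units = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            units.append(chr(L_BASE + idx // N_COUNT))
            units.append(chr(V_BASE + (idx % N_COUNT) // T_COUNT))
            trailing = idx % T_COUNT
            if trailing:  # index 0 means the syllable has no final consonant
                units.append(chr(T_BASE + trailing))
        else:
            units.append(ch)  # non-Hangul characters pass through unchanged
    return units


units = to_jamo("한국어")
# Cross-check against the canonical decomposition provided by the standard library.
matches_nfd = units == list(unicodedata.normalize("NFD", "한국어"))
```

Working at this level keeps the output vocabulary to a few dozen symbols instead of thousands of distinct syllable blocks, which is one plausible reason to prefer grapheme units.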

Mid-level Feature Extraction Method Based Transfer Learning to Small-Scale Dataset of Medical Images with Visualizing Analysis

  • Lee, Dong-Ho;Li, Yan;Shin, Byeong-Seok
    • Journal of Information Processing Systems / v.16 no.6 / pp.1293-1308 / 2020
  • In fine-tuning-based transfer learning, the size of the dataset may affect learning accuracy. Even when the dataset is small, fine-tuning-based transfer-learning methods incur high computing costs, similar to those for a large-scale dataset. We propose a mid-level feature extractor that retrains only the mid-level convolutional layers, resulting in increased efficiency and reduced computing costs. This mid-level feature extractor is likely to provide an effective alternative for training on a small-scale medical image dataset. The performance of the mid-level feature extractor is compared with that of low- and high-level feature extractors, as well as the fine-tuning method. First, the mid-level feature extractor takes less time to converge than the other methods. Second, it shows good accuracy in validation loss evaluation. Third, it obtains an area under the ROC curve (AUC) of 0.87 on a test dataset that was not used in training and is very different from the training dataset. Fourth, it extracts clearer feature maps of the shape and parts of the chest in X-ray images than the fine-tuning method.
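
For readers unfamiliar with the AUC metric reported above, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, with ties counted as one half. The function `roc_auc` and the toy labels and scores are illustrative assumptions and do not reproduce the paper's evaluation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (O(P * N))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: four cases with binary labels and model scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive-negative pairs are ordered correctly, so the result is 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.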

Towards Texture-Based Visualization of Multivariate Dataset

  • Mehmood, Raja Majid;Lee, Hyo Jong
    • Annual Conference of KIPS / 2014.04a / pp.582-585 / 2014
  • Visualization is a science that makes the invisible visible through the techniques of experimental visualization and computer-aided visualization. This paper presents the practical aspects of visualizing multivariate datasets. We briefly discuss previous research and introduce a new visualization technique that helps us design and develop a tool for experimental visualization of multivariate datasets. Our newly developed visualization tool can be used in various domains. In this paper, we chose the software industry as the application domain and used a multivariate dataset of software components computed by VizzMaintenance. VizzMaintenance is a software analysis tool that provides multiple software metrics for open-source Java-based programs. The main objective of this research is to develop a new visualization tool for large multivariate datasets that is more efficient and easier for viewers to perceive. Perception is very important for our research, so we decided to test the perception level of our proposed visualization approach with researchers in our lab.

Generation of wind turbine blade surface defect dataset based on StyleGAN3 and PBGMs

  • W.R. Li;W.H. Zhao;T.T. Wang;Y.F. Du
    • Smart Structures and Systems / v.34 no.2 / pp.129-143 / 2024
  • In recent years, with the rapid development of vision algorithms, a large amount of research has been conducted on blade surface defect detection methods based on deep learning. Such detection methods rely on a large and rich dataset. However, the geographical location and working environment of wind turbines make it difficult to capture images of blade surface defects effectively, which hinders visual detection. To address the difficulty of collecting a surface defect dataset, a multi-class blade surface defect generation method based on the StyleGAN3 (Style Generative Adversarial Networks 3) deep learning model and PBGMs (Physics-Based Graphics Models) is proposed. Firstly, the adversarial network of the StyleGAN3 model is trained on a small number of real blade surface defect images to generate a large number of high-resolution blade surface defect images. Secondly, the generated images are processed through Matting and Resize operations to create defect foreground images. These foregrounds are randomly fused with blade background images produced using PBGMs, resulting in a diverse, high-resolution blade surface defect dataset with multiple types of backgrounds. Finally, experimental validation shows that this method can generate images with defect characteristics and high resolution, achieving a proportion of over 98.5%. Additionally, using the EISeg annotation method reduces the annotation time to just 1/7 of the time required by traditional methods. The generated images and annotated data provide robust support for the detection of blade surface defects.

Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method

  • Lee, Sungyoung;Park, Taesung
    • Genomics & Informatics / v.16 no.4 / pp.39.1-39.3 / 2018
  • The rapid increase in genetic dataset volume has demanded extensive adoption of biological knowledge to reduce computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we have introduced a novel statistical method, PHARAOH, that enables pathway-based association studies of large-scale genetic datasets. However, researcher-level application of the PHARAOH method has been limited by a lack of support for commonly used file formats and by the absence of the quality control options that are essential to practical analysis. To overcome these limitations, we introduce our integration of the PHARAOH method into our recently developed all-in-one workbench. The new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality control measures and filters based on those measures. We expect that the updated PHARAOH will make pathway-level analysis of large-scale genetic datasets more accessible to researchers.

An Improved Deep Learning Method for Animal Images (동물 이미지를 위한 향상된 딥러닝 학습)

  • Wang, Guangxing;Shin, Seong-Yoon;Shin, Kwang-Weong;Lee, Hyun-Chang
    • Proceedings of the Korean Society of Computer Information Conference / 2019.01a / pp.123-124 / 2019
  • This paper proposes an improved deep learning method based on small datasets for animal image classification. Firstly, we use a CNN to build a training model for small datasets and use data augmentation to expand the training samples. Secondly, using a network pre-trained on large-scale datasets, such as VGG16, the bottleneck features of the small dataset are extracted and stored in two NumPy files as the new training and test datasets. Finally, we train a fully connected network with the new datasets. In this paper, we use Kaggle's well-known Dogs vs. Cats dataset, a two-class classification dataset, as the experimental dataset.

STAR-24K: A Public Dataset for Space Common Target Detection

  • Zhang, Chaoyan;Guo, Baolong;Liao, Nannan;Zhong, Qiuyun;Liu, Hengyan;Li, Cheng;Gong, Jianglei
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.2 / pp.365-380 / 2022
  • Target detection algorithms based on supervised learning are the current mainstream approach to target detection. A high-quality dataset is the prerequisite for such an algorithm to achieve good detection performance. The larger and higher in quality the dataset, the stronger the generalization ability of the model; that is, the dataset determines the upper limit of what the model can learn. A convolutional neural network optimizes its parameters in a strongly supervised manner: the error is calculated by comparing the predicted box with the manually labeled ground-truth box and is then passed back into the network for continuous optimization. Strongly supervised learning relies on a large number of images, so the number and quality of images directly affect the learning results. This paper proposes a dataset, STAR-24K (meaning a dataset for Space TArget Recognition with more than 24,000 images), for detecting common targets in space. Since there is currently no publicly available dataset for space target detection, we collected images from sources such as the pictures and videos released on the official websites of NASA (National Aeronautics and Space Administration) and ESA (European Space Agency) and expanded them to 24,451 images. We evaluate popular object detection algorithms to build a benchmark. Our STAR-24K dataset is publicly available at https://github.com/Zzz-zcy/STAR-24K.
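
The abstract does not name the measure used to compare a predicted box with its labeled counterpart. Intersection over Union (IoU) is a widely used overlap measure in object detection, and the sketch below shows it as an assumed example. The `(x1, y1, x2, y2)` corner format and the function name `iou` are illustrative choices, not details from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes.
predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
overlap = iou(predicted, ground_truth)
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap. Benchmarks typically count a detection as correct when its IoU with a labeled box exceeds a chosen threshold.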