• Title/Abstract/Keywords: large dataset

547 search results, processing time 0.027 s

Development of Korean Medicine Data Center (KDC) Teaching Dataset to Enhance Utilization of KDC

  • 백영화;이시우
    • Journal of Sasang Constitutional Medicine / Vol. 29, No. 3 / pp. 242-247 / 2017
  • Objective: The Korean medicine Data Center (KDC) has established large-scale biological and clinical data based on Korean medicine to demonstrate and validate its theory. The aim of this study was to develop a KDC teaching dataset and user guideline to improve utilization of the KDC. Method: The KDC teaching dataset was selected using stratified random sampling according to Sasang constitution (SC). The dataset included 72 variables for 500 sample subjects. The user guideline described how to conduct eight statistical analysis methods using the teaching dataset. Results: The KDC teaching dataset comprised 200 (40%) Taeeumin, 125 (25%) Soeumin, and 175 (35%) Soyangin subjects. It consisted of a questionnaire (basic, habit, disease, symptom), physical exam (body measurement, blood pressure), blood exam, and expert SC diagnosis. The user guideline provided step-by-step instructions for performing several statistical analyses with the KDC teaching dataset. Conclusion: We hope that our results will contribute to enhancing KDC utilization and understanding.
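
The stratified random sampling described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `stratified_sample`, the record layout, and the size of the source population are assumptions; only the per-constitution quotas (200, 125, 175) come from the abstract.

```python
import random
from collections import Counter

def stratified_sample(records, key, quotas, seed=0):
    """Draw quotas[stratum] records at random from each stratum."""
    rng = random.Random(seed)
    by_stratum = {}
    for rec in records:
        by_stratum.setdefault(rec[key], []).append(rec)
    sample = []
    for stratum, n in quotas.items():
        pool = by_stratum.get(stratum, [])
        if len(pool) < n:
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} records")
        # Sampling without replacement inside each stratum.
        sample.extend(rng.sample(pool, n))
    return sample

# Hypothetical source population; the stratum sizes here are invented.
labels = ["Taeeumin"] * 900 + ["Soeumin"] * 600 + ["Soyangin"] * 700
population = [{"id": i, "sc": sc} for i, sc in enumerate(labels)]

# Quotas taken from the abstract (500 subjects in total).
quotas = {"Taeeumin": 200, "Soeumin": 125, "Soyangin": 175}
teaching = stratified_sample(population, "sc", quotas)
counts = Counter(rec["sc"] for rec in teaching)
print(len(teaching), dict(counts))
```

Fixing the seed keeps the draw reproducible, which matters for a dataset that is meant to be shared alongside a user guideline.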

LD-based tagSNP Selection System for Large-scale Haplotype and Genotype Datasets

  • Kim, Sang-Jun;Yeo, Sang-Soo;Kim, Sung-Kwon
    • Korean Society for Bioinformatics: Conference Proceedings / Korean Society for Bioinformatics and Systems Biology, 2004, The 3rd Annual Conference for The Korean Society for Bioinformatics Association of Asian Societies for Bioinformatics 2004 Symposium / pp. 279-285 / 2004
  • In disease association studies, the tagSNP selection problem is important in terms of time and cost. We developed a new tagSNP selection system that also provides facilities for haplotype reconstruction and missing-data processing. Our system improves biological meaningfulness by using LD coefficients together with a dynamic programming method. It is also capable of processing large-scale datasets, such as all the SNPs on a chromosome. We have tested our system with various datasets, including those of Daly et al., Patil et al., and the HapMap Project, as well as artificial datasets.

Multi-faceted Image Dataset Construction Method Based on Rotational Images

  • 김지성;허경용;장시웅
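    • Korea Institute of Information and Communication Engineering: Conference Proceedings / Korea Institute of Information and Communication Engineering 2021 Fall Conference / pp. 75-77 / 2021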
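  • Finding objects in images with deep learning requires an image dataset for training. Raising the object recognition rate requires a large amount of image training data. Building a large dataset is costly, which makes it difficult for an individual to do. This paper introduces a method for more easily building an image dataset that covers multiple faces of an object by capturing rotational footage. We propose placing the object on a turntable, filming it, and then splitting and compositing the captured footage as needed to build the dataset.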

Korean Lip-Reading: Data Construction and Sentence-Level Lip-Reading

  • 조선영;윤수성
    • Journal of the Korea Institute of Military Science and Technology / Vol. 27, No. 2 / pp. 167-176 / 2024
  • Lip-reading is the task of inferring a speaker's utterance from silent video by learning lip movements. It is very challenging because of the inherent ambiguities in lip movement, such as different characters that produce the same lip appearance. Recent advances in deep learning models such as the Transformer and the Temporal Convolutional Network have improved lip-reading performance. However, most previous work deals with English lip-reading, which cannot be applied directly to Korean, and there is no large-scale Korean lip-reading dataset. In this paper, we introduce the first large-scale Korean lip-reading dataset, with more than 120k utterances collected from TV broadcasts including news, documentaries, and dramas. We also present a preprocessing method that uniformly extracts a facial region of interest and propose a Transformer-based model that uses grapheme units for sentence-level Korean lip-reading. We demonstrate that our dataset and model are appropriate for Korean lip-reading through dataset statistics and experimental results.
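
To make "grapheme unit" concrete, the sketch below splits precomposed Hangul syllables into their consonant and vowel letters using the standard Unicode arithmetic for the Hangul Syllables block (U+AC00 to U+D7A3). It is an illustration only: the abstract does not give the paper's exact grapheme inventory or tokenization, and the function name `to_graphemes` is hypothetical.

```python
# Letter tables in Unicode order: 19 initials, 21 vowels, 28 finals
# (the first "final" is the empty string, meaning no final consonant).
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3

def to_graphemes(text):
    """Split each precomposed Hangul syllable into its letters."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            lead = index // (21 * 28)
            vowel = (index % (21 * 28)) // 28
            tail = index % 28
            out.append(CHOSEONG[lead])
            out.append(JUNGSEONG[vowel])
            if tail:  # tail 0 means the syllable has no final consonant
                out.append(JONGSEONG[tail])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out

print(to_graphemes("한글"))
```

Working at this level keeps the output vocabulary to a few dozen symbols instead of the thousands of possible syllable blocks.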

Mid-level Feature Extraction Method Based Transfer Learning to Small-Scale Dataset of Medical Images with Visualizing Analysis

  • Lee, Dong-Ho;Li, Yan;Shin, Byeong-Seok
    • Journal of Information Processing Systems / Vol. 16, No. 6 / pp. 1293-1308 / 2020
  • In fine-tuning-based transfer learning, the size of the dataset may affect learning accuracy. Even when the dataset is small, fine-tuning-based transfer-learning methods incur high computing costs, similar to those for a large-scale dataset. We propose a mid-level feature extractor that retrains only the mid-level convolutional layers, resulting in increased efficiency and reduced computing costs. This mid-level feature extractor is likely to provide an effective alternative for training on a small-scale medical image dataset. The performance of the mid-level feature extractor is compared with that of low- and high-level feature extractors, as well as the fine-tuning method. First, the mid-level feature extractor takes less time to converge than the other methods. Second, it shows good accuracy in validation loss evaluation. Third, it obtains an area under the ROC curve (AUC) of 0.87 on an unseen test dataset that is very different from the training dataset. Fourth, it extracts clearer feature maps of the shape and parts of the chest in X-ray images than the fine-tuning method.

Towards Texture-Based Visualization of Multivariate Dataset

  • Mehmood, Raja Majid;Lee, Hyo Jong
    • Korea Information Processing Society: Conference Proceedings / Korea Information Processing Society 2014 Spring Conference / pp. 582-585 / 2014
  • Visualization is a science that makes the invisible visible through the techniques of experimental visualization and computer-aided visualization. This paper presents the practical aspects of visualizing multivariate datasets. We briefly discuss previous research and introduce a new visualization technique that will help us design and develop a visualization tool for experimental visualization of multivariate datasets. Our newly developed visualization tool can be used in various domains. In this paper, we chose the software industry as the application domain and used a multivariate dataset of software components computed by VizzMaintenance. VizzMaintenance is a software analysis tool that provides multiple software metrics for open-source Java-based programs. The main objective of this research is to develop a new visualization tool for large multivariate datasets that is more efficient and easier for viewers to perceive. Perception is very important for our research, and we decided to test the perception level of our proposed visualization approach with researchers from our research lab.

Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method

  • Lee, Sungyoung;Park, Taesung
    • Genomics & Informatics / Vol. 16, No. 4 / pp. 39.1-39.3 / 2018
  • The rapid increase in genetic dataset volume has demanded extensive adoption of biological knowledge to reduce computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we previously introduced a novel statistical method, PHARAOH, that enables pathway-based association studies of large-scale genetic datasets. However, researcher-level application of the PHARAOH method has been limited by a lack of support for commonly used file formats and the absence of the quality control options that are essential to practical analysis. To overcome these limitations, we introduce our integration of the PHARAOH method into our recently developed all-in-one workbench. The new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality control measures and filters based on those measures. We expect that the updated PHARAOH will make pathway-level analysis of large-scale genetic datasets more accessible to researchers.

An Improved Deep Learning Method for Animal Images

  • 왕광싱;신성윤;신광성;이현창
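    • Korean Society of Computer Information: Conference Proceedings / Proceedings of the 59th Winter Conference of the Korean Society of Computer Information, 2019, Vol. 27, No. 1 / pp. 123-124 / 2019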
  • This paper proposes an improved deep learning method based on small datasets for animal image classification. First, we use a CNN to build a training model for small datasets and use data augmentation to expand the training samples. Second, using a network pre-trained on a large-scale dataset, such as VGG16, the bottleneck features of the small dataset are extracted and stored in two NumPy files as the new training and test datasets. Finally, we train a fully connected network on the new datasets. We use the well-known Kaggle Dogs vs. Cats dataset, a two-class classification dataset, as the experimental dataset.

STAR-24K: A Public Dataset for Space Common Target Detection

  • Zhang, Chaoyan;Guo, Baolong;Liao, Nannan;Zhong, Qiuyun;Liu, Hengyan;Li, Cheng;Gong, Jianglei
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 16, No. 2 / pp. 365-380 / 2022
  • Supervised learning is the current mainstream approach to target detection. A high-quality dataset is the prerequisite for a target detection algorithm to achieve good detection performance. The larger and higher-quality the dataset, the stronger the generalization ability of the model; that is, the dataset determines the upper limit of model learning. The convolutional neural network optimizes its parameters in a strongly supervised manner. The error is calculated by comparing the predicted bounding box with the manually labeled ground-truth box, and the error is then passed back into the network for continuous optimization. Strongly supervised learning relies mainly on a large number of images as examples for continuous learning, so the number and quality of images directly affect the learning results. This paper proposes a dataset, STAR-24K (a dataset for Space TArget Recognition with more than 24,000 images), for detecting common targets in space. Since there is currently no publicly available dataset for space target detection, we collected images from sources such as the pictures and videos released on the official websites of NASA (National Aeronautics and Space Administration) and ESA (European Space Agency) and expanded them to 24,451 images. We evaluate popular object detection algorithms to build a benchmark. Our STAR-24K dataset is publicly available at https://github.com/Zzz-zcy/STAR-24K.
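
One common way to compare a predicted box with a labeled box is intersection over union (IoU). The sketch below is illustrative only: the abstract does not say which error measure or loss the benchmarked detectors use, and the function name `iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero for boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)   # hypothetical model output
labeled = (1, 1, 3, 3)     # hypothetical manual annotation
print(iou(predicted, labeled))
```

An IoU of 1 means the boxes coincide and 0 means they do not overlap, so `1 - iou(...)` can serve as a simple error term.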

Building a Korean Text Summarization Dataset Using News Articles of Social Media

  • 이경호;박요한;이공주
    • KIPS Transactions on Software and Data Engineering / Vol. 9, No. 8 / pp. 251-258 / 2020
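  • Training data for document summarization consists of documents and their summaries. Existing summarization data relied on summaries written manually by people, which made it difficult to obtain data in large quantities. For this reason, online news articles, which are easy to collect and of high document quality, have been widely used in summarization research. In this study, we propose constructing Korean document summarization data by using the descriptions that news outlets post on social media, together with titles and subtitles, as summaries of the article body. We were able to build a dataset of about 425,000 news articles and their summaries. To show the usefulness of the constructed data, we implemented an extractive summarization system. We compared the performance of a supervised learning model trained on the constructed data with that of an unsupervised learning model. In the experiments, the model trained on the proposed data achieved higher ROUGE scores than the unsupervised learning algorithm.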
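
For readers unfamiliar with ROUGE, the sketch below computes ROUGE-1 (unigram overlap) precision, recall, and F1 between a candidate summary and a reference. It is a simplified illustration: the abstract does not state which ROUGE variants or tokenizer the authors used, whitespace tokenization is an assumption that is crude for Korean, and the function name `rouge_1` is hypothetical.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 with clipped counts."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy English example; a real evaluation would use the dataset's summaries.
scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

Here all three candidate tokens appear in the reference, so precision is 1.0, while only three of the six reference tokens are covered, so recall is 0.5.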