• Title/Summary/Keyword: datasets

Search Result 2,005, Processing Time 0.03 seconds

Analysis of differences in human leukocyte antigen between the two Wellcome Trust Case Control Consortium control datasets

  • Jang, Chloe Soohyun;Choi, Wanson;Cook, Seungho;Han, Buhm
    • Genomics & Informatics
    • /
    • v.17 no.3
    • /
    • pp.29.1-29.8
    • /
    • 2019
  • The Wellcome Trust Case Control Consortium (WTCCC) study was a large genome-wide association study that aimed to identify common variants associated with seven diseases. That study combined two control datasets (58C and UK Blood Services) as shared controls. Prior to using the combined controls, the WTCCC performed analyses to show that the genomic content of the control datasets was not significantly different. Recently, the analysis of human leukocyte antigen (HLA) genes has become prevalent due to the development of HLA imputation technology. In this project, we extended the between-control homogeneity analysis of the WTCCC to HLA. We imputed HLA information in the WTCCC control dataset and showed that the HLA content was not significantly different between the two control datasets, suggesting that the combined controls can be used as controls for HLA fine-mapping analysis based on HLA imputation.

Evaluation of the Total Column Ozone in the Reanalysis Datasets over East Asia (동아시아 지역 오존 전량 재분석 자료의 검증)

  • Han, Bo-Reum;Oh, Jiyoung;Park, Sunmin;Son, Seok-Woo
    • Atmosphere
    • /
    • v.29 no.5
    • /
    • pp.659-669
    • /
    • 2019
  • This study assesses the quality of the total column ozone (TCO) data from five reanalysis datasets against nine independent observation in East Asia. The assessed datasets are the ECMWF Interim reanalysis (ERAI), Monitoring Atmosphere Composition and Climate reanalysis (MACC), Copernicus Atmosphere Monitoring Service reanalysis (CAMS), the NASA Modern-Era Retrospective analysis for Research and Applications, Version2 (MERRA2), and NCEP Climate Forecast System Reanalysis (CFSR). All datasets reasonably well capture the spatial distribution, annual cycle and interannual variability of TCO in East Asia. In particular, characteristics of TCO according to the latitude difference were similar at all points with a maximum bias of less than about 4%. Among them, CAMS and CFSR show the smallest mean bias and root-mean square error across all nine ground-based observations. This result indicates that while TCO data in modern reanalyses are reasonably good, CAMS and CFSR TCO data are the best for analysing the spatio-temporal variability and change of TCO in East Asia.

Imbalanced SVM-Based Anomaly Detection Algorithm for Imbalanced Training Datasets

  • Wang, GuiPing;Yang, JianXi;Li, Ren
    • ETRI Journal
    • /
    • v.39 no.5
    • /
    • pp.621-631
    • /
    • 2017
  • Abnormal samples are usually difficult to obtain in production systems, resulting in imbalanced training sample sets. Namely, the number of positive samples is far less than the number of negative samples. Traditional Support Vector Machine (SVM)-based anomaly detection algorithms perform poorly for highly imbalanced datasets: the learned classification hyperplane skews toward the positive samples, resulting in a high false-negative rate. This article proposes a new imbalanced SVM (termed ImSVM)-based anomaly detection algorithm, which assigns a different weight for each positive support vector in the decision function. ImSVM adjusts the learned classification hyperplane to make the decision function achieve a maximum GMean measure value on the dataset. The above problem is converted into an unconstrained optimization problem to search the optimal weight vector. Experiments are carried out on both Cloud datasets and Knowledge Discovery and Data Mining datasets to evaluate ImSVM. Highly imbalanced training sample sets are constructed. The experimental results show that ImSVM outperforms over-sampling techniques and several existing imbalanced SVM-based techniques.

An enhanced feature selection filter for classification of microarray cancer data

  • Mazumder, Dilwar Hussain;Veilumuthu, Ramachandran
    • ETRI Journal
    • /
    • v.41 no.3
    • /
    • pp.358-370
    • /
    • 2019
  • The main aim of this study is to select the optimal set of genes from microarray cancer datasets that contribute to the prediction of specific cancer types. This study proposes the enhancement of the feature selection filter algorithm based on Joe's normalized mutual information and its use for gene selection. The proposed algorithm is implemented and evaluated on seven benchmark microarray cancer datasets, namely, central nervous system, leukemia (binary), leukemia (3 class), leukemia (4 class), lymphoma, mixed lineage leukemia, and small round blue cell tumor, using five well-known classifiers, including the naive Bayes, radial basis function network, instance-based classifier, decision-based table, and decision tree. An average increase in the prediction accuracy of 5.1% is observed on all seven datasets averaged over all five classifiers. The average reduction in training time is 2.86 seconds. The performance of the proposed method is also compared with those of three other popular mutual information-based feature selection filters, namely, information gain, gain ratio, and symmetric uncertainty. The results are impressive when all five classifiers are used on all the datasets.

ETLi: Efficiently annotated traffic LiDAR dataset using incremental and suggestive annotation

  • Kang, Jungyu;Han, Seung-Jun;Kim, Nahyeon;Min, Kyoung-Wook
    • ETRI Journal
    • /
    • v.43 no.4
    • /
    • pp.630-639
    • /
    • 2021
  • Autonomous driving requires a computerized perception of the environment for safety and machine-learning evaluation. Recognizing semantic information is difficult, as the objective is to instantly recognize and distinguish items in the environment. Training a model with real-time semantic capability and high reliability requires extensive and specialized datasets. However, generalized datasets are unavailable and are typically difficult to construct for specific tasks. Hence, a light detection and ranging semantic dataset suitable for semantic simultaneous localization and mapping and specialized for autonomous driving is proposed. This dataset is provided in a form that can be easily used by users familiar with existing two-dimensional image datasets, and it contains various weather and light conditions collected from a complex and diverse practical setting. An incremental and suggestive annotation routine is proposed to improve annotation efficiency. A model is trained to simultaneously predict segmentation labels and suggest class-representative frames. Experimental results demonstrate that the proposed algorithm yields a more efficient dataset than uniformly sampled datasets.

K-Means Clustering with Deep Learning for Fingerprint Class Type Prediction

  • Mukoya, Esther;Rimiru, Richard;Kimwele, Michael;Mashava, Destine
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.3
    • /
    • pp.29-36
    • /
    • 2022
  • In deep learning classification tasks, most models frequently assume that all labels are available for the training datasets. As such strategies to learn new concepts from unlabeled datasets are scarce. In fingerprint classification tasks, most of the fingerprint datasets are labelled using the subject/individual and fingerprint datasets labelled with finger type classes are scarce. In this paper, authors have developed approaches of classifying fingerprint images using the majorly known fingerprint classes. Our study provides a flexible method to learn new classes of fingerprints. Our classifier model combines both the clustering technique and use of deep learning to cluster and hence label the fingerprint images into appropriate classes. The K means clustering strategy explores the label uncertainty and high-density regions from unlabeled data to be clustered. Using similarity index, five clusters are created. Deep learning is then used to train a model using a publicly known fingerprint dataset with known finger class types. A prediction technique is then employed to predict the classes of the clusters from the trained model. Our proposed model is better and has less computational costs in learning new classes and hence significantly saving on labelling costs of fingerprint images.

Validation of Semantic Segmentation Dataset for Autonomous Driving (승용자율주행을 위한 의미론적 분할 데이터셋 유효성 검증)

  • Gwak, Seoku;Na, Hoyong;Kim, Kyeong Su;Song, EunJi;Jeong, Seyoung;Lee, Kyewon;Jeong, Jihyun;Hwang, Sung-Ho
    • Journal of Drive and Control
    • /
    • v.19 no.4
    • /
    • pp.104-109
    • /
    • 2022
  • For autonomous driving research using AI, datasets collected from road environments play an important role. In other countries, various datasets such as CityScapes, A2D2, and BDD have already been released, but datasets suitable for the domestic road environment still need to be provided. This paper analyzed and verified the dataset reflecting the Korean driving environment. In order to verify the training dataset, the class imbalance was confirmed by comparing the number of pixels and instances of the dataset. A similar A2D2 dataset was trained with the same deep learning model, ConvNeXt, to compare and verify the constructed dataset. IoU was compared for the same class between two datasets with ConvNeXt and mIoU was compared. In this paper, it was confirmed that the collected dataset reflecting the driving environment of Korea is suitable for learning.

Construction of Database for Deep Learning-based Occlusion Area Detection in the Virtual Environment (가상 환경에서의 딥러닝 기반 폐색영역 검출을 위한 데이터베이스 구축)

  • Kim, Kyeong Su;Lee, Jae In;Gwak, Seok Woo;Kang, Won Yul;Shin, Dae Young;Hwang, Sung Ho
    • Journal of Drive and Control
    • /
    • v.19 no.3
    • /
    • pp.9-15
    • /
    • 2022
  • This paper proposes a method for constructing and verifying datasets used in deep learning technology, to prevent safety accidents in automated construction machinery or autonomous vehicles. Although open datasets for developing image recognition technologies are challenging to meet requirements desired by users, this study proposes the interface of virtual simulators to facilitate the creation of training datasets desired by users. The pixel-level training image dataset was verified by creating scenarios, including various road types and objects in a virtual environment. Detecting an object from an image may interfere with the accurate path determination due to occlusion areas covered by another object. Thus, we construct a database, for developing an occlusion area detection algorithm in a virtual environment. Additionally, we present the possibility of its use as a deep learning dataset to calculate a grid map, that enables path search considering occlusion areas. Custom datasets are built using the RDBMS system.

Impacts of label quality on performance of steel fatigue crack recognition using deep learning-based image segmentation

  • Hsu, Shun-Hsiang;Chang, Ting-Wei;Chang, Chia-Ming
    • Smart Structures and Systems
    • /
    • v.29 no.1
    • /
    • pp.207-220
    • /
    • 2022
  • Structural health monitoring (SHM) plays a vital role in the maintenance and operation of constructions. In recent years, autonomous inspection has received considerable attention because conventional monitoring methods are inefficient and expensive to some extent. To develop autonomous inspection, a potential approach of crack identification is needed to locate defects. Therefore, this study exploits two deep learning-based segmentation models, DeepLabv3+ and Mask R-CNN, for crack segmentation because these two segmentation models can outperform other similar models on public datasets. Additionally, impacts of label quality on model performance are explored to obtain an empirical guideline on the preparation of image datasets. The influence of image cropping and label refining are also investigated, and different strategies are applied to the dataset, resulting in six alternated datasets. By conducting experiments with these datasets, the highest mean Intersection-over-Union (mIoU), 75%, is achieved by Mask R-CNN. The rise in the percentage of annotations by image cropping improves model performance while the label refining has opposite effects on the two models. As the label refining results in fewer error annotations of cracks, this modification enhances the performance of DeepLabv3+. Instead, the performance of Mask R-CNN decreases because fragmented annotations may mistake an instance as multiple instances. To sum up, both DeepLabv3+ and Mask R-CNN are capable of crack identification, and an empirical guideline on the data preparation is presented to strengthen identification successfulness via image cropping and label refining.

Generation of Super-Resolution Benchmark Dataset for Compact Advanced Satellite 500 Imagery and Proof of Concept Results

  • Yonghyun Kim;Jisang Park;Daesub Yoon
    • Korean Journal of Remote Sensing
    • /
    • v.39 no.4
    • /
    • pp.459-466
    • /
    • 2023
  • In the last decade, artificial intelligence's dramatic advancement with the development of various deep learning techniques has significantly contributed to remote sensing fields and satellite image applications. Among many prominent areas, super-resolution research has seen substantial growth with the release of several benchmark datasets and the rise of generative adversarial network-based studies. However, most previously published remote sensing benchmark datasets represent spatial resolution within approximately 10 meters, imposing limitations when directly applying for super-resolution of small objects with cm unit spatial resolution. Furthermore, if the dataset lacks a global spatial distribution and is specialized in particular land covers, the consequent lack of feature diversity can directly impact the quantitative performance and prevent the formation of robust foundation models. To overcome these issues, this paper proposes a method to generate benchmark datasets by simulating the modulation transfer functions of the sensor. The proposed approach leverages the simulation method with a solid theoretical foundation, notably recognized in image fusion. Additionally, the generated benchmark dataset is applied to state-of-the-art super-resolution base models for quantitative and visual analysis and discusses the shortcomings of the existing datasets. Through these efforts, we anticipate that the proposed benchmark dataset will facilitate various super-resolution research shortly in Korea.