• Title/Abstract/Keywords: Training Data Set

Search Result 814, Processing Time 0.021 seconds

Development of kNN QSAR Models for 3-Arylisoquinoline Antitumor Agents

  • Tropsha, Alexander;Golbraikh, Alexander;Cho, Won-Jea
    • Bulletin of the Korean Chemical Society / v.32 no.7 / pp.2397-2404 / 2011
  • A variable-selection k-nearest-neighbor (kNN) QSAR modeling approach was applied to a data set of 80 3-arylisoquinolines exhibiting cytotoxicity against a human lung tumor cell line (A-549). All compounds were characterized with molecular topology descriptors calculated with the MolconnZ program. Seven compounds were randomly selected from the original data set and used as an external validation set. The remaining subset of 73 compounds was divided into multiple training (56 to 61 compounds) and test (17 to 12 compounds) sets using a chemical diversity sampling method developed in this group. Highly predictive models were obtained, characterized by leave-one-out cross-validated $R^2$ ($q^2$) values greater than 0.8 for the training sets and $R^2$ values greater than 0.7 for the test sets. The robustness of the models was confirmed by the Y-randomization test: all models built using training sets with randomly shuffled activities were characterized by low $q^2 \leq 0.26$ and $R^2 \leq 0.22$ for training and test sets, respectively. The twelve best models (with the highest values of both $q^2$ and $R^2$) predicted the activities of the external validation set of seven compounds with $R^2$ ranging from 0.71 to 0.93.
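
The leave-one-out $q^2$ and the Y-randomization check described in this abstract can be illustrated with a minimal sketch. It assumes the common definition $q^2 = 1 - \mathrm{PRESS}/\mathrm{SS}_{total}$ and a plain kNN regressor without variable selection; the toy descriptors, activities, and function names are hypothetical and are not taken from the paper.

```python
import random

def knn_predict(train_X, train_y, query, k):
    """Predict activity as the mean activity of the k nearest training compounds."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def loo_q2(X, y, k=2):
    """Leave-one-out q^2 = 1 - PRESS / total sum of squares (assumed definition)."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(X)):
        # Hold compound i out, predict it from all the others.
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        press += (y[i] - knn_predict(rest_X, rest_y, X[i], k)) ** 2
    total = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / total

# Hypothetical toy data: one descriptor, activity linear in the descriptor.
X = [[float(i)] for i in range(20)]
y = [2.0 * i + 1.0 for i in range(20)]

q2_real = loo_q2(X, y, k=2)

# Y-randomization: shuffle the activities and rebuild the model.
rng = random.Random(0)
y_shuffled = y[:]
rng.shuffle(y_shuffled)
q2_random = loo_q2(X, y_shuffled, k=2)
```

A model that captures real structure should keep a high `q2_real`, while `q2_random` should drop sharply once the activities are shuffled; that contrast is what the Y-randomization test looks for.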

Finding Unexpected Test Accuracy by Cross Validation in Machine Learning

  • Yoon, Hoijin
    • International Journal of Computer Science & Network Security / v.21 no.12spc / pp.549-555 / 2021
  • Machine learning (ML) splits data into three parts, usually 60% for training, 20% for validation, and 20% for testing. The split is purely quantitative: the sets are not selected by a criterion, although such a criterion is a very important concept for the adequacy of test data. ML measures a model's accuracy by applying a set of validation data and revises the model until the validation accuracy reaches a certain level. After the validation process, the completed model is tested with the set of test data, which the model has not yet seen. If the set of test data covers the model's attributes well, the test accuracy will be close to the validation accuracy of the model. To check whether ML's set of test data works adequately, we design an experiment and see whether the test accuracy of a model is always close to its validation accuracy, as expected. The experiment builds 100 different SVM models for each of six data sets published in the UCI ML repository. Among the test and validation accuracies of the 600 cases, we find some unexpected cases in which the test accuracy is very different from the validation accuracy. Consequently, it is not always true that ML's set of test data is adequate to assure a model's quality.
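
The purely quantitative 60/20/20 split described in this abstract can be sketched as follows. This is a minimal illustration using only the standard library; the function name, the seed, and the integer sample IDs are assumptions, not part of the paper's experiment.

```python
import random

def split_60_20_20(samples, seed=0):
    """Shuffle the samples and cut them into 60% / 20% / 20% parts by count only."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding surprises
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Hypothetical data set of 100 sample IDs.
train, val, test = split_60_20_20(range(100))
```

Nothing in this split looks at the content of the samples, which is the point the abstract makes: the test set is chosen by size alone, so nothing guarantees it covers the model's attributes.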

Training Data Sets Construction from Large Data Set for PCB Character Recognition

  • NDAYISHIMIYE, Fabrice;Gang, Sumyung;Lee, Joon Jae
    • Journal of Multimedia Information System / v.6 no.4 / pp.225-234 / 2019
  • Deep learning has become increasingly popular in both academia and industry. Various domains, including pattern recognition and computer vision, have witnessed the great power of deep neural networks. However, current studies on deep learning mainly focus on quality data sets with balanced class labels, while training on poor and imbalanced data sets poses great challenges for classification tasks. In this paper, we propose data reduction techniques based on data analysis for selecting good and diverse data samples from a large data set for a deep learning model. Furthermore, data sampling techniques can be applied to reduce the size of large raw data by retrieving its useful knowledge as representatives. Therefore, instead of dealing with the full raw data, we can use data reduction techniques to sample the data without losing important information. We group PCB characters into classes and train the ResNet56 v2 and SENet deep learning models in order to improve the classification performance of an optical character recognition (OCR) character classifier.

Video augmentation technique for human action recognition using genetic algorithm

  • Nida, Nudrat;Yousaf, Muhammad Haroon;Irtaza, Aun;Velastin, Sergio A.
    • ETRI Journal / v.44 no.2 / pp.327-338 / 2022
  • Classification models for human action recognition require robust features and large training sets for good generalization. However, data augmentation methods are employed for imbalanced training sets to achieve higher accuracy. Because samples generated using data augmentation only reflect existing samples within the training set, their feature representations are less diverse and hence contribute to less precise classification. This paper presents new data augmentation and action representation approaches to grow training sets. The proposed approach is based on two fundamental concepts: virtual video generation for augmentation and representation of the action videos through robust features. Virtual videos are generated from the motion history templates of action videos, which are convolved using a convolutional neural network to generate deep features. Furthermore, by observing an objective function of the genetic algorithm, the spatiotemporal features of different samples are combined to generate the representations of the virtual videos, which are then classified through an extreme learning machine classifier on the MuHAVi-Uncut, iXMAS, and IAVID-1 datasets.

Recovery the Missing Streamflow Data on River Basin Based on the Deep Neural Network Model

  • Le, Xuan-Hien;Lee, Giha
    • Proceedings of the Korea Water Resources Association Conference / 2019.05a / pp.156-156 / 2019
  • In this study, a gated recurrent unit (GRU) network is constructed based on a deep neural network (DNN) with the aim of restoring missing daily flow data in river basins. The Lai Chau hydrological station, located upstream in the Da River basin (Vietnam), is selected as the target station for this study. The model inputs are observed daily flow data for the 24 years from 1961 to 1984 (before the Hoa Binh dam was built) at 5 hydrological stations: 4 gauge stations downstream in the basin and the target station to be restored (Lai Chau). The total available data is divided into sections for different purposes. The data set of 23 years (1961-1983) was employed for training and validation, with 80% used for training and 20% for validation. Another data set of one year (1984) was used for testing, to objectively verify the performance and accuracy of the model. Although only a modest amount of input data is required and the Lai Chau hydrological station is located upstream on the Da River, the results calculated with the suggested model are in satisfactory agreement with the observed data: the Nash-Sutcliffe efficiency (NSE) is higher than 95%. The findings of this study illustrate the outstanding performance of the GRU network model in recovering the missing flow data at Lai Chau station. As a result, DNN models, as well as GRU network models, have great potential for application within the field of hydrology and hydraulics.
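
The Nash-Sutcliffe efficiency reported in this abstract has a standard definition, $\mathrm{NSE} = 1 - \sum (Q_{obs} - Q_{sim})^2 / \sum (Q_{obs} - \bar{Q}_{obs})^2$, which can be sketched as below. The flow values are hypothetical and are not data from the study.

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance

# Hypothetical daily flows (same units for both series).
observed = [10.0, 12.0, 15.0, 11.0, 9.0]
simulated = [10.5, 11.5, 14.0, 11.0, 9.5]

nse = nash_sutcliffe(observed, simulated)
```

An NSE of 1 means a perfect match, and an NSE of 0 means the model is no better than always predicting the observed mean, so a value above 0.95 indicates close agreement.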

Assessing the Extent and Rate of Deforestation in the Mountainous Tropical Forest

  • Pujiono, Eko;Lee, Woo-Kyun;Kwak, Doo-Ahn;Lee, Jong-Yeol
    • Korean Journal of Remote Sensing / v.27 no.3 / pp.315-328 / 2011
  • Landsat data combined with additional bands, namely the normalized difference vegetation index (NDVI) and band ratios, were used to assess the extent and rate of deforestation in the Gunung Mutis Nature Reserve (GMNR), a mountainous tropical forest in eastern Indonesia. Hybrid classification was chosen as the classification approach. In this approach, an unsupervised classification method, iterative self-organizing data analysis (ISODATA), was used to create signature files and the training data set. A statistical separability measure, transformed divergence (TD), was used to identify the combination of bands that showed the highest distinction between the land cover classes in the training data set. A supervised classification method, maximum likelihood classification (MLC), was then performed using the selected bands and the training data set. Post-classification smoothing and accuracy assessment were applied to the classified image. Post-classification comparison was used to assess the extent of deforestation, and the rate of deforestation was calculated with the formula suggested by the Food and Agriculture Organization (FAO). The results for the two assessment periods showed that the extent of deforestation during 1989-1999 was 720.72 ha, an annual deforestation rate of 0.80%, and that the extent during 1999-2009 was 1,059.12 ha, an annual rate of 1.31%. Such results are important for the GMNR authority in establishing strategies, plans, and actions for combating deforestation.
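
The NDVI band added to the Landsat data in this abstract follows the standard formula $\mathrm{NDVI} = (\mathrm{NIR} - \mathrm{Red}) / (\mathrm{NIR} + \mathrm{Red})$. A minimal sketch is given below; the reflectance values, the function names, and the choice to return 0 for pixels with no signal are assumptions made for illustration.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red) for a single pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat pixels with no signal as 0
    return (nir - red) / total

def ndvi_band(nir_band, red_band):
    """Compute an NDVI band from two equally sized 2-D reflectance grids."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]

# Hypothetical 2x2 reflectance grids.
nir_band = [[0.50, 0.40], [0.10, 0.0]]
red_band = [[0.10, 0.40], [0.30, 0.0]]

result = ndvi_band(nir_band, red_band)
```

Dense vegetation reflects strongly in the near infrared and weakly in the red, so it yields NDVI values close to 1, which is why the index helps separate forest from cleared land.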

Effects of Balance Training through Visual Control on Balance Ability, Postural Control, and Balance Confidence in Chronic Stroke Patients (시각 통제를 이용한 균형훈련이 만성 뇌졸중 환자의 균형능력과 자세조절, 균형자신감에 미치는 영향)

  • Jeong, Seong-Hwa;Koo, Hyun-Mo
    • PNF and Movement / v.18 no.1 / pp.133-141 / 2020
  • Purpose: The purpose of this study was to conduct balance training through vision control to improve the balance, postural control, and balance confidence of stroke patients and to decrease their visual and sensory dependence. Methods: Twenty-eight chronic stroke patients volunteered to participate in the study. They were randomly assigned to an eyes-closed or an eyes-open training group. Three times a week for four weeks, each group performed an unstable-support session and a balance training session for thirty minutes per set. Their balance, postural control, and balance confidence were assessed using BIO Rescue (BR), the postural assessment scale for stroke (PASS), and the Korean activity-specific balance confidence scale (K-ABC), respectively. All data were analyzed using SPSS version 22.0. The mean values of each data set before and after the intervention were compared using independent t-tests. The significance level for statistical analyses was set at 0.05. Results: Comparison between the groups showed statistically significant effects on all variables before and after the intervention (p < 0.05). Conclusion: This study showed that balance-training programs involving vision control improve the balance, postural control, and balance confidence of chronic stroke patients. Thus, in clinical practice, stroke patients should undergo training programs with vision control that increase the use of their other senses.

Individual factors influencing the location decisions of practicing physicians (최근 배출된 전문의의 개원지역 선택에 영향을 미치는 개인요인 분석)

  • 김창엽;윤석준;이진석;김용익
    • Health Policy and Management / v.9 no.3 / pp.21-32 / 1999
  • The purpose of this study is to assess the individual factors that determine the distribution of medical specialists in Korea. A data set was constructed using several published data sources, including the Korean Medical Association's physician master file as the principal source of physician information. Linear logistic regression analysis was performed to assess the relationship between the practice location of private specialist clinics and six variables related to individual characteristics: age, sex, location of postgraduate training hospital, location of the medical school attended, size of the training hospital, and specialty. The analysis showed that location of practice, classified into urban and rural areas, was significantly associated with sex, location of postgraduate training hospital, and location of medical school. In addition, a significant association was found between location of practice, categorized into the "near-Seoul area" and others, and sex, location of postgraduate training hospital, and location of medical school. We conclude that, to improve the geographic maldistribution of physicians, the locations of training hospitals and medical schools should have the highest priority in policymaking.

A Software Quality Prediction Model Without Training Data Set (훈련데이터 집합을 사용하지 않는 소프트웨어 품질예측 모델)

  • Hong, Euy-Seok
    • The KIPS Transactions:PartD / v.10D no.4 / pp.689-696 / 2003
  • Criticality prediction models, which determine whether a design entity is fault-prone or not fault-prone, are used to identify trouble spots in a software system during the analysis or design phases. Many criticality prediction models that identify fault-prone modules using complexity metrics have been suggested, but most of them need a training data set. Unfortunately, very few organizations have their own training data. To solve this problem, this paper builds a new prediction model, KSM, based on Kohonen SOM neural networks. KSM is implemented and compared with a well-known prediction model, the BackPropagation neural network Model (BPM), in terms of internal characteristics, utilization cost, and prediction accuracy. The results show that KSM has performance comparable to that of BPM.

Generation of Efficient Fuzzy Classification Rules Using Evolutionary Algorithm with Data Partition Evaluation (데이터 분할 평가 진화알고리즘을 이용한 효율적인 퍼지 분류규칙의 생성)

  • Ryu, Joung-Woo;Kim, Sung-Eun;Kim, Myung-Won
    • Journal of the Korean Institute of Intelligent Systems / v.18 no.1 / pp.32-40 / 2008
  • Fuzzy rules are a useful and efficient way to describe classification rules, especially when the attribute values are continuous and fuzzy in nature. However, it is generally difficult to determine membership functions for generating efficient fuzzy classification rules. In this paper, we propose a method for automatically generating efficient fuzzy classification rules using an evolutionary algorithm. In our method, we generate a set of initial membership functions for the evolutionary algorithm by supervised clustering of the training data set, and we evolve this set in order to generate fuzzy classification rules that take into consideration both classification accuracy and rule comprehensibility. To reduce the time needed to evaluate an individual, we also propose an evolutionary algorithm with data partition evaluation, in which the training data set is partitioned into a number of subsets and individuals are evaluated using one randomly selected subset at a time instead of the whole training data set. We tested our algorithm on the UCI learning data sets, and the experimental results showed that our method was, on average, more efficient than the existing algorithms. For the evolutionary algorithm with data partition evaluation, we ran experiments on the intrusion detection data of the KDD'99 Cup and confirmed that evaluation time was reduced by about 70%. Compared with the KDD'99 Cup winner, the accuracy was increased by 1.54% while the cost was reduced by 20.8%.
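
The data partition evaluation idea in this abstract, scoring individuals on one randomly chosen subset rather than on the whole training set, can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: the individuals here are simple threshold classifiers rather than fuzzy rule sets, one subset is drawn per generation, and all names and data are hypothetical.

```python
import random

def partition(data, n_parts, seed=0):
    """Shuffle the training data and split it into n_parts disjoint subsets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

def accuracy(classifier, samples):
    """Fraction of (x, label) samples the classifier labels correctly."""
    correct = sum(1 for x, label in samples if classifier(x) == label)
    return correct / len(samples)

def evaluate_population(population, subsets, rng):
    """Score every individual on one randomly selected subset (assumed: per generation)."""
    subset = rng.choice(subsets)
    return [accuracy(individual, subset) for individual in population]

# Hypothetical training data: label is True when x >= 50.
data = [(x, x >= 50) for x in range(100)]
subsets = partition(data, n_parts=5)

# Hypothetical individuals: threshold classifiers with different cut points.
population = [lambda x, t=t: x >= t for t in (30, 50, 70)]

fitness = evaluate_population(population, subsets, random.Random(1))
```

Each fitness evaluation here touches 20 samples instead of 100, which is the source of the time saving; the trade-off is a noisier fitness estimate, since a different subset may rank individuals slightly differently.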