• Title/Summary/Keyword: training data


Design and Implementation of Cyber Warfare Training Data Set Generation Method based on Traffic Distribution Plan (트래픽 유통계획 기반 사이버전 훈련데이터셋 생성방법 설계 및 구현)

  • Kim, Yong Hyun;Ahn, Myung Kil
    • Convergence Security Journal
    • /
    • v.20 no.4
    • /
    • pp.71-80
    • /
    • 2020
  • In order to provide realistic traffic to a cyber warfare training system, a traffic distribution plan must be prepared in advance and a training data set created from normal/threat data sets. This paper presents the design and implementation of a method for creating a traffic distribution plan and a training data set that supply background traffic resembling a real environment to a cyber warfare training system. We propose a method for building a traffic distribution plan from the network topology of the training environment and from traffic attribute information collected in real and simulated environments, and a method for generating a training data set according to that plan using unit traffic and a mixed-traffic method based on protocol ratios. Using the implemented tool, a traffic distribution plan was created, and the resulting training data set was confirmed to follow the plan.
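The mixed-traffic step described in the abstract can be sketched as follows. The unit flows and the 60/30/10 protocol ratio plan below are hypothetical stand-ins for the paper's unit traffic and distribution plan, which the abstract does not specify in detail.

```python
import random

def build_mixed_traffic(unit_flows, protocol_ratio, total_packets, seed=0):
    """Draw packets from per-protocol unit flows according to a ratio plan.

    unit_flows: dict protocol -> list of template packets (any objects)
    protocol_ratio: dict protocol -> fraction of the mixed stream
    """
    rng = random.Random(seed)
    mixed = []
    for proto, ratio in protocol_ratio.items():
        count = round(total_packets * ratio)
        templates = unit_flows[proto]
        # Sample with replacement from the unit traffic for this protocol.
        mixed.extend(rng.choice(templates) for _ in range(count))
    rng.shuffle(mixed)  # interleave protocols like real background traffic
    return mixed

# Hypothetical distribution plan: 60% HTTP, 30% DNS, 10% SSH.
flows = {"HTTP": ["GET /", "POST /"], "DNS": ["A? example.com"], "SSH": ["SSH2_MSG"]}
plan = {"HTTP": 0.6, "DNS": 0.3, "SSH": 0.1}
stream = build_mixed_traffic(flows, plan, total_packets=100)
```

The shuffle is a simplification; a real distribution plan would also schedule packet timing.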

A Content Analysis of Research Data Management Training Programs at the University Libraries in North America: Focusing on Data Literacy Competencies (북미 대학도서관 연구데이터 관리 교육 프로그램 내용 분석: 데이터 리터러시 세부 역량을 중심으로)

  • Kim, Jihyun
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.4
    • /
    • pp.7-36
    • /
    • 2018
  • This study analyzed the content of Research Data Management (RDM) training programs provided by 51 of the 121 university libraries in North America that offer RDM services, and drew implications from the results. For the content analysis, 317 titles of classroom training programs and 42 top-level headings from the tables of contents of online tutorials were collected and coded against 12 data literacy competencies identified in previous studies. Among classroom training programs, those addressing data processing and analysis were offered most often; the largest number of libraries offered programs on data management and organization, and the third most common programs dealt with data visualization and representation. Each of the remaining nine competencies was covered by only a few programs, implying that classroom training concentrated on particular data literacy competencies. Five university libraries developed and provided their own online tutorials; analysis of the headings showed that data preservation, ethics and data citation, and data management and organization were mainly covered, competencies that differ from those stressed by the classroom programs. Effective RDM training requires understanding and supporting the data literacy competencies researchers need to produce research results, in addition to the competencies university librarians have traditionally taught and emphasized, and developing educational resources that support continuing education for librarians involved in RDM services.

Dynamically weighted loss based domain adversarial training for children's speech recognition (어린이 음성인식을 위한 동적 가중 손실 기반 도메인 적대적 훈련)

  • Seunghee, Ma
    • The Journal of the Acoustical Society of Korea
    • /
    • v.41 no.6
    • /
    • pp.647-654
    • /
    • 2022
  • Although children's speech recognition is being used in a growing range of fields, the lack of quality data is an obstacle to improving its performance. This paper proposes a new method for improving children's speech recognition by additionally using adult speech data. The proposed method is transformer-based domain adversarial training with a dynamically weighted loss, which addresses the data imbalance between age groups that grows as the amount of adult training data increases. Specifically, the degree of class imbalance within each mini-batch is quantified, and the loss function is defined so that the smaller the amount of data for a class, the greater its weight. Experiments validate the utility of the proposed domain adversarial training under asymmetry between adult and child training data, and show that it achieves higher children's speech recognition performance than conventional domain adversarial training in all conditions where age asymmetry occurs in the training data.
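The per-mini-batch weighting idea can be sketched as below. The abstract does not give the exact formula, so inverse-frequency weights normalized to average 1 over the batch are an assumption, and the adult/child batch is hypothetical.

```python
from collections import Counter

def dynamic_class_weights(batch_labels):
    """Weight each class inversely to its frequency in the current mini-batch,
    so the under-represented domain (e.g. children) contributes more loss.
    Normalized so the weights average to 1 over the batch."""
    counts = Counter(batch_labels)
    total = len(batch_labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Hypothetical mini-batch: 6 adult utterances, 2 child utterances.
batch = ["adult"] * 6 + ["child"] * 2
w = dynamic_class_weights(batch)
```

In training, each sample's domain-classifier loss term would be multiplied by the weight of its label, recomputed for every mini-batch as the imbalance changes.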

Automatic Classification Method for Time-Series Image Data using Reference Map (Reference Map을 이용한 시계열 image data의 자동분류법)

  • Hong, Sun-Pyo
    • The Journal of the Acoustical Society of Korea
    • /
    • v.16 no.2
    • /
    • pp.58-65
    • /
    • 1997
  • A new automatic classification method with high and stable accuracy for time-series image data is presented in this paper. The method rests on the precondition that a classified map of the target area already exists, i.e., that at least one image in the time series has been classified. The classified map is used as a reference map to specify training areas for the classification categories. The method consists of five steps: extraction of training data using the reference map, detection of changed pixels based on the homogeneity of the training data, clustering of changed pixels, reconstruction of the training data, and classification with a maximum likelihood classifier. To evaluate its performance, four time-series Landsat TM images were classified using this method and using a conventional method that requires a skilled operator. As a result, classified maps with high reliability and fast throughput were obtained without a skilled operator.
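The first and last of the five steps can be sketched in miniature as follows, assuming single-band pixel values and per-class 1-D Gaussians; the change detection, clustering, and reconstruction steps in between are omitted, and the pixel values and class names are hypothetical.

```python
import math
from collections import defaultdict

def extract_training(image, reference_map):
    """Step 1: collect pixel values per class, using the reference map as labels."""
    samples = defaultdict(list)
    for value, label in zip(image, reference_map):
        samples[label].append(value)
    return samples

def ml_classify(value, samples):
    """Step 5: maximum likelihood under a per-class 1-D Gaussian."""
    best, best_ll = None, -math.inf
    for label, vals in samples.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals) or 1e-9
        ll = -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if ll > best_ll:
            best, best_ll = label, ll
    return best

# Hypothetical 1-band pixels: water is dark, vegetation is bright.
image = [10, 12, 11, 80, 82, 79]
ref_map = ["water", "water", "water", "veg", "veg", "veg"]
train = extract_training(image, ref_map)
```

A real implementation would use multispectral Landsat TM bands and full covariance matrices rather than scalar variances.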


Performance Comparison between Neural Network and Genetic Programming Using Gas Furnace Data

  • Bae, Hyeon;Jeon, Tae-Ryong;Kim, Sung-Shin
    • Journal of information and communication convergence engineering
    • /
    • v.6 no.4
    • /
    • pp.448-453
    • /
    • 2008
  • This study describes design and development techniques for process estimation models. A case study is undertaken to design a model using the standard gas furnace data. Neural networks (NN) and genetic programming (GP) are each employed to model the crucial relationships between input factors and output responses. In the case study, the two models were trained on 70% of the data and evaluated on the remaining 30%, and their performance was compared using RMSE values calculated from the model outputs. The average RMSE was 0.8925 (training) and 0.9951 (testing) for the NN model, and 0.707227 (training) and 0.673150 (testing) for the GP model. The NN model thus has an advantage in model training (when all data are used for training), while the GP model has an advantage in model testing (when the data are split into training and testing sets). The performance reproducibility of the GP model is good, so this approach appears suitable for modeling physical fabrication processes.
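The 70/30 evaluation protocol can be sketched as below; the series and the mean-predictor "model" are hypothetical placeholders for the gas furnace data and the NN/GP models, which the abstract does not detail.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def split_70_30(data):
    """First 70% for training, last 30% for testing."""
    cut = int(len(data) * 0.7)
    return data[:cut], data[cut:]

# Hypothetical series standing in for the gas furnace output response.
series = [float(i % 10) for i in range(100)]
train, test = split_70_30(series)
# A trivial baseline "model": predict the training mean for every test point.
mean = sum(train) / len(train)
err = rmse(test, [mean] * len(test))
```

For time-series data such as the gas furnace set, an ordered (not shuffled) split like this avoids leaking future samples into training.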

A Survey of Applications of Artificial Intelligence Algorithms in Eco-environmental Modelling

  • Kim, Kang-Suk;Park, Joon-Hong
    • Environmental Engineering Research
    • /
    • v.14 no.2
    • /
    • pp.102-110
    • /
    • 2009
  • The application of artificial intelligence (AI) approaches in eco-environmental modeling has gradually increased over the last decade, and a comprehensive understanding and evaluation of their applicability is needed. In this study, we reviewed previous studies that used AI techniques in eco-environmental modeling. Decision Trees (DT) and Artificial Neural Networks (ANN) were found to be the AI algorithms preferred by researchers in ecological and environmental modeling. When the effect of training data size on model prediction accuracy was explored using data from the previous studies, prediction accuracy and training data size showed a nonlinear correlation best described by a hyperbolic saturation function among the tested nonlinear functions, which included power and logarithmic functions. The hyperbolic saturation equations are proposed as a guideline for optimizing the size of the training data set, which is critically important when designing the field experiments required for training AI-based eco-environmental models.
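A hyperbolic saturation function of the kind described has the form a·n/(b+n); the parameter values below (asymptote 0.95, half-saturation at 200 samples) are illustrative assumptions, not the paper's fitted values.

```python
def hyperbolic_saturation(n, a, b):
    """Prediction accuracy as a saturating function of training-set size n:
    approaches the asymptote a as n grows; b is the half-saturation size."""
    return a * n / (b + n)

# Hypothetical fit: 0.95 asymptotic accuracy, half-saturation at 200 samples.
a, b = 0.95, 200
gains = [hyperbolic_saturation(n, a, b) for n in (100, 200, 400, 800, 10**6)]
```

The diminishing returns are the point of the guideline: doubling the training set far beyond b buys little extra accuracy, so field sampling effort can be capped.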

Software Fault Prediction using Semi-supervised Learning Methods (세미감독형 학습 기법을 사용한 소프트웨어 결함 예측)

  • Hong, Euyseok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.19 no.3
    • /
    • pp.127-133
    • /
    • 2019
  • Most studies of software fault prediction concern supervised learning models that use only labeled training data. Although supervised learning usually shows high prediction performance, most development groups do not have sufficient labeled data. Unsupervised learning models, which train on unlabeled data alone, are difficult to build and show poor performance. Semi-supervised learning models, which use both labeled and unlabeled data, can solve these problems, and among semi-supervised techniques, self-training requires the fewest assumptions and constraints. In this paper, we implemented several models using self-training algorithms and evaluated them using accuracy and AUC. As a result, YATSI showed the best performance.
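Generic self-training can be sketched as below. YATSI and the paper's other algorithms differ in their details; this sketch assumes a nearest-centroid base learner, a margin-based confidence, and a hypothetical 1-D module metric.

```python
def nearest_centroid_fit(X, y):
    """Per-class centroid of a 1-D feature."""
    cents = {}
    for label in set(y):
        vals = [x for x, l in zip(X, y) if l == label]
        cents[label] = sum(vals) / len(vals)
    return cents

def predict_with_conf(cents, x):
    """Label by nearest centroid; confidence = margin between the two nearest."""
    dists = sorted((abs(x - c), label) for label, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def self_train(X_l, y_l, X_u, threshold=5.0, rounds=3):
    """Repeatedly fit on labeled data, then absorb confident unlabeled points."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(rounds):
        cents = nearest_centroid_fit(X_l, y_l)
        scored = [(x, *predict_with_conf(cents, x)) for x in X_u]
        keep = [(x, lab) for x, lab, conf in scored if conf >= threshold]
        if not keep:
            break
        for x, lab in keep:
            X_l.append(x)
            y_l.append(lab)
        kept = {x for x, _ in keep}
        X_u = [x for x in X_u if x not in kept]
    return nearest_centroid_fit(X_l, y_l)

# Hypothetical metric: two labeled modules and four unlabeled ones.
cents = self_train([0.0, 10.0], ["clean", "faulty"], [1.0, 2.0, 8.5, 9.0])
```

The confidence threshold controls the usual self-training trade-off: too low and wrong pseudo-labels pollute the training set, too high and the unlabeled data go unused.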

Performance Improvement of Nearest-neighbor Classification Learning through Prototype Selections (프로토타입 선택을 이용한 최근접 분류 학습의 성능 개선)

  • Hwang, Doo-Sung
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.49 no.2
    • /
    • pp.53-60
    • /
    • 2012
  • Nearest-neighbor classification predicts the class of an input as the most frequent class among the training data nearest to it. Although nearest-neighbor classification has no training stage, all of the training data are needed at prediction time, and generalization performance depends on their quality; as the training set grows, prediction requires large amounts of memory and computation. In this paper, we propose a prototype selection algorithm that predicts the class of test data with a new set of prototypes drawn from near-boundary training data. Based on Tomek links and a distance metric, the algorithm selects boundary data and decides whether each selected datum is added to the prototype set by considering class and distance relationships. In the experiments, the number of prototypes was much smaller than the original training set, yielding reduced storage and fast prediction in nearest-neighbor classification.
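The boundary-pair detection underlying the algorithm can be sketched as follows: a Tomek link is a pair of opposite-class points that are mutual nearest neighbors. The 1-D data are hypothetical, and the paper's further class/distance checks before adding prototypes are not reproduced here.

```python
def nearest_neighbor(i, points):
    """Index of the nearest other point (1-D, Euclidean)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: abs(points[i] - points[j]))

def tomek_links(points, labels):
    """Pairs of opposite-class points that are mutual nearest neighbors."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and nearest_neighbor(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Hypothetical 1-D data: the two classes meet near 5.
pts = [1.0, 2.0, 4.8, 5.2, 8.0, 9.0]
labs = ["A", "A", "A", "B", "B", "B"]
prototypes = sorted(i for pair in tomek_links(pts, labs) for i in pair)
```

Only the two points straddling the class boundary survive as prototype candidates, which is how the method shrinks the training set while preserving the decision boundary.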

Big Data Analysis on the Perception of Home Training According to the Implementation of COVID-19 Social Distancing

  • Hyun-Chang Keum;Kyung-Won Byun
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.15 no.3
    • /
    • pp.211-218
    • /
    • 2023
  • With the implementation of COVID-19 social distancing, interest in 'home training' and its user base grew rapidly. The purpose of this study is therefore to identify perceptions of 'home training' through big data analysis of social media channels and to provide basic data to the related business sector. Big data were collected from news and social content on the Naver and Google sites for the three years from March 22, 2020, when COVID-19 distancing was implemented in Korea: 4,000 Naver blog posts, 2,673 news articles, 4,000 cafe posts, 3,989 Knowledge-iN posts, and 953 Google news items. TF and TF-IDF were computed through text mining, and a semantic network analysis was conducted on 70 keywords; the big data analysis programs Textom and Ucinet were used for the social big data analysis, and NetDraw for visualization. In the text mining analysis, 'home training' appeared most frequently in terms of TF, 4,045 times, followed by 'exercise', 'Homt', 'house', 'apparatus', 'recommendation', and 'diet'. For TF-IDF, the main keywords were 'exercise', 'apparatus', 'home', 'house', 'diet', 'recommendation', and 'mat'. Based on these results, the 70 most frequent keywords were extracted, and semantic indicators and centrality were analyzed. Finally, CONCOR analysis clustered the keywords into a 'purchase' cluster, an 'equipment' cluster, a 'diet' cluster, and an 'execution method' cluster. From these four clusters, basic data on the 'home training' business sector are presented based on consumers' main perceptions of 'home training' and the semantic network analysis.
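The TF-IDF measure the study computes (via Textom) can be illustrated in miniature; the tokenized posts below are hypothetical, and the common tf·log(N/df) variant is assumed since the tool's exact formula is not stated.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF: tf(t, d) * log(N / df(t)),
    where df(t) counts the documents containing term t."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

# Hypothetical tokenized posts about home training.
docs = [
    ["home", "training", "mat"],
    ["home", "training", "diet"],
    ["diet", "recommendation"],
]
scores = tf_idf(docs)
```

Note the effect visible even at this scale: 'mat', which appears in only one post, outscores 'home', which appears in most posts, which is why the TF and TF-IDF keyword rankings in the study differ.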

An Efficient Detection Method for Rail Surface Defect using Limited Label Data (한정된 레이블 데이터를 이용한 효율적인 철도 표면 결함 감지 방법)

  • Seokmin Han
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.24 no.1
    • /
    • pp.83-88
    • /
    • 2024
  • In this research, we propose a semi-supervised railroad surface defect detection method. A ResNet50 model pretrained on ImageNet was employed for training. Unlabeled data are randomly selected and labeled to train the ResNet50 model, and the trained model then predicts the remaining unlabeled training data. Predictions exceeding a certain threshold are selected, sorted in descending order, and added to the training data, with pseudo-labels assigned from the class with the highest probability. An experiment assessed overall classification performance as a function of the initial number of labeled data. The results showed an accuracy of up to 98% with less than 10% of the overall training data labeled.
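The selection step of the pseudo-labeling loop can be sketched as below, independent of the ResNet50 model itself. The softmax outputs, the 0.9 threshold, and the top-k cap are hypothetical values for illustration.

```python
def select_pseudo_labels(probs, threshold=0.9, top_k=2):
    """From per-sample class probabilities, keep predictions whose top
    probability exceeds the threshold, sort by confidence descending,
    and pseudo-label each kept sample with its argmax class."""
    scored = []
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)
        if p[label] >= threshold:
            scored.append((p[label], idx, label))
    scored.sort(reverse=True)
    return [(idx, label) for _, idx, label in scored[:top_k]]

# Hypothetical softmax outputs for four unlabeled rail-surface patches
# (class 0 = normal, class 1 = defect).
probs = [[0.55, 0.45], [0.97, 0.03], [0.10, 0.90], [0.08, 0.92]]
picked = select_pseudo_labels(probs)
```

The selected (index, pseudo-label) pairs would then be appended to the labeled pool and the model retrained, repeating until the unlabeled pool is exhausted or no prediction clears the threshold.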