• Title/Summary/Keyword: dataset


A Study on the Functional Requirements of Record Production System for Dataset: Focused on Case Study of KR Asset Management System (데이터세트 생산시스템 기능요건 연구 KR 재산관리시스템 사례를 중심으로)

  • Ryu, Hanjo; Baek, Youngmi; Yim, Jinhee
    • The Korean Journal of Archival Studies / no.70 / pp.5-40 / 2021
  • Administrative information dataset records produced by the various systems designed for business work are difficult to manage on a case-by-case basis, so separate procedures are required to identify and appraise datasets. Identified dataset records are appraised and then either transferred to the records management system or disposed of. Throughout this process, sufficient records management elements must be built into the production system itself in order to adhere to the principles of records management. In this paper, the functional requirements of a production system that can accurately identify and safely manage datasets were derived and applied, based on the case of the KR asset management system. It is hoped that further research on the functional requirements of production systems will follow, leading to standards for the functional requirements of dataset production systems.

Study on Public Institution Dataset Identification and Evaluation Process: Focusing on the Case of KR Electronic Procurement System (공공기관 데이터세트 식별과 평가 절차 연구 국가철도공단 전자조달시스템 사례를 중심으로)

  • Hwang, Jin Hyun; Baek, Young Mi; Yim, Jin Hee
    • The Korean Journal of Archival Studies / no.70 / pp.41-83 / 2021
  • After the revision of the Enforcement Decree of the Public Records Act, archives were required to create a management standard table for dataset records and to perform management and control accordingly. In this study, a dataset record identification procedure and an evaluation index were therefore developed for the systematic management of dataset records by archives. Applying these, a management standard table was prepared after identifying the records of eight datasets in KR's electronic procurement system; the evaluation was then carried out according to the evaluation index, and the retention period, transfer, and collection were determined. It is hoped that this case study will be of practical use to archives at a time when concrete examples of procedures for managing dataset records are lacking.

Novelty Detection on Web-server Log Dataset (웹서버 로그 데이터의 이상상태 탐지 기법)

  • Lee, Hwaseong; Kim, Ki Su
    • Journal of the Korea Institute of Information and Communication Engineering / v.23 no.10 / pp.1311-1319 / 2019
  • Currently, the web is a widely used environment for sharing information and conducting business, and it has become an attack surface for external hacking aimed at personal information leakage or system failure. Conventional signature-based detection is used against cyber threats, but it has the limitation that changed patterns, such as polymorphic variants, are difficult to detect. In particular, injection attacks are known to be among the most critical security risks based on web vulnerabilities, and new variants can appear at any time. In this paper, we propose a novelty detection technique to detect abnormal states that deviate from the normal state in a web-server log dataset (WSLD). The proposed method is a machine learning-based technique that detects the small amount of anomalous data that tends to differ from the large amount of normal data, after replacing the strings in the web-server log dataset with vectors using a machine learning-based embedding algorithm.
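The abstract above describes embedding log strings as vectors and flagging the few points that sit far from the bulk of normal data. As a rough illustration (not the paper's method), the sketch below embeds each log line as a character-frequency vector and scores its distance to the centroid of the normal lines; the sample logs, embedding, and threshold are all invented for this example.

```python
# Toy novelty detection on log lines: character-frequency embedding plus
# distance to the centroid of normal data. Purely illustrative.
from collections import Counter
from math import sqrt

def embed(line: str) -> dict:
    """Map a log line to a character-frequency vector."""
    total = len(line) or 1
    return {ch: n / total for ch, n in Counter(line).items()}

def distance(a: dict, b: dict) -> float:
    """Euclidean distance over the union of character keys."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}

normal_logs = [
    "GET /index.html HTTP/1.1 200",
    "GET /about.html HTTP/1.1 200",
    "GET /contact.html HTTP/1.1 200",
]
center = centroid([embed(line) for line in normal_logs])

def is_anomalous(line: str, threshold: float = 0.12) -> bool:
    """Flag lines whose embedding sits far from the normal centroid."""
    return distance(embed(line), center) > threshold

print(is_anomalous("GET /index.html HTTP/1.1 200"))            # a normal request
print(is_anomalous("GET /?id=1' OR '1'='1' -- HTTP/1.1 500"))  # injection-like
```

In the paper, the embedding comes from a learned model rather than character counts, but the shape of the decision is the same: score distance from the normal mass, then threshold.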

Hierarchical grouping recommendation system based on the attributes of contents: a case study of 'The Movie Dataset' (콘텐츠 속성에 따른 계층적 그룹화 추천시스템: 'The Movie Dataset' 분석사례연구)

  • Kim, Yoon Kyoung; Yeo, In-Kwon
    • The Korean Journal of Applied Statistics / v.33 no.6 / pp.833-842 / 2020
  • Global platforms such as Netflix, Amazon, and YouTube have developed precise recommendation systems based on a variety of information from large sets of customers, and many of the items recommended there lead to actual purchases. In this paper, a cluster analysis was conducted according to the attributes of the content, on the expectation that user preferences would differ according to the attributes of the recommended content. The Gower distance was used so that variables of any type could be handled. Using the movie rating data of 'The Movie Dataset', users were grouped hierarchically and movies were recommended based on genre, director, and actor variables. To evaluate the proposed recommendation systems, the user group was divided into a training set and a test set, and precision was examined. The results showed that the proposed algorithms have far higher precision than UBCF.
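The Gower distance mentioned above combines numeric and categorical variables into a single dissimilarity: range-scaled absolute difference for numeric variables, simple mismatch for categorical ones. A minimal sketch, with toy movie records and variable names invented for illustration:

```python
# Illustrative Gower distance over mixed-type attributes.
def gower_distance(a: dict, b: dict, numeric_ranges: dict) -> float:
    """Average per-variable dissimilarity: |x - y| / range for numeric
    variables, 0/1 mismatch for categorical ones."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:                       # numeric variable
            rng = numeric_ranges[key]
            total += abs(a[key] - b[key]) / rng if rng else 0.0
        else:                                           # categorical variable
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

movie_a = {"genre": "Drama", "director": "Lee", "year": 2010}
movie_b = {"genre": "Drama", "director": "Kim", "year": 2020}
ranges = {"year": 40}  # observed range of 'year' in the toy data

print(gower_distance(movie_a, movie_b, ranges))  # (0 + 1 + 10/40) / 3 ≈ 0.4167
```

Pairwise distances like this can feed directly into a hierarchical clustering routine, which is how the grouping step in the paper proceeds.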

Sentence Filtering Dataset Construction Method about Web Corpus (웹 말뭉치에 대한 문장 필터링 데이터 셋 구축 방법)

  • Nam, Chung-Hyeon; Jang, Kyung-Sik
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.11 / pp.1505-1511 / 2021
  • Pretrained models that achieve high performance on various natural language processing tasks have the advantage of learning the linguistic patterns of sentences from a large corpus during training, allowing each token in an input sentence to be represented by an appropriate feature vector. One method of constructing the corpus required to train a pretrained model is collection with a web crawler. However, sentences on the web have varied patterns and may contain unnecessary words in part or all of a sentence. In this paper, we propose a dataset construction method that uses neural network models to filter sentences containing unnecessary words from a corpus collected from the web. As a result, we constructed a dataset containing a total of 2,330 sentences. We also evaluated the performance of neural network models on the constructed dataset, and the BERT model showed the highest performance, with an accuracy of 93.75%.
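As a toy illustration of the filtering task (the paper trains neural classifiers such as BERT rather than rules), a keyword-based filter for crawled sentences might look like this; the noise-token list and example sentences are invented:

```python
# Rule-based stand-in for the sentence-filtering step on a web corpus.
NOISE_TOKENS = {"click here", "subscribe", "cookie policy", "advertisement"}

def is_noisy(sentence: str) -> bool:
    """Flag a crawled sentence containing boilerplate web tokens."""
    lowered = sentence.lower()
    return any(tok in lowered for tok in NOISE_TOKENS)

crawled = [
    "Pretrained language models learn linguistic patterns from large corpora.",
    "Click here to subscribe to our newsletter!",
]
clean = [s for s in crawled if not is_noisy(s)]
print(len(clean))  # 1
```

A trained classifier replaces `is_noisy` in the actual pipeline; the surrounding dataset-construction loop stays the same.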

A Tuberculosis Detection Method Using Attention and Sparse R-CNN

  • Xu, Xuebin; Zhang, Jiada; Cheng, Xiaorui; Lu, Longbin; Zhao, Yuqing; Xu, Zongyu; Gu, Zhuangzhuang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.7 / pp.2131-2153 / 2022
  • To achieve accurate detection of tuberculosis (TB) areas in chest radiographs, we design a chest X-ray TB area detection algorithm. The algorithm consists of two stages: the chest X-ray TB classification network (CXTCNet) and the chest X-ray TB area detection network (CXTDNet). CXTCNet judges the presence or absence of TB areas in chest X-ray images, thereby excluding the influence of other lung diseases on the detection of TB areas; this reduces false positives in the detection network and improves the accuracy of the detection results. In CXTCNet, we propose a channel attention mechanism (CAM) module and combine it with DenseNet. This module enables the network to learn more spatial and channel feature information from chest X-ray images, thereby improving network performance. CXTDNet is based on a sparse object detection algorithm (Sparse R-CNN): a group of fixed, learnable proposal boxes and learnable proposal features is used for classification and localization, and the predictions of the algorithm are output directly without non-maximum suppression post-processing. Furthermore, we use CLAHE during data preprocessing to reduce image noise and improve image quality. Experiments on the TBX11K dataset show that the accuracy of the proposed CXTCNet reaches 99.10%, better than most current TB classification algorithms. Finally, our proposed chest X-ray TB detection algorithm achieves an AP of 45.35% and an AP50 of 74.20%. We also establish a chest X-ray TB dataset of 304 images, and experiments on this dataset showed that the diagnostic accuracy was comparable to that of radiologists. We hope that our proposed algorithm and established dataset will advance the field of TB detection.

Normal data based rotating machine anomaly detection using CNN with self-labeling

  • Bae, Jaewoong; Jung, Wonho; Park, Yong-Hwa
    • Smart Structures and Systems / v.29 no.6 / pp.757-766 / 2022
  • To train deep learning algorithms, a sufficient amount of data is required. However, in most engineering systems, the acquisition of fault data is difficult or sometimes infeasible, while normal data are readily secured. This dearth of data is one of the major challenges to developing deep learning models; fault diagnosis in particular cannot be performed in the absence of fault data. In this context, this paper proposes an anomaly detection methodology for rotating machines that uses only normal data with self-labeling. Since only normal data are used for anomaly detection, a self-labeling method is used to generate a new labeled dataset. The overall procedure comprises three steps: (1) transforming normal data into self-labeled data based on a pretext task, (2) training a convolutional neural network (CNN), and (3) detecting anomalies using an anomaly score defined from the softmax output of the trained CNN, since the softmax values of abnormal samples behave differently from those of normal samples. To verify the proposed method, four case studies were conducted, on the Case Western Reserve University (CWRU) bearing dataset, the IEEE PHM 2012 data challenge dataset, the PHMAP 2021 data challenge dataset, and a laboratory bearing testbed, and the results were compared with those of existing machine learning and deep learning methods. The proposed algorithm detected faults in the bearing testbed and compressor with over 99.7% accuracy. In particular, it detected not only bearing faults but also structural faults such as unbalance and belt looseness with very high accuracy. Compared with existing GAN- and autoencoder-based anomaly detection algorithms, the proposed method showed high anomaly detection performance.
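The softmax-based anomaly scoring described above can be sketched in a few lines: a sample on which the self-labeled classifier is confident (one softmax output near 1) scores low, while a sample producing a near-uniform softmax scores high. The logits below are invented for illustration; the actual score definition in the paper may differ.

```python
# Softmax-confidence anomaly score: higher when the classifier is unsure.
from math import exp

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def anomaly_score(logits) -> float:
    """1 minus the largest class probability of the pretext-task classifier."""
    return 1.0 - max(softmax(logits))

normal_logits = [8.0, 0.5, 0.3, 0.2]    # confident prediction on normal data
abnormal_logits = [1.1, 0.9, 1.0, 1.2]  # near-uniform output on an anomaly

print(anomaly_score(normal_logits) < anomaly_score(abnormal_logits))  # True
```

A threshold on this score then separates normal operation from faults, which is the decision rule step (3) describes.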

Construction of a Spatio-Temporal Dataset for Deep Learning-Based Precipitation Nowcasting

  • Kim, Wonsu; Jang, Dongmin; Park, Sung Won; Yang, MyungSeok
    • Journal of Information Science Theory and Practice / v.10 no.spc / pp.135-142 / 2022
  • Recently, with the development of data processing technology and the increase in computational power, methods for solving social problems using artificial intelligence (AI) are in the spotlight, and AI technologies are replacing and supplementing existing traditional methods in various fields. Meanwhile, in Korea, heavy rain is one of the representative natural disaster factors, causing enormous economic damage and casualties every year. Accurate prediction of heavy rainfall over the Korean Peninsula is very difficult due to its geographical position between the Eurasian continent and the Pacific Ocean at mid-latitude, and due to the influence of the summer monsoon. To deal with these problems, the Korea Meteorological Administration operates various state-of-the-art observation equipment and a newly developed global atmospheric model system. Nevertheless, precipitation nowcasting requires a separate system based on extrapolation because of the intrinsic characteristics of operating numerical weather prediction models. The predictability of existing precipitation nowcasting is reliable in the early stage of the forecast but decreases sharply as the forecast lead time increases. Here, AI technologies that handle the spatio-temporal features of data are expected to contribute greatly to overcoming the limitations of existing precipitation nowcasting systems. Thus, in this project, the dataset required to develop, train, and verify deep learning-based precipitation nowcasting models has been constructed in a regularized form. The dataset not only provides various variables obtained from multiple sources, but these variables also coincide with each other in their spatio-temporal specifications.

Small-Scale Object Detection Label Reassignment Strategy

  • An, Jung-In; Kim, Yoon; Choi, Hyun-Soo
    • Journal of the Korea Society of Computer and Information / v.27 no.12 / pp.77-84 / 2022
  • In this paper, we propose a Label Reassignment Strategy to improve the performance of an object detection algorithm. Our approach involves two stages: an inference stage and an assignment stage. In the inference stage, we perform multi-scale inference with predefined scale sizes on a trained model and re-infer masked images to obtain robust classification results. In the assignment stage, we calculate the IoU between bounding boxes to remove duplicates, and we check box and class occurrences between the detection results and the annotation labels to reassign the dominant class type. We trained the YOLOX-L model on the re-annotated dataset to validate our strategy. The model achieved a 3.9% improvement in mAP and 3x better AP_S compared with the model trained on the original dataset. These results demonstrate that the proposed Label Reassignment Strategy can effectively improve the performance of an object detection model.
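The IoU computation used in the duplicate-removal step above is standard and can be sketched briefly; boxes here are (x1, y1, x2, y2) tuples, and the example coordinates are illustrative:

```python
# Intersection-over-union between two axis-aligned bounding boxes.
def iou(a, b) -> float:
    """IoU of boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)  # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ≈ 0.143
```

Two detections whose IoU exceeds a chosen threshold are treated as duplicates of the same object, so only one survives into the reassigned label set.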

Boosting the Performance of the Predictive Model on the Imbalanced Dataset Using SVM Based Bagging and Out-of-Distribution Detection (SVM 기반 Bagging과 OoD 탐색을 활용한 제조공정의 불균형 Dataset에 대한 예측모델의 성능향상)

  • Kim, Jong Hoon; Oh, Hayoung
    • KIPS Transactions on Software and Data Engineering / v.11 no.11 / pp.455-464 / 2022
  • Datasets from a manufacturing process have two distinctive characteristics: severe class imbalance and many out-of-distribution (OoD) samples. Strategies such as oversampling the minority class and down-sampling the majority class are well known for handling class imbalance, and SMOTE has recently been a popular choice for addressing the issue. Out-of-distribution samples, however, have been studied mainly with neural networks; OoD detection has rarely been applied to predictive models built on conventional machine learning algorithms such as SVM, random forest, and KNN. Conventional machine learning algorithms can outperform neural networks in prediction performance here, because neural networks are vulnerable to over-fitting and require much larger datasets than conventional algorithms do. We therefore suggest a new approach that applies out-of-distribution detection based on the SVM algorithm. In addition, a bagging technique is adopted to improve the precision of the model.
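The two dataset issues named above, class imbalance and OoD samples, can each be illustrated with a toy sketch: random duplication of the minority class (a simpler stand-in for SMOTE) and a nearest-neighbor distance test standing in for the SVM-based OoD detector. All data, labels, and thresholds below are synthetic.

```python
# Toy sketches of minority-class oversampling and OoD filtering.
import random

def oversample(samples, labels, minority):
    """Duplicate minority-class samples until the classes are balanced."""
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    majority_n = len(labels) - len(minority_idx)
    out_x, out_y = list(samples), list(labels)
    while sum(1 for lab in out_y if lab == minority) < majority_n:
        i = random.choice(minority_idx)
        out_x.append(samples[i])
        out_y.append(minority)
    return out_x, out_y

def is_ood(x, train, threshold):
    """Flag a sample far from every training point (illustrative OoD test)."""
    return min(abs(x - t) for t in train) > threshold

X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 1.2, 0.95]
y = [1, 1, 1, 0, 0, 0, 0, 0]          # class 1 is the minority
Xb, yb = oversample(X, y, minority=1)
print(yb.count(1) == yb.count(0))      # True: balanced after oversampling

print(is_ood(5.0, X, threshold=0.5))   # True: far from all training data
```

In the paper's setting, the OoD score would come from the SVM itself rather than raw distances, and SMOTE would interpolate new minority samples rather than duplicate existing ones.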