• Title/Summary/Keyword: Dataset construction

Sentence Filtering Dataset Construction Method about Web Corpus (웹 말뭉치에 대한 문장 필터링 데이터 셋 구축 방법)

  • Nam, Chung-Hyeon;Jang, Kyung-Sik
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.11 / pp.1505-1511 / 2021
  • Pretrained models that perform well on various natural language processing tasks have the advantage of learning the linguistic patterns of sentences from a large corpus during training, allowing each token in the input sentence to be represented with an appropriate feature vector. One method of constructing the corpus required to train a pretrained model is collection with a web crawler. However, sentences on the web follow various patterns and may contain unnecessary words in part or all of the sentence. In this paper, we propose a dataset construction method for filtering sentences containing unnecessary words, using neural network models, for corpora collected from the web. As a result, we construct a dataset containing a total of 2,330 sentences. We also evaluated the performance of neural network models on the constructed dataset, and the BERT model showed the highest performance with an accuracy of 93.75%.

A Study on Synthetic Dataset Generation Method for Maritime Traffic Situation Awareness (해상교통 상황인지 향상을 위한 합성 데이터셋 구축방안 연구)

  • Youngchae Lee;Sekil Park
    • Journal of Information Technology Applications and Management / v.30 no.6 / pp.69-80 / 2023
  • Ship collision accidents not only cause loss of life and property damage, but also cause marine pollution and can become national disasters, so prevention is very important. Most of these ship collision accidents are caused by human factors due to the navigation officer's lack of vigilance and carelessness, and in many cases, they can be prevented through the support of a system that helps with situation awareness. Recently, artificial intelligence has been used to develop systems that help navigators recognize the situation, but the sea is very wide and deep, so it is difficult to secure maritime traffic datasets, which also makes it difficult to develop artificial intelligence models. In this paper, to solve these difficulties, we propose a method to build a dataset with characteristics similar to actual maritime traffic datasets. The proposed method uses segmentation and inpainting technologies to build a foreground and background dataset, and then applies compositing technology to create a synthetic dataset. Through prototype implementation and result analysis of the proposed method, it was confirmed that the proposed method is effective in overcoming the difficulties of dataset construction and complementing various scenes similar to reality.
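
The compositing step described above can be illustrated with a minimal sketch. It assumes images are plain nested lists of pixel values and that the segmentation step has already produced a binary mask for the foreground; the function name `composite` and its parameters are hypothetical and are not taken from the paper.

```python
def composite(background, foreground, mask, top, left):
    """Paste foreground pixels onto a copy of background wherever mask is 1.

    background, foreground, mask: 2D lists (rows of pixel values).
    top, left: position of the foreground's upper-left corner in the background.
    """
    out = [row[:] for row in background]  # leave the input background untouched
    height, width = len(out), len(out[0])
    for i, (fg_row, mask_row) in enumerate(zip(foreground, mask)):
        for j, (pixel, keep) in enumerate(zip(fg_row, mask_row)):
            y, x = top + i, left + j
            # Copy only masked pixels that fall inside the background.
            if keep and 0 <= y < height and 0 <= x < width:
                out[y][x] = pixel
    return out


# Hypothetical toy data: a 3x3 "sea" background and a 1x2 "ship" patch.
sea = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
ship = [[9, 9]]
ship_mask = [[1, 0]]  # only the first pixel belongs to the ship
scene = composite(sea, ship, ship_mask, top=1, left=1)
```

Varying `top` and `left` over many backgrounds is one simple way such a pipeline could generate diverse synthetic scenes from a small set of segmented foregrounds.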

Token-Based Classification and Dataset Construction for Detecting Modified Profanity (변형된 비속어 탐지를 위한 토큰 기반의 분류 및 데이터셋)

  • Sungmin Ko;Youhyun Shin
    • The Transactions of the Korea Information Processing Society / v.13 no.4 / pp.181-188 / 2024
  • Traditional profanity detection methods have limitations in identifying intentionally altered profanities. This paper introduces a new method based on Named Entity Recognition, a subfield of Natural Language Processing. We developed a profanity detection technique using sequence labeling, for which we constructed a dataset by labeling some profanities in Korean malicious comments and conducted experiments. Additionally, to enhance the model's performance, we augmented the dataset by labeling parts of a Korean hate speech dataset using a large language model, ChatGPT, and conducted training. During this process, we confirmed that human filtering alone of the dataset created by the large language model could improve performance. This suggests that human oversight is still necessary in the dataset augmentation process.
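
A sequence-labeling dataset of the kind described above is often stored as per-token tags. The sketch below assumes whitespace tokenization and the common BIO tagging scheme; the abstract does not state which tokenizer or tag set the authors used, and the function name `bio_labels` and the `PROF` label are hypothetical.

```python
def bio_labels(text, spans, label="PROF"):
    """Convert character-level annotations into per-token BIO tags.

    text:  the raw comment.
    spans: list of (start, end) character offsets marking annotated regions.
    Returns (tokens, tags) with one tag per whitespace-separated token.
    """
    tokens, tags = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate this token in the original text
        end = start + len(tok)
        pos = end
        tag = "O"
        for s, e in spans:
            if start < e and end > s:  # token overlaps the annotated span
                # The token containing the span start opens it; later ones continue it.
                tag = ("B-" if start <= s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags


# Placeholder example: "b@d w0rd" stands in for an altered profanity.
comment = "you are a b@d w0rd person"
span_start = comment.index("b@d")
tokens, tags = bio_labels(comment, [(span_start, span_start + len("b@d w0rd"))])
```

Tags produced this way, whether by human annotators or by a language model, can then be reviewed token by token, which is the kind of human filtering the abstract reports as beneficial.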

Construction of LiDAR Dataset for Autonomous Driving Considering Domestic Environments and Design of Effective 3D Object Detection Model (국내 주행환경을 고려한 자율주행 라이다 데이터 셋 구축 및 효과적인 3D 객체 검출 모델 설계)

  • Jin-Hee Lee;Jae-Keun Lee;Joohyun Lee;Je-Seok Kim;Soon Kwon
    • IEMEK Journal of Embedded Systems and Applications / v.18 no.5 / pp.203-208 / 2023
  • Recently, with the growing interest in the field of autonomous driving, many researchers have been focusing on developing autonomous driving software platforms. In particular, we have concentrated on developing 3D object detection models that can improve real-time performance. In this paper, we introduce a self-constructed 3D LiDAR dataset specific to domestic environments and propose a VariFocal-based CenterPoint for the 3D object detection model, with improved performance over the previous models. Furthermore, we present experimental results comparing the performance of the 3D object detection modules using our self-built dataset and a public dataset. As the results show, our model, which was trained on a large amount of self-constructed data, successfully solves the issue of failing to detect large vehicles and small objects such as motorcycles and pedestrians, which the previous models had difficulty detecting. Consequently, the proposed model shows a performance improvement of about 1.0 mAP over the previous model.

Deterministic and probabilistic analysis of tunnel face stability using support vector machine

  • Li, Bin;Fu, Yong;Hong, Yi;Cao, Zijun
    • Geomechanics and Engineering / v.25 no.1 / pp.17-30 / 2021
  • This paper develops a convenient approach for deterministic and probabilistic evaluations of tunnel face stability using support vector machine classifiers. The proposed method comprises two major steps, i.e., construction of the training dataset and determination of instance-based classifiers. In step one, the orthogonal design is utilized to produce representative samples after the ranges and levels of the factors that influence tunnel face stability are specified. The training dataset is then labeled by two-dimensional strength reduction analyses embedded within OptumG2. For any unknown instance, the second step applies the training dataset for classification, which is achieved by an ad hoc Python program. The classification of unknown samples starts with selection of instance-based training samples using the k-nearest neighbors algorithm, followed by the construction of an instance-based SVM-KNN classifier. It eventually provides labels for the unknown instances, avoiding the need to calculate their corresponding performance functions. Probabilistic evaluations are performed by Monte Carlo simulation based on the SVM-KNN classifier. The ratio of the number of unstable samples to the total number of simulated samples is computed and is taken as the failure probability, which is validated and compared with the response surface method.
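
The Monte Carlo step described above, where the failure probability is the ratio of unstable samples to all simulated samples, can be sketched as follows. This is not the authors' program: a plain k-nearest-neighbors majority vote stands in for the SVM-KNN classifier, and the two input factors, the toy labeling rule, and all names (`TRAIN`, `knn_label`, `failure_probability`) are hypothetical.

```python
import math
import random


def knn_label(x, train, k=5):
    """Majority-vote label (1 = unstable, 0 = stable) of the k nearest training samples."""
    nearest = sorted(train, key=lambda sample: math.dist(x, sample[0]))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def failure_probability(classify, draw_sample, n=10_000, seed=0):
    """Monte Carlo estimate: number of unstable samples / total simulated samples."""
    rng = random.Random(seed)
    unstable = sum(classify(draw_sample(rng)) for _ in range(n))
    return unstable / n


# Hypothetical training set on a regular grid of two normalized factors.
# In the paper the labels come from strength reduction analyses; here a toy
# rule (unstable when the factors sum to more than 1) takes their place.
TRAIN = [((i / 10, j / 10), 1 if i + j > 10 else 0)
         for i in range(11) for j in range(11)]

p_fail = failure_probability(
    lambda x: knn_label(x, TRAIN),
    lambda rng: (rng.random(), rng.random()),  # assumed uniform input factors
    n=2000,
)
```

Because each simulated sample only needs a classifier call rather than a full numerical analysis, the estimate stays cheap even for large `n`, which is the practical appeal of the surrogate approach the abstract describes.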

A Construction of Geographical Distance-based Air Quality Dataset Using Hospital Location Information (병원위치정보를 이용한 지리적 거리기반의 대기환경 데이터셋 구축)

  • Kim, Hyeongsoo;Ryu, Keun Ho
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.34 no.3 / pp.231-242 / 2016
  • Recently, air quality information has been actively gathered and investigated in order to find possible environmental risk factors that may affect the onset of cardiovascular disease. Nevertheless, existing studies are limited in how detailed their analysis can be, because they rely on macro-level air quality statistics divided by administrative district. This paper proposes the construction of a distance-based air quality dataset using a domestic hospital's geographical location information as a reliable data gathering step for a more detailed analysis of environmental risk factors. For the construction of the dataset, air quality information was obtained by taking the geographical location of a hospital to which a patient with cardiovascular disease had been admitted and then matching the hospital with a meteorological and air pollution station in its vicinity. An air quality acquisition system based on GMap.net was devised for the purpose of data gathering and visualization. The reliability of the experiment was confirmed by evaluating the matching rate and the error in air quality values between the acquired dataset and existing area-based air quality datasets from matched distances. Therefore, this dataset, which considers geographical information, can be utilized in multidisciplinary research for the discovery of environmental risk factors that can affect not only cardiovascular diseases but also potentially other epidemic diseases.
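
Matching a hospital to a station in its vicinity can be sketched with a great-circle distance. The abstract does not say which distance formula the authors used, so the haversine formula here is an assumption, and the station identifiers and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(hospital, stations):
    """Return the (name, lat, lon) station closest to the hospital's (lat, lon)."""
    return min(
        stations,
        key=lambda s: haversine_km(hospital[0], hospital[1], s[1], s[2]),
    )


# Hypothetical stations placed at approximate city-center coordinates.
STATIONS = [
    ("station_A", 37.5665, 126.9780),  # near Seoul
    ("station_B", 35.1796, 129.0756),  # near Busan
]
match = nearest_station((37.50, 127.00), STATIONS)
```

Recording the matched distance alongside each air quality value would make it possible to filter out hospitals whose nearest station is too far away to be representative.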

ENERGY EFFICIENT BUILDING DESIGN THROUGH DATA MINING APPROACH

  • Hyunjoo Kim;Wooyoung Kim
    • International conference on construction engineering and project management / 2009.05a / pp.601-605 / 2009
  • The objective of this research is to develop a knowledge discovery framework which can help project teams discover useful patterns to improve energy efficient building design. This paper utilizes the technology of data mining to automatically extract concepts, interrelationships and patterns of interest from a large dataset. By applying data mining technology to the analysis of energy efficient building designs one can identify valid, useful, and previously unknown patterns of energy simulation modeling.

Construction of a Video Dataset for Face Tracking Benchmarking Using a Ground Truth Generation Tool

  • Do, Luu Ngoc;Yang, Hyung Jeong;Kim, Soo Hyung;Lee, Guee Sang;Na, In Seop;Kim, Sun Hee
    • International Journal of Contents / v.10 no.1 / pp.1-11 / 2014
  • In the current generation of smart mobile devices, object tracking is one of the most important research topics for computer vision. Because human face tracking can be widely used for many applications, collecting a dataset of face videos is necessary for evaluating the performance of a tracker and for comparing different approaches. Unfortunately, the well-known benchmark datasets of face videos are not sufficiently diverse. As a result, it is difficult to compare the accuracy between different tracking algorithms in various conditions, namely illumination, background complexity, and subject movement. In this paper, we propose a new dataset that includes 91 face video clips that were recorded in different conditions. We also provide a semi-automatic ground-truth generation tool that can easily be used to evaluate the performance of face tracking systems. This tool helps to maintain the consistency of the definitions for the ground-truth in each frame. The resulting video data set is used to evaluate well-known approaches and test their efficiency.

Performance analysis of deep learning-based automatic classification of upper endoscopic images according to data construction (딥러닝 기반 상부위장관 내시경 이미지 자동분류의 데이터 구성별 성능 분석 연구)

  • Seo, Jeong Min;Lim, Sang Heon;Kim, Yung Jae;Chung, Jun Won;Kim, Kwang Gi
    • Journal of Korea Multimedia Society / v.25 no.3 / pp.451-460 / 2022
  • Recently, several deep learning studies have been reported to automatically identify the location of diagnostic devices using endoscopic data. Previous studies were not designed to determine whether the configuration of the dataset leads to differences in the accuracy with which artificial intelligence models perform image classification. Studies that are based on large amounts of data are likely to have different results depending on the composition of the dataset or its proportions. In this study, we intended to determine the existence and extent of differences in accuracy according to the composition of the dataset by compiling it into three main types using larynx, esophagus, gastroscopy, and laryngeal endoscopy images.