• Title/Summary/Keyword: datasets

Research on Diversification of Transfer Specifications and Reproduction Methods for Administrative Information Datasets (행정정보 데이터세트의 이관규격의 다양화 및 재현 방안에 관한 연구)

  • Dongmin Yang;Kwanghoon Choi;Ji-Hye Kim;Nam-Hee Yoo
    • Journal of the Korean Society for information Management / v.40 no.4 / pp.167-200 / 2023
  • For the record management of administrative information datasets in Korea, SIARD is the recommended transfer specification. However, SIARD is often unsuitable because of the record management unit of administrative information datasets, the technical limitations of tools that support SIARD, and the practical circumstances of public institutions. In this study, we propose a plan to diversify the transfer specifications of administrative information datasets beyond SIARD. In the record management of administrative information datasets, the need to reproduce the user interface associated with the dataset has been discussed, but no concrete method has been presented. This study confirms that the user interface is a property to be preserved from the perspective of Significant Properties, proposes a method to effectively reproduce the user interface, and provides an example of actual verification.

A Study on the Improvement of the Management Reference Tables for Datasets in Administrative Information Systems (행정정보 데이터세트의 관리기준표 개선방안 연구)

  • Lee, Jung-eun;Kim, Ji-Hye;Wang, Ho-sung;Yang, Dongmin
    • Journal of Korean Society of Archives and Records Management / v.22 no.1 / pp.177-200 / 2022
  • Administrative information datasets are a kind of record produced in the course of an organization's work. A dataset is evidence of the act of recording and contains a lot of information that can be used for work. Datasets have been neglected in Korea's records management system. However, with the revision of the law in 2020, the management of administrative information datasets was legislated. Organizations that are required to manage administrative information datasets have gradually begun record management. The core of managing administrative information datasets is the preparation of the Management Reference Table for the dataset. Nevertheless, institutions that carry out records management confuse it with the Records Management Reference Table, and the work is difficult because the Management Reference Table for Dataset is a new concept. This study looked into the problems in the records management of datasets that appeared at the beginning of this work. It suggests a method to effectively establish records management for datasets. To that end, the Management Reference Table was selected as the research subject, and the problems discussed so far were summarized. In addition, the items of the current Management Reference Table were analyzed. As a result of the study, we propose the simplification of items in the Management Reference Table, the reorganization of areas in the Management Reference Table, the introduction of the concept of retention periods, and a preparation process for the Management Reference Table.

Comprehensive analysis of deep learning-based target classifiers in small and imbalanced active sonar datasets (소량 및 불균형 능동소나 데이터세트에 대한 딥러닝 기반 표적식별기의 종합적인 분석)

  • Geunhwan Kim;Youngsang Hwang;Sungjin Shin;Juho Kim;Soobok Hwang;Youngmin Choo
    • The Journal of the Acoustical Society of Korea / v.42 no.4 / pp.329-344 / 2023
  • In this study, we comprehensively analyze the generalization performance of various deep learning-based active sonar target classifiers when applied to small and imbalanced active sonar datasets. To generate the active sonar datasets, we use data from two oceanic experiments conducted at different times and in different ocean areas. Each sample in the active sonar datasets is a time-frequency domain image, which is extracted from the audio signal of a contact after the detection process. For the comprehensive analysis, we utilize 22 convolutional neural network (CNN) models. The two datasets are used alternately as the train/validation dataset and the test dataset. To calculate the variance in the output of the target classifiers, the train/validation/test procedure is repeated 10 times. Hyperparameters for training are optimized using Bayesian optimization. The results demonstrate that shallow CNN models show superior robustness and generalization performance compared to most deep CNN models. The results from this paper can serve as a valuable reference for future research directions in deep learning-based active sonar target classification.

Non-negligible Occurrence of Errors in Gender Description in Public Data Sets

  • Kim, Jong Hwan;Park, Jong-Luyl;Kim, Seon-Young
    • Genomics & Informatics / v.14 no.1 / pp.34-40 / 2016
  • Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public datasets, we observed that errors in gender information occur quite often. When we analyzed the gender description and the methylation patterns of gender-specific probes (glucose-6-phosphate dehydrogenase [G6PD], ephrin-B1 [EFNB1], and testis specific protein, Y-linked 2 [TSPY2]) in 5,611 samples produced using Infinium 450K HumanMethylation arrays, we found that 19 samples from 7 datasets were erroneously described. We also analyzed 1,819 samples produced using the Affymetrix U133Plus2 array using several gender-specific genes (X (inactive)-specific transcript [XIST], eukaryotic translation initiation factor 1A, Y-linked [EIF1AY], and DEAD [Asp-Glu-Ala-Asp] box polypeptide 3, Y-linked [DDX3Y]) and found that 40 samples from 3 datasets were erroneously described. We suggest that users of public datasets should not expect the data to be error-free and, whenever possible, should check the consistency of the data.
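
    The kind of consistency check this abstract recommends can be sketched roughly as follows. This is only an illustration, not the authors' procedure: the `Sample` fields, the expression scale, and both cutoffs are hypothetical, and a real analysis would choose thresholds from the distribution of each platform's data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    sample_id: str
    reported_sex: str  # "F" or "M", as annotated in the repository
    xist: float        # XIST expression (hypothetical normalized units)
    y_linked: float    # expression of a Y-linked gene such as EIF1AY


def infer_sex(sample: Sample, xist_cutoff: float = 5.0,
              y_cutoff: float = 5.0) -> Optional[str]:
    """Infer sex from marker genes; cutoffs are illustrative assumptions."""
    high_xist = sample.xist >= xist_cutoff
    high_y = sample.y_linked >= y_cutoff
    if high_xist and not high_y:
        return "F"
    if high_y and not high_xist:
        return "M"
    return None  # ambiguous pattern: do not call a mismatch


def find_mismatches(samples: List[Sample]) -> List[str]:
    """Return IDs whose annotated sex disagrees with the inferred sex."""
    mismatched = []
    for sample in samples:
        inferred = infer_sex(sample)
        if inferred is not None and inferred != sample.reported_sex:
            mismatched.append(sample.sample_id)
    return mismatched
```

    Samples with an ambiguous marker pattern are deliberately left unflagged, so the list contains only clear disagreements that merit a manual check.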

Establishing the Process of Spatial Informatization Using Data from Social Network Services

  • Eo, Seung-Won;Lee, Youngmin;Yu, Kiyun;Park, Woojin
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.34 no.2 / pp.111-120 / 2016
  • Prior knowledge about SNS (Social Network Services) datasets is often required to conduct valuable analysis using social media data. Understanding of the characteristics of the information extracted from SNS datasets leaves much to be desired in many ways. This paper aims to analyze in detail the target social network services, Twitter, Instagram, and YouTube, in order to establish a spatial informatization process that integrates social media information with existing spatial datasets. In this study, valuable information in SNS datasets was selected and a total of 12,938 records were collected in Seoul via Open API. The dataset was geo-coded and converted into point form. We also removed the overlapping values of the dataset to conduct spatial integration with the existing building layers. The result of this spatial integration process will be utilized in various industries and become a fundamental resource for further studies related to geospatial integration using social media datasets.

High Utility Itemset Mining over Uncertain Datasets Based on a Quantum Genetic Algorithm

  • Wang, Ju;Liu, Fuxian;Jin, Chunjie
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.8 / pp.3606-3629 / 2018
  • The discovered high potential utility itemsets (HPUIs) have significant influence on a variety of areas, such as retail marketing, web click analysis, and biological gene analysis. Thus, in this paper, we propose an algorithm called HPUIM-QGA (Mining high potential utility itemsets based on a quantum genetic algorithm) to mine HPUIs over uncertain datasets based on a quantum genetic algorithm (QGA). The proposed algorithm handles the problem of the non-downward closure property by developing an upper bound of the potential utility (UBPU), which prunes unpromising itemsets at an early stage, and handles the problem of combinatorial explosion by introducing a QGA, which finds optimal solutions quickly and requires very few parameters to be set. Furthermore, a pruning strategy has been designed to avoid the meaningless and redundant itemsets that are generated in the evolution process of the QGA. To evaluate HPUIM-QGA, a substantial number of experiments on runtime, memory usage, analysis of the discovered itemsets, and convergence are performed on real-life and synthetic datasets. The results show that our proposed algorithm is reasonable and acceptable for mining meaningful HPUIs from uncertain datasets.

Land Cover Classification Map of Northeast Asia Using GOCI Data

  • Son, Sanghun;Kim, Jinsoo
    • Korean Journal of Remote Sensing / v.35 no.1 / pp.83-92 / 2019
  • Land cover (LC) is an important factor in socioeconomic and environmental studies. According to various studies, a number of LC maps, including global land cover (GLC) datasets, are made using polar orbit satellite data. Due to the insufficiency of reference datasets in Northeast Asia, several LC maps display discrepancies in that region. In this paper, we performed a feasibility assessment of LC mapping using Geostationary Ocean Color Imager (GOCI) data over Northeast Asia. To produce the LC map, the GOCI normalized difference vegetation index (NDVI) was used as an input dataset and a level-2 LC map of South Korea was used as a reference dataset to evaluate the LC map. In this paper, 7 LC types (urban, croplands, forest, grasslands, wetlands, barren, and water) were defined to reflect Northeast Asian LC. The LC map was produced via principal component analysis (PCA) with K-means clustering, and a sensitivity analysis was performed. The overall accuracy was calculated to be 77.94%. Furthermore, to assess the accuracy of the LC map not only in South Korea but also in Northeast Asia, 6 GLC datasets (IGBP, UMD, GLC2000, GlobCover2009, MCD12Q1, GlobeLand30) were used as comparison datasets. The accuracy scores for the 6 GLC datasets were calculated to be 59.41%, 56.82%, 60.97%, 51.71%, 70.24%, and 72.80%, respectively. Therefore, the first attempt to produce the LC map using geostationary satellite data is considered to be acceptable.

Development of Tourism Information Named Entity Recognition Datasets for the Fine-tune KoBERT-CRF Model

  • Jwa, Myeong-Cheol;Jwa, Jeong-Woo
    • International Journal of Internet, Broadcasting and Communication / v.14 no.2 / pp.55-62 / 2022
  • A smart tourism chatbot is needed as a user interface to efficiently provide smart tourism services such as recommended travel products, tourist information, my travel itinerary, and tour guide service to tourists. We have developed a smart tourism app and a smart tourism information system that provide smart tourism services to tourists. We also developed a smart tourism chatbot service consisting of the khaiii morpheme analyzer, rule-based intention classification, and a tourism information knowledge base using the Neo4j graph database. In this paper, we develop the Korean and English smart tourism Named Entity (NE) datasets required for the development of the NER model using pre-trained language models (PLMs) for the smart tourism chatbot system. We create the tourism information NER datasets by collecting source data through the smart tourism app, the visitJeju web of Jeju Tourism Organization (JTO), and web search, and preprocessing it using Korean and English tourism information Named Entity dictionaries. We perform training on the KoBERT-CRF NER model using the developed Korean and English tourism information NER datasets. The weighted-average precision, recall, and f1 scores are 0.94, 0.92 and 0.94 on Korean and English tourism information NER datasets.
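
    For readers unfamiliar with the weighted-average scores reported above, the following is a minimal sketch of how such averages are commonly computed: per-label precision, recall, and F1 are weighted by each label's share of the true labels. It works on flat label sequences for simplicity; whether the authors scored at the token or entity level is not stated in the abstract, and the function name `weighted_scores` is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def weighted_scores(y_true: Sequence[str],
                    y_pred: Sequence[str]) -> Tuple[float, float, float]:
    """Support-weighted average of per-label precision, recall, and F1."""
    support = Counter(y_true)          # number of true items per label
    predicted = Counter(y_pred)        # number of predicted items per label
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        # True positives: positions where truth and prediction both equal label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        p_l = tp / predicted[label] if predicted[label] else 0.0
        r_l = tp / n_true
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = n_true / total        # label's share of the true labels
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1
```

    Because each label's F1 is averaged after being computed separately, the weighted F1 need not lie between the weighted precision and the weighted recall.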

A Deep Learning Approach for Classification of Cloud Image Patches on Small Datasets

  • Phung, Van Hiep;Rhee, Eun Joo
    • Journal of information and communication convergence engineering / v.16 no.3 / pp.173-178 / 2018
  • Accurate classification of cloud images is a challenging task. Almost all the existing methods rely on hand-crafted feature extraction. Their limitation is low discriminative power. In recent years, deep learning with convolutional neural networks (CNNs), which can automatically extract features, has achieved promising results in many computer vision and image understanding fields. However, deep learning approaches usually need large datasets. This paper proposes a deep learning approach for classification of cloud image patches on small datasets. First, we design a suitable deep learning model for small datasets using a CNN, and then we apply data augmentation and dropout regularization techniques to increase the generalization of the model. The experiments for the proposed approach were performed on the small SWIMCAT dataset with k-fold cross-validation. The experimental results demonstrated perfect classification accuracy for most classes on every fold, and confirmed both the high accuracy and the robustness of the proposed model.
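
    The k-fold cross-validation mentioned above can be sketched as the index split below. This is a generic illustration, not the authors' setup: the abstract does not give the value of k or say whether samples were shuffled or stratified, and the helper name `k_fold_indices` is hypothetical. The sketch splits indices in order without shuffling.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, validation) index pairs."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds
```

    Each sample appears in exactly one validation fold, so every image contributes to the reported per-fold accuracy once, which matters when the dataset is small.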

Synergic Effect of using the Optical and Radar Image Data for the Land Cover Classification in Coastal Region

  • Kim, Sun-Hwa;Lee, Kyu-Sung
    • Proceedings of the KSRS Conference / 2003.11a / pp.1030-1032 / 2003
  • This study aimed to analyze the effect of combining optical and radar imagery for land cover classification in a coastal region. The study area, the Gyeonggi Bay area, has one of the largest tidal ranges and undergoes frequent land cover changes due to several reclamations and rather intensive land uses. Ten land cover types were classified using several datasets combining Landsat ETM+ and RADARSAT imagery. The synergic effects of the merged datasets were analyzed by both visual interpretation and an ordinary supervised classification. The merged optical and SAR datasets provided better discrimination among the land cover classes in the coastal area. The overall classification accuracy of the merged datasets improved to 86.5%, compared with 78% when using ETM+ only.
