Title/Summary/Keyword: Standard Dataset

A Study on Data Quality Evaluation of Administrative Information Dataset (행정정보데이터세트의 데이터 품질평가 연구)

  • Song, Chiho; Yim, Jinhee
    • The Korean Journal of Archival Studies / no.71 / pp.237-272 / 2022
  • In 2019, a pilot project to establish a records management system for administrative information datasets began in earnest under the leadership of the National Archives of Korea. Based on the results of the three-year project ending in 2021, an improved management plan for administrative information datasets will be reflected in public records laws and guidelines, making such datasets the target of full-scale public records management. Although public records have been converted to electronic documents and even the datasets of administrative information systems are now included in full-scale records management, research on the quality requirements of the data itself, the raw material of which records are composed, is still lacking. If data quality is not guaranteed, all four properties of records (authenticity, reliability, integrity, and usability) are threatened in a dataset, which is both a structure of data and an aggregate of records. Moreover, if the data quality of administrative information systems, built around the various needs of an institution's working departments without regard to the standards of a standard records management system, cannot be trusted, the reliability of the public records themselves cannot be secured. This study builds on the administrative information dataset management plan presented in the "Administrative Information Dataset Recorded Information Service and Utilization Model Study" conducted by the National Archives of Korea in 2021. By referring to related materials, in particular the public data policies and guides being promoted across the government, we derive quality evaluation requirements from a records management perspective and present specific indicators. We expect these to be helpful for the records management of administrative information datasets, which will begin in full in the future.
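
To make the notion of concrete quality indicators tangible, here is a minimal sketch, not taken from the paper, of how two common indicators (completeness and format validity) might be computed for a tabular administrative dataset; the column names and the date-format rule are hypothetical.

```python
# Minimal sketch of dataset-level quality indicators (completeness, validity).
# Column names and the date-format rule are hypothetical illustrations, not
# the indicators proposed in the paper.
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Share of non-null cells over all cells."""
    return 1.0 - df.isna().sum().sum() / df.size

def validity(series: pd.Series, pattern: str) -> float:
    """Share of non-null values matching a format rule (regex)."""
    values = series.dropna().astype(str)
    return values.str.fullmatch(pattern).mean() if len(values) else 1.0

df = pd.DataFrame({
    "record_id": ["R-001", "R-002", "R-003"],
    "created":   ["2021-03-02", "2021/03/05", None],  # one bad format, one missing
})

print(f"completeness: {completeness(df):.2f}")
print(f"date validity: {validity(df['created'], r'\d{4}-\d{2}-\d{2}'):.2f}")
```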

Comparison of Estimation Methods in NONMEM 7.2: Application to a Real Clinical Trial Dataset (실제 임상 데이터를 이용한 NONMEM 7.2에 도입된 추정법 비교 연구)

  • Yun, Hwi-Yeol; Chae, Jung-Woo; Kwon, Kwang-Il
    • Korean Journal of Clinical Pharmacy / v.23 no.2 / pp.137-141 / 2013
  • Purpose: This study compared the performance of the new NONMEM estimation methods using a population analysis dataset from a clinical study comprising 40 individuals and 567 observations after a single oral dose of glimepiride. Method: The NONMEM 7.2 estimation methods tested were first-order conditional estimation with interaction (FOCEI), importance sampling (IMP), importance sampling assisted by mode a posteriori (IMPMAP), iterative two-stage (ITS), stochastic approximation expectation-maximization (SAEM), and Markov chain Monte Carlo Bayesian (BAYES), using a two-compartment open model. Results: The parameters estimated by IMP, IMPMAP, ITS, SAEM, and BAYES were similar to those estimated with FOCEI, and the objective function value (OFV) used to diagnose model criteria was significantly lower for FOCEI, IMPMAP, SAEM, and BAYES than for IMP. Parameter precision, in terms of estimated standard errors, was good with FOCEI, IMP, IMPMAP, and BAYES. The model run time was shortest with BAYES. Conclusion: The new estimation methods in NONMEM 7.2 performed similarly in terms of parameter estimation, but in terms of parameter precision and run time, BAYES was the most suitable for analyzing this dataset.
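
The comparison itself reduces to tabulating, per estimation method, the OFV, the parameter standard errors, and the run time, then ranking the methods. A minimal sketch of that bookkeeping step follows; every number in it is a placeholder, not a result from the paper.

```python
# Sketch of the comparison workflow: tabulate objective function value (OFV),
# mean relative standard error (RSE), and run time per NONMEM estimation
# method. All numbers below are placeholders, NOT the paper's results.
import pandas as pd

runs = pd.DataFrame(
    [
        # method,  OFV,    mean RSE (%), run time (s)
        ("FOCEI",  1000.0, 25.0, 120.0),
        ("IMP",    1040.0, 28.0, 300.0),
        ("IMPMAP", 1002.0, 26.0, 280.0),
        ("ITS",    1015.0, 35.0,  90.0),
        ("SAEM",   1001.0, 30.0, 400.0),
        ("BAYES",   999.0, 24.0,  60.0),
    ],
    columns=["method", "ofv", "mean_rse_pct", "runtime_s"],
)

# Rank as in the study: lower OFV, lower RSE, and shorter run time are better.
print(runs.sort_values(["ofv", "mean_rse_pct", "runtime_s"]).to_string(index=False))
```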

A biomedically oriented automatically annotated Twitter COVID-19 dataset

  • Hernandez, Luis Alberto Robles; Callahan, Tiffany J.; Banda, Juan M.
    • Genomics & Informatics / v.19 no.3 / pp.21.1-21.5 / 2021
  • The use of social media data, such as Twitter, for biomedical research has been gradually increasing over the years. With the coronavirus disease 2019 (COVID-19) pandemic, researchers have turned to non-traditional sources of clinical data to characterize the disease in near-real time and to study the societal implications of interventions, as well as the sequelae that recovered COVID-19 cases present. However, manually curated social media datasets are difficult to come by due to the high cost of manual annotation and the effort needed to identify the relevant texts. When datasets are available, they are usually very small, and their annotations do not generalize well over time or to larger sets of documents. As part of the 2021 Biomedical Linked Annotation Hackathon, we release our dataset of over 120 million automatically annotated tweets for biomedical research purposes. Incorporating best practices, we identify tweets with potentially high clinical relevance. We evaluated our work by comparing several spaCy-based annotation frameworks against a manually annotated gold-standard dataset. After selecting the best method for automatic annotation, we annotated 120 million tweets and released them publicly for future downstream use within the biomedical domain.
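
Since the paper's annotation frameworks are spaCy-based, a minimal sketch of the core step, running a spaCy pipeline over tweet text and collecting entity spans as annotations, looks like this; the model name and example tweets are illustrative, and the gold-standard comparison is not reproduced here.

```python
# Minimal sketch of spaCy-based automatic annotation of tweet text.
# The model and tweets are illustrative stand-ins; the paper compared several
# spaCy-based frameworks against a manually annotated gold standard.
import spacy

nlp = spacy.load("en_core_web_sm")  # a biomedical pipeline could be swapped in

tweets = [
    "Lost my sense of smell on day 3, still coughing two weeks later.",
    "Second dose tomorrow, fingers crossed for mild side effects.",
]

for doc in nlp.pipe(tweets):
    # Each entity span becomes one annotation: (surface text, label, offsets).
    annotations = [(ent.text, ent.label_, ent.start_char, ent.end_char)
                   for ent in doc.ents]
    print(annotations)
```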

Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method

  • Lee, Sungyoung; Park, Taesung
    • Genomics & Informatics / v.16 no.4 / pp.39.1-39.3 / 2018
  • The rapid increase in the volume of genetic datasets has demanded extensive adoption of biological knowledge to reduce computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we previously introduced PHARAOH, a novel statistical method that enables pathway-based association studies of large-scale genetic datasets. However, researcher-level application of the PHARAOH method has been limited by its lack of support for commonly used file formats and by the absence of the various quality control options that are essential to practical analysis. To overcome these limitations, we introduce our integration of the PHARAOH method into our recently developed all-in-one workbench. The new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality control measures and filters based on those measures. We expect the updated PHARAOH to make pathway-level analysis of large-scale genetic datasets more accessible to researchers.
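
The abstract does not name the specific quality control measures, but genotype QC conventionally filters variants on per-variant call rate and minor allele frequency; a minimal sketch under that assumption:

```python
# Illustrative genotype QC sketch: filter variants by call rate and minor
# allele frequency (MAF). These are typical QC measures; the paper's actual
# filter set is not specified in the abstract.
import numpy as np

rng = np.random.default_rng(0)
# Genotype matrix: rows = samples, cols = variants; 0/1/2 = allele counts, -1 = missing.
geno = rng.choice([0, 1, 2, -1], size=(100, 50), p=[0.45, 0.3, 0.2, 0.05])

observed = geno >= 0
call_rate = observed.mean(axis=0)  # fraction genotyped per variant
alt_freq = np.where(geno > 0, geno, 0).sum(axis=0) / (2 * observed.sum(axis=0))
maf = np.minimum(alt_freq, 1 - alt_freq)  # minor allele frequency

keep = (call_rate >= 0.95) & (maf >= 0.05)  # hypothetical thresholds
print(f"variants kept: {keep.sum()} / {geno.shape[1]}")
```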

AraProdMatch: A Machine Learning Approach for Product Matching in E-Commerce

  • Alabdullatif, Aisha; Aloud, Monira
    • International Journal of Computer Science & Network Security / v.21 no.4 / pp.214-222 / 2021
  • Recently, the growth of e-commerce in Saudi Arabia has been exponential, bringing remarkable new challenges. An approach to product matching and categorization is needed to help consumers choose the right store from which to purchase a product. This paper presents a machine learning approach to product matching that combines deep learning techniques with standard artificial neural networks (ANNs). Existing methods focus on matching structured product data, whereas our model compares products based on their unstructured descriptions. We evaluated our model on an electronics dataset from three business-to-consumer (B2C) online stores, combining the matched products into one dataset. The performance evaluation, based on k-means classifier prediction over the three real-world online stores, demonstrates that the proposed algorithm outperforms the benchmarked approach by 80% on average F1-measure.
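
As a rough illustration of the general technique (not the AraProdMatch architecture itself), matching unstructured descriptions can be sketched as clustering TF-IDF representations with k-means, so that listings of the same product from different stores land in the same cluster:

```python
# Minimal product-matching sketch: cluster unstructured product descriptions
# with TF-IDF features and k-means. Illustrative only; this is not the
# paper's deep-learning-based AraProdMatch model.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Samsung Galaxy S21 128GB phantom gray smartphone",  # store A
    "Galaxy S21 Samsung 128 GB smartphone, gray",        # store B
    "Apple iPhone 12 64GB black",                        # store A
    "iPhone 12 Apple 64 GB, black",                      # store C
]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(descriptions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for desc, label in zip(descriptions, labels):
    print(label, desc)  # same label => treated as the same product
```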

The Design of Polynomial RBF Neural Network by Means of Fuzzy Inference System and Its Optimization (퍼지추론 기반 다항식 RBF 뉴럴 네트워크의 설계 및 최적화)

  • Baek, Jin-Yeol; Park, Byaung-Jun; Oh, Sung-Kwun
    • The Transactions of The Korean Institute of Electrical Engineers / v.58 no.2 / pp.399-406 / 2009
  • In this study, a polynomial radial basis function neural network (pRBFNN) based on a fuzzy inference system is designed, and its parameters, such as the learning rate, momentum coefficient, and distribution weight (the width of the RBF), are optimized by means of particle swarm optimization. The proposed model can be expressed as three functional modules, consisting of a condition part, a conclusion part, and an inference part, from the viewpoint of a fuzzy rule in 'if-then' form. In the condition part of the pRBFNN, viewed as a fuzzy rule, the input space is partitioned by defining kernel functions (RBFs). Here, the structure of the kernel functions, that is, the RBFs, is generated by the HCM clustering algorithm. We use Gaussian and inverse multiquadratic functions as RBFs; in addition to these types, a conic RBF is proposed and used as a kernel function. To reflect the characteristics of the dataset when partitioning the input space, we define the width of each RBF from the standard deviation of the dataset. In the conclusion part, the connection weights of the pRBFNN are represented as polynomials, an extended structure of the general RBF neural network, whose connection weights are constants. Finally, the output of the model is decided by the fuzzy inference of the inference part of the pRBFNN. To evaluate the proposed model, a nonlinear function with two inputs, a wastewater dataset, and the gas furnace time series dataset are used, and the results of the pRBFNN are compared with previous models. Both approximation and generalization abilities are discussed with these results.
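
A minimal sketch of the three kernel types named in the abstract, with the width taken from the dataset's standard deviation, follows. The Gaussian and inverse multiquadratic forms are standard; the abstract does not give the paper's exact conic form, so a simple cone-shaped kernel is used as a stand-in.

```python
# Sketch of the three RBF kernel types mentioned in the abstract, with the
# kernel width taken from the standard deviation of the dataset. The cone-
# shaped kernel is an assumed stand-in for the paper's conic RBF.
import numpy as np

def gaussian(r, sigma):
    return np.exp(-(r ** 2) / (2 * sigma ** 2))

def inverse_multiquadratic(r, sigma):
    return 1.0 / np.sqrt(r ** 2 + sigma ** 2)

def conic(r, sigma):
    return np.maximum(0.0, 1.0 - r / sigma)  # assumed cone shape

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # toy 2-input dataset
center = X.mean(axis=0)                 # a cluster center (HCM would supply these)
sigma = X.std()                         # width from the dataset's spread

r = np.linalg.norm(X - center, axis=1)  # distance of each sample to the center
for kernel in (gaussian, inverse_multiquadratic, conic):
    print(kernel.__name__, kernel(r, sigma)[:3])
```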

VS Prediction Model Using SPT-N Values and Soil Layers in South Korea (표준관입시험 및 시추공 정보를 이용한 국내 전단파속도 예측)

  • Heo, Gi-Seok; Kwak, Dong-Youp
    • Journal of the Korean Geotechnical Society / v.38 no.8 / pp.53-66 / 2022
  • The national ground survey database of Korea (GeoInfo) distributes a large amount of ground survey data nationwide. Many standard penetration test (SPT) results exist in this database; however, the number of shear wave velocity (VS) measurements is small. Hence, to use the abundant SPT N values to predict VS, this study proposes a new empirical N-VS relationship model built from GeoInfo data. The proposed N-VS model is a single equation regardless of geological layer type; the layer type only specifies the upper limit of VS. To validate the proposed model, residual analysis was performed on a test dataset that was not used for model development, and it showed that the proposed model performs better than N-VS models from previous studies. Since the N-VS model was developed using ample data from GeoInfo, we expect it to be the model best suited to the GeoInfo dataset for VS prediction.
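
Empirical N-VS models in the literature commonly take the power-law form VS = a·N^b; the paper's exact functional form, coefficients, and layer-wise VS caps are not given in the abstract, so the fitting-and-validation sketch below is purely illustrative, with synthetic data.

```python
# Illustrative sketch of fitting a power-law N-VS relationship and validating
# it by residual analysis on a held-out test set. Synthetic data only; the
# paper's actual model form and coefficients are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
N = rng.uniform(2, 50, size=300)                              # SPT blow counts
vs = 80.0 * N ** 0.33 * rng.lognormal(0, 0.15, N.size)        # synthetic VS (m/s)

# Fit log(VS) = log(a) + b*log(N) by ordinary least squares.
b, log_a = np.polyfit(np.log(N), np.log(vs), deg=1)
print(f"fitted: VS = {np.exp(log_a):.1f} * N^{b:.2f}")

# Residual analysis on a held-out test set, as in the study's validation step.
N_test = rng.uniform(2, 50, size=100)
vs_test = 80.0 * N_test ** 0.33 * rng.lognormal(0, 0.15, N_test.size)
residuals = np.log(vs_test) - (log_a + b * np.log(N_test))
print(f"test residuals: mean={residuals.mean():.3f}, std={residuals.std():.3f}")
```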

A Study of the Standard Structure for the Social Disaster and Safety Incidents Data (사회재난 및 안전사고 데이터 분석을 위한 표준 구조 연구)

  • Lee, Chang Yeol; Kim, Taehwan
    • Journal of the Society of Disaster Information / v.17 no.4 / pp.817-828 / 2021
  • Purpose: In this paper, we propose a common dataset structure that includes incident investigation information and feature data for machine learning. Most of the data comes from incident investigation reports of government bodies and is restricted to the social disaster and safety areas. Method: First, we extract basic incident data from several incident investigation reports. The data include the cause, damage, date, and classification of the incidents, and additionally consider feature data for machine learning. All data are represented in standard XML notation. Result: We define a standard XML schema and an example instance for the incident investigation information. Conclusion: We define a common incident dataset structure for machine learning. It may serve as common infrastructure for disaster and safety application areas.
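
As a rough illustration of what one record under such a schema might look like, here is a sketch that builds an XML incident entry covering the fields the abstract names (cause, damage, date, classification) plus a feature block; the element and attribute names are hypothetical, since the paper defines its own standard schema.

```python
# Illustrative sketch of an incident record in XML. Element and attribute
# names are hypothetical stand-ins for the paper's standard XML schema.
import xml.etree.ElementTree as ET

incident = ET.Element("incident", id="2021-0001")
ET.SubElement(incident, "classification").text = "social-disaster/fire"
ET.SubElement(incident, "date").text = "2021-04-12"
ET.SubElement(incident, "cause").text = "electrical short circuit"

damage = ET.SubElement(incident, "damage")
ET.SubElement(damage, "casualties").text = "2"
ET.SubElement(damage, "propertyLossKRW").text = "150000000"

features = ET.SubElement(incident, "features")  # feature data for ML
ET.SubElement(features, "feature", name="buildingAge").text = "27"
ET.SubElement(features, "feature", name="sprinklerInstalled").text = "false"

print(ET.tostring(incident, encoding="unicode"))
```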

Proposal of Standardization Plan for Defense Unstructured Datasets based on Unstructured Dataset Standard Format (비정형 데이터셋 표준포맷 기반 국방 비정형 데이터셋 표준화 방안 제안)

  • Yun-Young Hwang; Jiseong Son
    • Journal of Internet Computing and Services / v.25 no.1 / pp.189-198 / 2024
  • AI is accepted not only in the private sector but also in the defense sector as a cutting-edge technology that must be introduced for the development of national defense. In particular, artificial intelligence has been selected as a key task in defense science and technology innovation, and the importance of data is increasing. As the defense ministry shifts from a closed data policy to data sharing and activation, efforts are being made to secure the high-quality data necessary for defense development. In particular, a review of the business budget system is under way so that related procedures can reflect the unique characteristics of AI and big data, and so that research and development can begin with sufficiently large quantities of high-quality data. However, while standardization and quality criteria need to be established at the defense level for both structured and unstructured data, the defense ministry has so far proposed them only for structured data, and this gap needs to be filled. In this paper, we propose an unstructured dataset standard format for defense unstructured datasets, which are the datasets most needed in defense artificial intelligence, and, based on this format, we propose a standardization method for defense unstructured datasets.
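
The abstract does not spell out the fields of the proposed standard format. As a rough illustration of what a metadata record for one unstructured item could contain, here is a sketch with entirely hypothetical field names; it is not the paper's format.

```python
# Illustrative metadata record for one unstructured dataset item (e.g., an
# image with its annotation). Field names are hypothetical stand-ins for the
# paper's proposed standard format.
import json
from dataclasses import asdict, dataclass, field

@dataclass
class UnstructuredItem:
    file_name: str
    media_type: str            # e.g., image/jpeg, audio/wav, text/plain
    checksum_sha256: str       # integrity check for quality management
    collected_at: str          # ISO 8601 timestamp
    labels: list = field(default_factory=list)  # task-specific annotations
    source: str = "unspecified"                 # provenance for quality control

item = UnstructuredItem(
    file_name="sample_0001.jpg",
    media_type="image/jpeg",
    checksum_sha256="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    collected_at="2023-07-01T09:30:00+09:00",
    labels=[{"task": "object-detection", "class": "vehicle", "bbox": [10, 20, 120, 80]}],
)

print(json.dumps(asdict(item), indent=2))
```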

Machine Reading Comprehension-based Question and Answering System for Search and Analysis of Safety Standards (안전기준의 검색과 분석을 위한 기계독해 기반 질의응답 시스템)

  • Kim, Minho; Cho, Sanghyun; Park, Dugkeun; Kwon, Hyuk-Chul
    • Journal of Korea Multimedia Society / v.23 no.2 / pp.351-360 / 2020
  • If various unreasonable safety standards are preemptively and effectively readjusted, the risk of accidents can be reduced. In this paper, we propose a machine reading comprehension-based safety standard question answering (Q&A) system to provide supporting technology for the effective search and analysis of safety standards, toward their integrated and systematic management. The proposed model first finds documents related to a safety standard question in the various laws and regulations, and then divides those documents into provisions. Only the provisions likely to contain the answer are selected, and a BERT-based machine reading comprehension model then extracts answers to the safety standard question. When the proposed system is applied to the KorQuAD dataset, it achieves an exact match (EM) score of 40.42% and an F1 score of 55.34%.
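
The final extraction step of such a pipeline can be sketched with an off-the-shelf extractive reading comprehension model; the checkpoint below is an assumed multilingual stand-in, the example provision is invented, and the paper's retrieval and provision-splitting stages, as well as its own Korean BERT-based model, are omitted.

```python
# Minimal extractive QA sketch with a BERT-style reading comprehension model
# from Hugging Face transformers. The model name is an assumed stand-in; the
# paper trains its own Korean BERT-based MRC model.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # assumed multilingual QA checkpoint
)

provision = (
    "Handrails on stairs in public buildings shall be installed at a height "
    "of 85 cm or more, measured from the surface of the tread."
)
result = qa(question="At what height must stair handrails be installed?",
            context=provision)
print(result["answer"], result["score"])
```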