• Title/Summary/Keyword: Data cleansing

Extraction Transformation Transportation (ETT) system Design and implementation for extracting heterogeneous Data on Data Warehouse (데이터웨어하우스에서 이질적 형태를 가진 데이터의 추출을 위한 Extraction Transformation Transportation(ETT) 시스템 설계 및 구현)

  • 여성주;왕지남
    • Journal of Korean Society of Industrial and Systems Engineering / v.24 no.67 / pp.49-60 / 2001
  • A data warehouse (DW) manages all information in an enterprise and provides specific information to users. However, developing an effective DW system can be difficult because of the variety of computing facilities, databases, and operating systems involved. Heterogeneous system environments make it harder to extract data and to provide proper information to users in real time. Data inconsistency in non-integrated legacy systems is also common, which calls for effective and efficient data extraction flow control as well as data cleansing. We design an integrated, automatic ETT (Extraction Transformation Transportation) system to control the data extraction flow and suggest an implementation methodology. Detailed analysis and design are given to specify the proposed ETT approach, together with a real implementation.

Development of a Component-Based Chamois Data Cleansing Tool Suits (컴포넌트 기반 샤모아 데이터 정제 도구 개발)

  • 김은희;최병주
    • Proceedings of the Korean Information Science Society Conference / 2003.10b / pp.310-312 / 2003
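  • The Chamois Knowledge Engineering System extracts meaningful knowledge from large-volume data sources. In such a knowledge engineering system, guaranteeing the quality of the data sources is very important. This paper describes the structure and behavior of the data cleansing components in the Chamois Knowledge Engineering System. It also describes the functions and behavior of the component framework in which these components run. The implemented data cleansing components provide reliable data through cleansing in a component-based system, and can thereby improve the quality of the system being developed.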

Symbolizing Numbers to Improve Neural Machine Translation (숫자 기호화를 통한 신경기계번역 성능 향상)

  • Kang, Cheongwoong;Ro, Youngheon;Kim, Jisu;Choi, Heeyoul
    • Journal of Digital Contents Society / v.19 no.6 / pp.1161-1167 / 2018
  • The development of machine learning has enabled machines to perform delicate tasks that only humans could do, and many companies have therefore introduced machine learning based translators. Existing translators perform well overall, but they have problems translating numbers. They often mistranslate numbers when the input sentence includes a large number. Furthermore, the structure of the output sentence can change completely even if only one number in the input sentence changes. In this paper, we first optimized a neural machine translation model architecture that uses a bidirectional RNN, LSTM, and the attention mechanism, through data cleansing and by changing the dictionary size. We then implemented a number-processing algorithm specialized in number translation and applied it to the neural machine translation model to solve the problems above. The paper presents the data cleansing method, an optimal dictionary size, and the number-processing algorithm, as well as experimental results for translation performance based on the BLEU score.
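The abstract does not spell out the number-processing algorithm. As a rough illustration of what "symbolizing numbers" can mean, the sketch below replaces each number with an indexed placeholder before translation and restores it afterwards. The placeholder format `<NUMk>`, the regular expression, and the function names are assumptions made for this example, not the authors' method.

```python
import re

# Assumed number pattern: digits with optional "," or "." groups (e.g. 1,250,000 or 3.14).
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def symbolize(sentence):
    """Replace each number with an indexed placeholder token (hypothetical scheme)."""
    numbers = []

    def repl(match):
        numbers.append(match.group(0))
        return f"<NUM{len(numbers) - 1}>"

    return NUM_RE.sub(repl, sentence), numbers

def desymbolize(sentence, numbers):
    """Put the original numbers back in place of the placeholders."""
    return re.sub(r"<NUM(\d+)>", lambda m: numbers[int(m.group(1))], sentence)

original = "The company sold 1,250,000 units in 2018."
masked, nums = symbolize(original)
restored = desymbolize(masked, nums)
```

In a real pipeline the translation model would run on `masked`, and `desymbolize` would be applied to its output, so the model never has to generate the digits itself.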

An Integrated Framework for Data Quality Management of Traffic Data Warehouses (고품질 데이터를 지원하는 교통데이터 웨어하우스 구축 기법)

  • Hwang, Jae-Il;Park, Seung-Yong;Nah, Yun-Mook
    • Journal of Korea Spatial Information System Society / v.10 no.4 / pp.89-95 / 2008
  • In this paper, we propose integrated techniques for managing data quality in traffic data warehousing environments. We describe how to collect data and construct traffic data warehouses from operational databases such as FTMS and ARTIS. We explain how to configure the traffic data warehouses efficiently. We also propose quality management techniques that provide high-quality traffic data for various analytical transactions. The proposed techniques can contribute to providing high-quality traffic data to traffic-related users and researchers, thus reducing data preprocessing and evaluation costs.

Implementation of a data collection system for big data analysis and learning based on infant body temperature data (영유아 체온 데이터 기반 빅데이터 분석 및 학습을 위한 데이터 수집 시스템 구현)

  • Lee, Hyoun-Sup;Heo, Gyeongyong
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.577-578 / 2021
  • Artificial intelligence systems are now used in various fields. The accuracy of an AI decision algorithm is greatly affected by the amount of training and the accuracy of the training data. Because the amount of training has a decisive effect on AI performance, a large amount of data is required. In this paper, we propose a data collection system for building a system that analyzes future conditions and changes in infants' conditions based on the body temperature data of infants and toddlers. The proposed system collects and transmits data, and it is believed that it can minimize the resource consumption of the server system in existing big data analysis and training data construction.

Development and Comparison of Data Mining-based Prediction Models of Building Fire Probability

  • Hong, Sung-gwan;Jeong, Seung Ryul
    • Journal of Internet Computing and Services / v.19 no.6 / pp.101-112 / 2018
  • Considerable manpower and budget are devoted to fire prevention, yet only a small portion of the data generated in the process is used for disaster prevention activities. This study develops a data mining based prediction model of fire occurrence probability in order to use these data more actively for disaster prevention. To this end, variables for predicting the fire occurrence probability of various buildings were selected, data from the construction administrative system, the national fire information system, and the Korea Fire Insurance Association were collected, and an integrated data set was constructed. After appropriate data cleansing and preprocessing, data mining methodologies such as artificial neural networks, decision trees, SVM, and Naive Bayes were used to develop prediction models of the fire occurrence probability of buildings. The most accurate of the derived models is the Linear SVM model, which achieves 68.42% on the experimental data and 63.54% on the verification data, making it the best model for predicting the fire occurrence probability of buildings. Because this study develops prediction models using only set values within specific ranges, future studies may explore more opportunities to use setting values not covered here.

A study on the implementation of infection control at dental offices (치과 진료실 감염방지 실천에 관한 연구)

  • Woo, Seung-Hee;Kwag, Jung-Suk;Ju, On-Ju;Lim, Kun-Ok
    • Journal of Korean society of Dental Hygiene / v.9 no.3 / pp.282-293 / 2009
  • The purpose of this study was to examine the degree of infection control implemented at dental offices and the factors affecting it, in an attempt to help promote the health of dental health care workers. The subjects were 180 medical personnel who worked at dental offices in South Jeolla Province. A self-administered survey was conducted from April 1 to May 30, 2008, and the collected data were analyzed. The findings of the study were as follows: 1. As for the implementation of infection control at the dental offices, the practices most often followed by the surveyed health care workers were post-treatment hand washing(95.0), constant separation of infectious wastes(94.4), wearing rubber gloves at all times during medical instrument cleansing(92.8) and pre-treatment hand washing(91.7). 2. The practices least often followed by the dental personnel were drying their hands with air(5.0), wearing goggles during treatment(23.3), receiving regular education on infection control(26.7) and putting sterilizers through a performance test on a regular basis(43.9). 3. The management of contagious diseases differed significantly according to the age of the dental health care workers(p=0.005). Career made a significant difference to the management of contagious diseases(p=0.000) and instrument cleansing/sterilization(p=0.043). The service area made a significant difference to wearing and managing personal protective clothes(p=0.040) and waste management(p=0.040). 4. Concerning the relationship between holding a dental hygienist certificate and the practice of infection control, whether the dental health care workers were certified or not made no significant difference. 5. As to the correlation among the factors affecting the prevention and management of contagious diseases, there was a positive correlation among hand washing(r=0.379), wearing and managing personal protective clothes(r=0.349), instrument cleansing/sterilization(r=0.323) and waste management(r=0.388). All the factors made a statistically significant difference to the prevention and management of contagious diseases(p<0.01).

Implementation of Policy based In-depth Searching for Identical Entities and Cleansing System in LOD Cloud (LOD 클라우드에서의 연결정책 기반 동일개체 심층검색 및 정제 시스템 구현)

  • Kim, Kwangmin;Sohn, Yonglak
    • Journal of Internet Computing and Services / v.19 no.3 / pp.67-77 / 2018
  • This paper proposes that each LOD establish its own link policy and publish it to the LOD cloud in order to provide identity among entities in different LODs. To specify the link policy, we also propose a vocabulary set founded on the RDF model. We implemented the Policy based In-depth Searching and Cleansing (PISC for short) system, which performs in-depth searching across LODs by referencing the link policies. PISC has been published on Github. Because LODs participate in the LOD cloud voluntarily, the degree of entity identity needs to be evaluated. PISC therefore evaluates the identities and cleanses the searched entities, keeping only those that exceed the user's criterion for entity identity level. As search results, PISC provides each entity's detailed contents, collected from diverse LODs, along with an ontology customized to the content. A simulation of PISC was performed on DBpedia's 5 LODs. We found that a similarity of 0.9 between the objects of source and target RDF triples provided an appropriate expansion ratio and inclusion ratio of the search result. For sufficient identity of the searched entities, 3 or more target LODs need to be specified in the link policy.
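The thresholding step mentioned in the abstract can be illustrated with a small sketch: candidate entities are kept only when the object of their RDF triple is at least 0.9 similar to the source object. The abstract does not say which similarity measure PISC uses, so `difflib.SequenceMatcher` stands in here as an assumption, and the function names and example identifiers are hypothetical.

```python
from difflib import SequenceMatcher

def object_similarity(a, b):
    """Stand-in string similarity in [0, 1]; the measure used by PISC is not given in the abstract."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_identical(source_obj, candidates, threshold=0.9):
    """Keep (entity, object) pairs whose object is similar enough to the source object."""
    return [
        (entity, obj)
        for entity, obj in candidates
        if object_similarity(source_obj, obj) >= threshold
    ]

# Hypothetical candidates found in target LODs.
candidates = [("ex:a", "Seoul"), ("ex:b", "Busan")]
kept = filter_identical("Seoul", candidates)
```

Raising the threshold keeps fewer but more reliable matches; lowering it expands the result at the cost of admitting entities that may not be identical.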

Design of data cleansing system based on XMDR for Datawarehouse (데이터웨어하우스를 위한 XMDR 기반의 데이터 정제시스템 설계)

  • Song, Hong-Youl;Ayush, Tsend;Jung, Kye-Dong;Choi, Young-Keun
    • Proceedings of the Korea Information Processing Society Conference / 2010.04a / pp.180-182 / 2010
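  • Data warehouses are used to decide corporate policy. However, when a new system is added, data integration requires considerable cost and time because of the various heterogeneous characteristics between systems. To resolve this, heterogeneity in data structure and in data representation is handled by using XMDR (eXtended Master Data Registry) to generate an abstracted query and then splitting the query to fit XMDR. In particular, this paper uses XMDR to minimize the impact on local systems when integrating distributed systems, and provides standardized information for data integration in a distributed environment so that data warehouse information can be generated in real time. In addition, data are integrated without changing the existing systems, which reduces cost and time, and consistent real-time information is generated through real-time data extraction and cleansing, which improves the quality of the information.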

The Effect of Bowel Preparation Convergence Program for Colonoscopy (대장내시경 전처치 융합관리프로그램의 효과)

  • Kang, Won-Suk;Kim, Ju-Sung
    • Journal of the Korea Convergence Society / v.9 no.1 / pp.473-483 / 2018
  • The purpose of this study was to identify the effects of a bowel preparation convergence program for colonoscopy. The study used a nonequivalent control group pretest-posttest design. A sample of 75 clients who were scheduled for colonoscopy was included. The experimental group was given the bowel preparation convergence program, which included audiovisual education, walking exercise and telephone counseling. The data were collected using a structured questionnaire and colonoscopy monitoring and were analyzed using the SPSS 21.0 program. The experimental group reported significantly higher compliance with taking bowel preparation agents and higher test satisfaction(p=.002; p=.001), and lower test difficulty and test discomfort(p=.002; p=.001), than the control group. The groups differed significantly in level of bowel cleansing and test time required, but not in compliance with diet restriction(p<.001; p=.001; p=.108). These findings indicate that the bowel preparation convergence program can be an effective nursing intervention for colonoscopy. Convergence interventions for diagnostic tests need to be developed in clinical practice.