• Title/Summary/Keyword: Structured Data


A Study on Satisfaction Survey Based on Regression Analysis to Improve Curriculum for Big Data Education (빅데이터 양성 교육 교과과정 개선을 위한 회귀분석 기반의 만족도 조사에 관한 연구)

  • Choi, Hyun
    • Journal of the Korean Society of Industry Convergence / v.22 no.6 / pp.749-756 / 2019
  • Big data refers to structured and unstructured data whose sheer volume makes it difficult to collect, store, and process. Many institutions, including universities, are building convergence programs to foster talent for data science and AI, but there is very little research on what kind of education is actually needed and what kind of education students demand. Therefore, in this paper, to improve the curriculum by grasping the satisfaction and demands of the participants in the "2019 Big Data Youth Talent Training Course" held at K University, we conducted a correlation analysis based on a questionnaire covering a basic survey and the courses, and then performed a regression analysis. The results show that the higher the participants' satisfaction with the classes, with job connection, and with self-development, the more positively they evaluated the program's efficiency.
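
The workflow described above (correlation analysis across questionnaire items, followed by regression) can be sketched in a few lines. A minimal sketch, assuming hypothetical Likert-scale items; the column names and scores are illustrative stand-ins, not the study's data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical survey responses: one row per participant,
# items scored on a 5-point Likert scale.
df = pd.DataFrame({
    "class_satisfaction": [4, 5, 3, 4, 5, 2, 4, 3, 5, 4],
    "job_connection":     [4, 5, 2, 4, 4, 2, 3, 3, 5, 4],
    "self_development":   [5, 5, 3, 4, 5, 1, 4, 2, 5, 3],
    "program_efficiency": [4, 5, 3, 4, 5, 2, 4, 3, 5, 4],
})

# Step 1: correlation analysis across questionnaire items.
print(df.corr().round(2))

# Step 2: regress perceived program efficiency on the satisfaction items.
X = sm.add_constant(df[["class_satisfaction", "job_connection", "self_development"]])
model = sm.OLS(df["program_efficiency"], X).fit()
print(model.summary())
```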

A MVC Framework for Visualizing Text Data (텍스트 데이터 시각화를 위한 MVC 프레임워크)

  • Choi, Kwang Sun;Jeong, Kyo Sung;Kim, Soo Dong
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.39-58 / 2014
  • As the importance of big data and related technologies continues to grow in industry, visualizing the results of big data processing and analysis has become a highlighted concern. Visualization delivers effectiveness and clarity to people interpreting analysis results, and it also acts as the GUI (Graphical User Interface) that supports communication between people and analysis systems. To make development and maintenance easier, these GUI parts should be loosely coupled from the parts that process and analyze data, and implementing such a loosely coupled architecture requires design patterns such as MVC (Model-View-Controller), which is designed to minimize coupling between the UI part and the data-processing part. Big data can be classified as structured or unstructured, and structured data is relatively easy to visualize compared to unstructured data. Even so, as analyzing unstructured data has spread, people usually develop a visualization system separately for each project in order to overcome the limitations of traditional visualization systems designed for structured data. For text data, which covers a huge share of unstructured data, visualization is even more difficult. This results from the complexity of technologies for analyzing text data, such as linguistic analysis, text mining, and social network analysis, and from the fact that these technologies are not standardized, which makes it more difficult to reuse one project's visualization system in other projects. We assume the reason is a lack of commonality in visualization-system design that considers expansion to other systems. In our research, we suggest a common information model for visualizing text data and propose a comprehensive, reusable framework, TexVizu, for visualizing text data. We first survey representative research in the text visualization field, identify common elements for text visualization and common patterns among its various cases, and review and analyze these elements and patterns from three viewpoints: structural, interactive, and semantic. We then design an integrated model of text data that represents the elements needed for visualization. The structural viewpoint identifies structural elements of text documents, such as title, author, and body. The interactive viewpoint identifies the types of relations and interactions between text documents, such as post, comment, and reply. The semantic viewpoint identifies semantic elements that are extracted by analyzing text data linguistically and are represented as tags classifying entity types such as people, place or location, time, and event. We then extract and choose common requirements for visualizing text data, categorized into four types: structure information, content information, relation information, and trend information. Each type of requirement comprises the required visualization techniques, the data, and the goal (what to know). These are the common, key requirements for designing a framework in which the visualization system remains loosely coupled from the data-processing and analysis systems.
Finally, we designed TexVizu, a common text visualization framework that is reusable and extensible across visualization projects. It collaborates with various Text Data Loaders and Analytical Text Data Visualizers via common interfaces such as ITextDataLoader and IATDProvider, and it comprises an Analytical Text Data Model, Analytical Text Data Storage, and an Analytical Text Data Controller. In this framework, the external components are given as specifications of the interfaces required to collaborate with the framework. As an experiment, we adopted this framework in two text visualization systems: a social opinion mining system and an online news analysis system.
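
The loose coupling described here can be illustrated with the two interface names the abstract mentions, ITextDataLoader and IATDProvider. A minimal Python sketch; the method signatures are assumptions for illustration, not the paper's actual specification:

```python
from abc import ABC, abstractmethod
from typing import Any

# Interfaces named after those in the abstract (ITextDataLoader, IATDProvider);
# the method signatures here are illustrative assumptions, not the paper's spec.
class ITextDataLoader(ABC):
    @abstractmethod
    def load(self, source: str) -> list[dict[str, Any]]:
        """Fetch raw text documents (e.g. posts, news articles) from a source."""

class IATDProvider(ABC):
    @abstractmethod
    def analyze(self, documents: list[dict[str, Any]]) -> dict[str, Any]:
        """Turn raw documents into an analytical text data model for the view."""

class TexVizuController:
    """Controller in the MVC sense: wires the loader (model side) to the
    analytical-data provider, keeping UI code independent of analysis code."""
    def __init__(self, loader: ITextDataLoader, provider: IATDProvider):
        self.loader = loader
        self.provider = provider

    def render(self, source: str) -> dict[str, Any]:
        docs = self.loader.load(source)    # model: data acquisition
        return self.provider.analyze(docs) # view-ready analytical data
```

Because the controller depends only on the two interfaces, a project can swap in a different loader or visualizer without touching the rest of the framework, which is the reuse property the abstract claims.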

Development and Validation of Exposure Models for Construction Industry: Tier 1 Model (건설업 유해화학물질 노출 모델의 개발 및 검증: Tier-1 노출 모델)

  • Kim, Seung Won;Jang, Jiyoung;Kim, Gab Bae
    • Journal of Korean Society of Occupational and Environmental Hygiene / v.24 no.2 / pp.208-218 / 2014
  • Objectives: The major objective of this study was to develop and validate a tier 1 exposure model that uses worker exposure monitoring data and the characteristics of worker activities routinely performed at construction sites, in order to estimate worker exposures without sampling. Methods: The Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) system of the European Union (EU) allows exposure models to be used for anticipating the chemical exposure of manufacturing workers and consumers, and several such models have been developed, including the Advanced REACH Tool (ART). The ART model is based on a structured subjective assessment model, and we developed our tier 1 exposure model within the same framework: worker activities at construction sites were analyzed and a modifying factor was assigned to each activity. Worker exposure monitoring data accumulated by the Korea Occupational Safety and Health Agency (KOSHA) over the last 10 years were retrieved and converted into exposure scores, and a separate set of sampling data was collected to validate the developed model. The algorithm was implemented in an Excel spreadsheet for convenience and easy access. Results: The correlation coefficient between the model's exposure scores and the monitoring data was 0.36, which is smaller than those of the EU models (0.6~0.7). One of the main reasons for this discrepancy is the poor description of worker activities in the KOSHA database. Conclusions: The developed tier 1 exposure model can help industrial hygienists judge whether or not air sampling is required.
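
ART-style models score exposure by multiplying a base score by modifying factors assigned to activities, and the study validates model scores against monitoring data via a correlation coefficient. A minimal sketch with entirely hypothetical activities, factor values, and measurements:

```python
import numpy as np

# Hypothetical ART-style scoring: a substance's base emission score is
# multiplied by a modifying factor assigned to each construction activity.
MODIFYING_FACTORS = {"grinding": 3.0, "painting": 1.5, "sealing": 0.5}

def exposure_score(base_score: float, activity: str) -> float:
    return base_score * MODIFYING_FACTORS[activity]

# Validation as in the study: correlate model scores with measured exposures.
model_scores = np.array([exposure_score(1.0, a) for a in
                         ["grinding", "painting", "sealing", "grinding"]])
measured = np.array([2.4, 1.1, 0.7, 3.5])  # hypothetical monitoring data
r = np.corrcoef(model_scores, measured)[0, 1]
print(f"correlation coefficient: {r:.2f}")
```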

AutoML and CNN-based Soft-voting Ensemble Classification Model For Road Traffic Emerging Risk Detection (도로교통 이머징 리스크 탐지를 위한 AutoML과 CNN 기반 소프트 보팅 앙상블 분류 모델)

  • Jeon, Byeong-Uk;Kang, Ji-Soo;Chung, Kyungyong
    • Journal of Convergence for Information Technology / v.11 no.7 / pp.14-20 / 2021
  • Most accidents caused by road icing in winter lead to major accidents, because it is difficult for drivers to detect road icing in advance. In this work, we study how to accurately detect road traffic emerging risks using an ensemble of AutoML and CNN models that uses both structured and unstructured data. We train a CNN-based road traffic emerging risk classification model on images, which are unstructured data, and an AutoML-based classification model on weather data, which are structured data. The ensemble model is then designed to complement the CNN-based classifier by taking as input the probability values produced by each model. This improves road traffic emerging risk classification performance and alerts drivers more accurately and quickly, enabling safe driving.
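
Soft voting, as described, combines the class probabilities of the two models. A minimal sketch; the weights and probability values are hypothetical:

```python
import numpy as np

def soft_vote(p_cnn: np.ndarray, p_automl: np.ndarray,
              weights=(0.5, 0.5)) -> np.ndarray:
    """Average the class-probability outputs of the two models and
    pick the class with the highest combined probability."""
    combined = weights[0] * p_cnn + weights[1] * p_automl
    return combined.argmax(axis=1)

# Hypothetical probabilities for 3 samples over classes [no-risk, icing-risk]:
p_cnn    = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]])  # from image model
p_automl = np.array([[0.6, 0.4], [0.2, 0.8], [0.6, 0.4]])  # from weather model
print(soft_vote(p_cnn, p_automl))  # -> [0 1 1]
```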

A Study on Text Mining Methods to Analyze Civil Complaints: Structured Association Analysis (민원 분석을 위한 텍스트 마이닝 기법 연구: 계층적 연관성 분석)

  • Kim, HyunJong;Lee, TaiHun;Ryu, SeungEui;Kim, NaRang
    • Journal of Korea Society of Industrial Information Systems / v.23 no.3 / pp.13-24 / 2018
  • For government and public institutions, civil complaints containing the direct requirements of citizens can be utilized as important data in developing policies. However, because complaint text is unstructured by nature, it is difficult to extract accurate requirements with existing text mining methods. In this study, a new method is proposed that draws out the exact requirements of citizens, improving on previous text mining approaches to civil complaint data. The method is based on the principle of the Co-Occurrence Structure Map and is structured as a two-step association analysis, so that it consists of first-order related words and second-order related words around a core subject word. For the analysis, 3,004 cases posted on the electronic bulletin board of Busan City during 2016 are used. The academic contribution of this study is a method for deriving citizens' requirements from civil complaint data; as a practical contribution, it enables policy development using such data.
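
The two-step association idea (first-order related words around a core subject word, then second-order words around those) can be approximated with simple co-occurrence counts. A sketch over toy English tokens standing in for morpheme-analyzed Korean complaints:

```python
from collections import Counter

# Hypothetical tokenized complaints; in the study these would be
# morpheme-analyzed Korean complaint texts.
docs = [
    ["bus", "stop", "bench", "broken"],
    ["bus", "route", "schedule", "delay"],
    ["bus", "stop", "shelter", "rain"],
    ["road", "pothole", "repair"],
]

def cooccurring(seed: str, documents) -> Counter:
    """Count words that co-occur with the seed word in the same document."""
    counts = Counter()
    for doc in documents:
        if seed in doc:
            counts.update(w for w in doc if w != seed)
    return counts

# First-order related words around the core subject word...
first_order = cooccurring("bus", docs).most_common(2)
# ...then second-order related words around each first-order word.
for word, _ in first_order:
    print(word, "->", cooccurring(word, docs).most_common(3))
```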

A Tree-structured XPath Query Reduction Scheme for Enhancing XML Query Processing Performance (XML 질의의 수행성능 향상을 위한 트리 구조 XPath 질의의 축약 기법에 관한 연구)

  • Lee, Min-Soo;Kim, Yun-Mi;Song, Soo-Kyung
    • The KIPS Transactions:PartD / v.14D no.6 / pp.585-596 / 2007
  • XML data generally has a hierarchical tree structure, which is reflected in the mechanisms used to store and retrieve it. Therefore, when storing XML data in a database, the hierarchical relationships among the XML elements are taken into consideration while the data is restructured and stored. Also, in order to support users' search queries, a mechanism is needed to compute the hierarchical relationships between the element structures specified by a query. The structural join operation is one solution to this problem: it is an efficient method for computing hierarchical relationships in a database based on a node numbering scheme. However, processing a tree-structured XML query that contains complex nested hierarchical relationships still requires multiple structural joins, which in turn incurs a high query execution cost. Therefore, in this paper we provide a preprocessing mechanism that effectively reduces the cost of multiple nested structural joins by applying the concept of equivalence classes, and we suggest a query path reduction algorithm that shortens path queries expressed as regular expressions. The mechanism is especially devised to reduce path queries containing branch nodes. The experimental results show that the proposed algorithm can reduce the time required for processing path queries to 1/3 of the original execution time.
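
Structural joins rest on a node numbering scheme: each element gets a (start, end) interval from a depth-first traversal, and x is an ancestor of y exactly when x's interval contains y's. A minimal sketch of that scheme (the XML snippet is hypothetical):

```python
import xml.etree.ElementTree as ET

def number_nodes(root):
    """Assign (start, end) interval numbers in document order; an element
    is an ancestor of another iff its interval contains the other's."""
    intervals, counter = {}, [0]
    def visit(node):
        counter[0] += 1
        start = counter[0]
        for child in node:
            visit(child)
        counter[0] += 1
        intervals[node] = (start, counter[0])
    visit(root)
    return intervals

def is_ancestor(iv, a, b):
    return iv[a][0] < iv[b][0] and iv[b][1] < iv[a][1]

root = ET.fromstring("<a><b><c/></b><d/></a>")
iv = number_nodes(root)
b, c = root.find("b"), root.find("b/c")
print(is_ancestor(iv, root, c), is_ancestor(iv, b, root))  # True False
```

A structural join evaluates an ancestor-descendant step by checking exactly this interval containment over two sorted node lists, which is why a deeply nested query needs one join per step and benefits from path reduction.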

Status and Quality Analysis on the Biodiversity Data of East Asian Vascular Plants Mobilized through the Global Biodiversity Information Facility (GBIF) (세계생물다양성정보기구(GBIF)에 출판된 동아시아 관속식물 생물다양성 정보 현황과 자료품질 분석)

  • Chang, Chin-Sung;Kwon, Shin-Young;Kim, Hui
    • Journal of Korean Society of Forest Science / v.110 no.2 / pp.179-188 / 2021
  • Biodiversity informatics applies information technology methods to organizing, accessing, visualizing, and analyzing primary biodiversity data, together with quantitative data management through the scientific names of accepted names and synonyms. We reviewed the GBIF data published by China, Japan, Taiwan, and institutions of the Republic of Korea such as NIBR, NIE, and KNA, and assessed the data for diverse aspects of quality using BRAHMS software. Most data from the four Asian countries have quality problems: a lack of data consistency and missing information on georeferences, collectors, collection dates, and place names (gazetteers), or other invalid data forms. The major problem is that biodiversity management institutions in East Asia are using unstructured databases and simple spreadsheet-type data. Owing to the nature of biodiversity information, if data relationships are not structured, it is impossible to secure the data integrity of scientific names, human names, geographical names, literature, and ecological information. For data quality, it is essential to build data integrity into database management and to train the taxonomists who, as continuous data managers, must correct errors. Thus, publishers in East Asia play an essential role not only in using specialized software to manage biodiversity data but also in developing structured databases and ensuring their integration and value within biodiversity publishing platforms.

An Intelligent Intrusion Detection Model

  • Han, Myung-Mook
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2003.09a / pp.224-227 / 2003
  • Intrusion Detection Systems (IDS) are required to provide accuracy, adaptability, and extensibility in an information society that changes quickly. A more structured and intelligent IDS is also required to protect important and confidential resources in complicated network environments. The purpose of this research is to build a model for an intelligent IDS that creates intrusion patterns. The intrusion patterns are extracted from a vast amount of data; to manage this large volume of data accurately and efficiently, link analysis and sequence analysis, among other data mining techniques, are used to build the model that creates the intrusion patterns. The model consists of a "Time-based Traffic Model", a "Host-based Traffic Model", and a "Content Model", each of which produces different intrusion patterns, and it can create stable patterns efficiently. That is, we can build an intrusion detection model based on intelligent systems. The rules produced by the model represent the intrusion data and classify normal and abnormal users. The data used are the KDD audit data.
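
The "Time-based Traffic Model" mirrors the time-based traffic features of the KDD audit data, which count recent connections sharing an attribute within a sliding window. A minimal sketch over hypothetical connection records:

```python
from collections import deque

# Hypothetical connection records (timestamp, src_host, service), loosely
# in the spirit of the KDD audit data's time-based traffic features.
connections = [(0.1, "h1", "http"), (0.4, "h1", "http"),
               (0.9, "h2", "ftp"),  (1.1, "h1", "http")]

def count_recent(records, window=2.0):
    """For each connection, count earlier connections from the same host
    within the preceding time window (a time-based traffic feature)."""
    recent: deque = deque()
    features = []
    for t, host, service in records:
        while recent and t - recent[0][0] > window:
            recent.popleft()
        features.append(sum(1 for r in recent if r[1] == host))
        recent.append((t, host, service))
    return features

print(count_recent(connections))  # -> [0, 1, 0, 2]
```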

Cancer Genomics Object Model: An Object Model for Cancer Research Using Microarray

  • Park, Yu-Rang;Lee, Hye-Won;Cho, Sung-Bum;Kim, Ju-Han
    • Proceedings of the Korean Society for Bioinformatics Conference / 2005.09a / pp.29-34 / 2005
  • The DNA microarray has become a major tool for the investigation of global gene expression in all aspects of cancer and biomedical research. A DNA microarray experiment generates enormous amounts of data, which are meaningful only in the context of a detailed description of the microarrays, biomaterials, and conditions under which they were generated. The MicroArray Gene Expression Data (MGED) society has established a microarray standard for the structured management of this diverse and large-scale data. MGED's MAGE-OM (MicroArray Gene Expression Object Model) is an object-oriented data model that attempts to define standard objects for gene expression. For DNA microarray analysis to be relevant to cancer research, clinical and genomic data must be combined; MAGE-OM, however, does not have an appropriate structure for describing the clinical information of cancer. For the systematic integration of gene expression and clinical data, we created a new model, the Cancer Genomics Object Model.
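
One way to picture the integration this abstract calls for is a small object model linking a clinical record to expression profiles. The classes and fields below are purely illustrative assumptions, not the actual Cancer Genomics Object Model:

```python
from dataclasses import dataclass, field

# Hypothetical object model in the spirit of extending MAGE-OM with
# clinical information; class and field names are illustrative only.
@dataclass
class ClinicalRecord:
    patient_id: str
    diagnosis: str
    tumor_stage: str

@dataclass
class ExpressionProfile:
    array_id: str
    expression: dict[str, float]  # gene symbol -> expression level

@dataclass
class CancerGenomicsCase:
    """Links one patient's clinical record to microarray profiles."""
    clinical: ClinicalRecord
    profiles: list[ExpressionProfile] = field(default_factory=list)

case = CancerGenomicsCase(
    ClinicalRecord("P001", "colorectal carcinoma", "II"),
    [ExpressionProfile("A-001", {"TP53": 2.4, "KRAS": 0.7})],
)
print(case.clinical.diagnosis, case.profiles[0].expression["TP53"])
```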

Tree-Dependent Components of Gene Expression Data for Clustering (유전자발현데이터의 군집분석을 위한 나무 의존 성분 분석)

  • Kim Jong-Kyoung;Choi Seung-Jin
    • Proceedings of the Korean Information Science Society Conference / 2006.06a / pp.4-6 / 2006
  • Tree-dependent component analysis (TCA) is a generalization of independent component analysis (ICA); its goal is to model multivariate data by a linear transformation of latent variables, while the latent variables are fit by a tree-structured graphical model. In contrast to ICA, TCA allows a dependent structure among the latent variables and also considers non-spanning trees (forests). In this paper, we present a TCA-based method of clustering gene expression data. An empirical study with yeast cell cycle-related data, yeast metabolic shift data, and yeast sporulation data shows that TCA is more suitable for gene clustering than principal component analysis (PCA) or ICA.
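
TCA has no standard library implementation, but the ICA baseline the paper compares against can be sketched with scikit-learn's FastICA: decompose a genes-by-conditions matrix and assign each gene to its strongest latent component. The data here are random stand-ins for real expression profiles:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Hypothetical expression matrix: 100 genes measured under 12 conditions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))

# ICA baseline: recover latent components from the expression matrix.
ica = FastICA(n_components=3, random_state=0)
S = ica.fit_transform(X)             # gene loadings on 3 latent components

# Cluster genes by their dominant component.
clusters = np.abs(S).argmax(axis=1)
print(np.bincount(clusters))         # cluster sizes
```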
