• Title/Summary/Keyword: SQL-on-hadoop

Performance Comparison of DW System Tajo Based on Hadoop and Relational DBMS (하둡 기반 DW시스템 타조와 관계형 DBMS의 성능 비교)

  • Liu, Chen;Ko, Junghyun;Yeo, Jeongmo
    • KIPS Transactions on Software and Data Engineering / v.3 no.9 / pp.349-354 / 2014
  • Since Hadoop, the big-data processing platform, was announced, SQL-on-Hadoop has been in the spotlight as a technique for analyzing data on Hadoop using SQL. Tajo, created by Korean programmers, was recently promoted to Top-Level Project status by Apache in April and has attracted attention around the world. Despite the noticeable change that Hadoop's appearance has caused in the DW market, research on the performance of these systems is insufficient. This study therefore runs a comparative analysis of an RDBMS and Tajo to help in choosing a DW solution based on SQL-on-Hadoop. The results show that Hadoop-based Tajo outperforms the RDBMS when it is used with an appropriate strategy. In addition, the open-source project Tajo is expected not only to improve technically through the active participation of many developers but also to play an important DW role in the field of data analysis.

Security Threats and Review for SQL on Hadoop (SQL on Hadoop 기술 동향 및 보안 위협)

  • Youn, Han Jung;Suk, Sang Kee
    • Proceedings of the Korea Information Processing Society Conference / 2015.04a / pp.691-693 / 2015
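  • SQL on Hadoop is a technology that uses SQL to process user queries over data stored in the Hadoop Distributed File System. It aims to use SQL to overcome a drawback of existing Hadoop systems: because of the limitations of MapReduce and the need for compatibility with existing systems, they inevitably had to be used alongside an RDBMS. This paper examines the characteristics of, and research trends in, Hive and Impala, two representative SQL on Hadoop frameworks, and considers the security threats that can be expected.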

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg;Kang, Woon-hak;Lee, Sang-won
    • Journal of KIISE / v.43 no.7 / pp.820-826 / 2016
  • There is a growing requirement for big data processing, which extracts valuable information from a large amount of data. The Hadoop system employs the MapReduce framework to process big data. However, MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop systems, was developed by a Korean development group. External merge sort is one of the most heavily used algorithms in Tajo for query processing. The performance of external merge sort in Tajo is influenced by two parameters: sort buffer size and fanout. In this paper, we analyzed the performance of external merge sort in Tajo with various sort buffer sizes and fanouts. We identified two major causes of the performance differences: CPU cache misses, which increase as the sort buffer size grows, and the number of merge passes, which is determined by the fanout.
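
The relationship between fanout and the number of merge passes mentioned in this abstract can be illustrated with the textbook external merge sort model. The sketch below is not taken from the paper or from Tajo's code; the function names and the assumption that each initial run fills exactly one sort buffer are illustrative only.

```python
def initial_runs(data_bytes: int, sort_buffer_bytes: int) -> int:
    """Number of sorted runs produced by the first phase.

    Assumes (textbook model, not Tajo-specific) that each run is
    one sort buffer's worth of data.
    """
    if data_bytes < 0 or sort_buffer_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division without floating point.
    return -(-data_bytes // sort_buffer_bytes)


def merge_passes(runs: int, fanout: int) -> int:
    """Number of merge passes needed to reduce `runs` to a single run
    when at most `fanout` runs are merged at a time."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    passes = 0
    while runs > 1:
        runs = -(-runs // fanout)  # each pass merges groups of `fanout` runs
        passes += 1
    return passes


if __name__ == "__main__":
    runs = initial_runs(data_bytes=1000, sort_buffer_bytes=100)
    for fanout in (2, 4, 16):
        print(fanout, merge_passes(runs, fanout))
```

In this model a larger fanout lowers the pass count, while a larger sort buffer lowers the number of initial runs; the abstract's observation about CPU cache misses is a separate effect that this sketch does not capture.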

Interoperability between NoSQL and RDBMS via Auto-mapping Scheme in Distributed Parallel Processing Environment (분산병렬처리 환경에서 오토매핑 기법을 통한 NoSQL과 RDBMS와의 연동)

  • Kim, Hee Sung;Lee, Bong Hwan
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.11 / pp.2067-2075 / 2017
  • Big data processing has recently become an emerging issue. As huge amounts of data are generated, data processing capability is becoming increasingly important. In big data processing, both the Hadoop distributed file system and NoSQL data stores designed for unstructured data are receiving a lot of attention. However, using NoSQL still involves problems and inconveniences. For low-volume data, NoSQL MapReduce normally consumes unnecessary processing time and requires considerably more data retrieval time than an RDBMS. To address this problem, this paper proposes an interworking scheme between NoSQL and a conventional RDBMS. The developed auto-mapping scheme chooses an appropriate database (NoSQL or RDBMS) depending on the amount of data, which results in faster search times. Experimental results for a specific data set show that the database interworking scheme reduces data search time by up to 35%.
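
The idea of choosing a store by data volume can be sketched as a simple size-based router. This is only an illustration of the general idea in the abstract, not the paper's auto-mapping scheme: the function name, the row-count criterion, and the threshold value are all hypothetical.

```python
# Hypothetical cut-off; the abstract does not state any threshold.
DEFAULT_THRESHOLD_ROWS = 100_000


def choose_store(row_count: int, threshold_rows: int = DEFAULT_THRESHOLD_ROWS) -> str:
    """Return which backend a query should go to, based on data volume.

    Small data sets go to the RDBMS (to avoid MapReduce overhead),
    larger ones to the NoSQL store. Purely illustrative.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    return "rdbms" if row_count < threshold_rows else "nosql"


if __name__ == "__main__":
    for n in (500, 100_000, 5_000_000):
        print(n, choose_store(n))
```

A real scheme would presumably derive the cut-off from measurements rather than a fixed constant.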

Design of Spark SQL Based Framework for Advanced Analytics (Spark SQL 기반 고도 분석 지원 프레임워크 설계)

  • Chung, Jaehwa
    • KIPS Transactions on Software and Data Engineering / v.5 no.10 / pp.477-482 / 2016
  • As advanced analytics on big data becomes indispensable for agile decision-making and tactical planning in enterprises, distributed processing platforms such as Hadoop and Spark, which distribute and process large volumes of data across multiple nodes, are receiving great attention in the field. In the Spark platform stack, Spark SQL was recently unveiled to let Spark support SQL-based distributed processing. However, Spark SQL cannot effectively handle advanced analytics involving machine learning and graph processing, in terms of iterative tasks and task allocation. Motivated by these issues, this paper proposes the design of an SQL-based optimal big data processing engine and a processing framework to support advanced analytics in Spark environments. The optimal big data processing engine handles complex SQL queries that involve multiple parameters and join, aggregation, and sorting operations in a distributed/parallel manner, and the proposed framework optimizes the machine learning process in terms of relational operations.

An Efficient Design and Implementation of an MdbULPS in a Cloud-Computing Environment

  • Kim, Myoungjin;Cui, Yun;Lee, Hanku
    • KSII Transactions on Internet and Information Systems (TIIS) / v.9 no.8 / pp.3182-3202 / 2015
  • Flexibly expanding the storage capacity required to process a large amount of rapidly increasing unstructured log data is difficult in a conventional computing environment. In addition, implementing a log processing system that can categorize and analyze unstructured log data is extremely difficult. To overcome such limitations, we propose and design a MongoDB-based unstructured log processing system (MdbULPS) for collecting, categorizing, and analyzing log data generated from banks. The proposed system includes a Hadoop-based analysis module for reliable parallel-distributed processing of massive log data. Furthermore, because the Hadoop distributed file system (HDFS) stores data by generating replicas of collected log data in block units, the proposed system offers automatic system recovery against system failures and data loss. Finally, by establishing a distributed database using the NoSQL-based MongoDB, the proposed system provides methods of effectively processing unstructured log data. To evaluate the proposed system, we conducted three different performance tests on a local test bed of twelve nodes: comparing our system with a MySQL-based approach, comparing it with an HBase-based approach, and changing the chunk size option. From the experiments, we found that our system showed better performance in processing unstructured log data.

Efficient Multimedia Data File Management and Retrieval Strategy on Big Data Processing System

  • Lee, Jae-Kyung;Shin, Su-Mi;Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.20 no.8 / pp.77-83 / 2015
  • The storage and retrieval of multimedia data is becoming increasingly important in many application areas, including record management, video (CCTV) management, and the Internet of Things (IoT). In these applications, the number of multimedia files that need to be stored and managed is tremendous and constantly growing. In this paper, we propose a technique to retrieve a very large number of files in multimedia formats using the Hadoop framework. Our strategy is based on managing metadata that describes the characteristics of the files stored in the Hadoop Distributed File System (HDFS). The metadata schema is represented in HBase and looked up using SQL on Hadoop (Hive, Tajo). HBase, Hive, and Tajo are all part of the Hadoop ecosystem. A preliminary experiment on multimedia data files stored in HDFS shows the viability of the proposed strategy.

SQL Data Transport Technique for Efficient Hybrid Data Processing on Distributed and Parallel Environment (분산 병렬 환경에서 효율적인 이종 데이터 처리를 위한 SQL 데이터 전송 기법)

  • Yang, HyeonSik;Baek, Naeun;Sung, Mirae;Chang, Jae-woo
    • Proceedings of the Korea Information Processing Society Conference / 2015.10a / pp.1102-1105 / 2015
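  • Since the growth of the Internet accelerated and SNS became widespread, data traffic has been generated on a scale incomparable with the past. Because conventional DBMSs could not handle this effectively, NoSQL systems such as Hadoop emerged, and recent research has pursued flexible and powerful data management through cooperation between NoSQL and conventional SQL DBMSs. Representative work on efficient query processing includes SQL-based distributed parallel query processing techniques and Hive. However, existing techniques do not take the distributed parallel environment into account and therefore cannot efficiently transfer the query results of an SQL DBMS to Hive. This paper proposes a technique that minimizes network cost for the efficient transfer of SQL data from an SQL DBMS to Hive, and demonstrates the superiority of the proposed technique.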

SPARQL Query Processing System over Scalable Triple Data using SparkSQL Framework (SparQLing : SparkSQL 기반 대용량 트리플 데이터를 위한 SPARQL 질의 시스템 구축)

  • Jeon, MyungJoong;Hong, JinYoung;Park, YoungTack
    • Journal of KIISE / v.43 no.4 / pp.450-459 / 2016
  • RDFS data grows larger every year; hence, the way SPARQL queries are processed needs to change to keep queries fast. SPARQL query processing has been studied using scalable distributed processing frameworks. Current studies indicate that a query engine based on a scalable distributed processing framework such as Hadoop (MapReduce) is not suitable for real-time processing because of its repetitive tasks; in addition, it is difficult to construct a query engine on top of an in-memory distributed query engine, because the low-level distributed structure has to be considered. In this paper, we propose a method for constructing a query engine that improves query processing speed over large-scale triple data. The query engine processes SPARQL queries using SparkSQL, an in-memory distributed query processing framework. SparkSQL is a high-level distributed query engine that supports existing SQL statements. To process a SPARQL query, an Algebra Tree is first generated using Jena and must then be translated into a Spark Algebra Tree for use in the Spark system, and we constructed a system that generates the SparkSQL query. Furthermore, we propose a DataFrame-based triple property table design for more efficient query processing in the Spark system. Finally, we verified the validity of the approach through a comparative evaluation against a query engine based on an existing distributed processing framework.