• Title/Summary/Keyword: Hadoop framework


Efficient Multimedia Data File Management and Retrieval Strategy on Big Data Processing System

  • Lee, Jae-Kyung;Shin, Su-Mi;Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.20 no.8 / pp.77-83 / 2015
  • The storage and retrieval of multimedia data is becoming increasingly important in many application areas, including record management, video (CCTV) management, and the Internet of Things (IoT). In these applications, the number of multimedia files that need to be stored and managed is enormous and constantly growing. In this paper, we propose a technique to retrieve a very large number of multimedia files using the Hadoop framework. Our strategy is based on managing metadata that describes the characteristics of the files stored in the Hadoop Distributed File System (HDFS). The metadata schema is represented in HBase and queried using SQL-on-Hadoop engines (Hive, Tajo). HBase, Hive, and Tajo are all part of the Hadoop ecosystem. A preliminary experiment on multimedia data files stored in HDFS shows the viability of the proposed strategy.

Hadoop Security Technologies and Vulnerability Analysis (하둡 보안 기술과 취약점 분석)

  • Kim, A-Yong;He, Yilun;Kim, Han-Kil;Park, Man-Seub;Jung, Hoe-Kyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2013.05a / pp.681-683 / 2013
  • With the spread of smartphones, we have entered the Big Data era, and social network services (SNS) such as Facebook and Twitter are used routinely in everyday life. Hadoop, developed under the Apache Foundation, makes it possible to analyze, extract, and utilize unstructured SNS data rather than discarding it. Hadoop is an open-source framework that can handle large amounts of data. Hadoop has been adopted by domestic companies and in commercial development, but relative to the pace of its technical development, its lack of security has been pointed out. In this paper, we analyze Hadoop's security technologies and vulnerabilities and propose a method to enhance its security.

A GPU-enabled Face Detection System in the Hadoop Platform Considering Big Data for Images (이미지 빅데이터를 고려한 하둡 플랫폼 환경에서 GPU 기반의 얼굴 검출 시스템)

  • Bae, Yuseok;Park, Jongyoul
    • KIISE Transactions on Computing Practices / v.22 no.1 / pp.20-25 / 2016
  • With the advent of the era of digital big data, the Hadoop platform has become widely used in various fields. However, when processing a large number of small files, the Hadoop MapReduce framework suffers from increased main-memory usage on the name node and a growing number of map tasks. In addition, a method for running C++-based tasks in the MapReduce framework is required in order to make use of GPUs, which support hardware-based data parallelism. Therefore, in this paper, we present a face detection system that generates a sequence file from images in order to process big image data on the Hadoop platform. The system also runs GPU-based face detection tasks in the MapReduce framework using Hadoop Pipes. We demonstrate a performance increase of around 6.8-fold compared to a single CPU process.

Design of a Large-scale Task Dispatching & Processing System based on Hadoop (하둡 기반 대규모 작업 배치 및 처리 기술 설계)

  • Kim, Jik-Soo;Cao, Nguyen;Kim, Seoyoung;Hwang, Soonwook
    • Journal of KIISE / v.43 no.6 / pp.613-620 / 2016
  • This paper presents the MOHA (Many-Task Computing on Hadoop) framework, which aims to effectively apply Many-Task Computing (MTC) technologies, originally developed for high-performance processing of many tasks, to the existing Big Data processing platform Hadoop. We present the basic concepts, motivation, preliminary results of a proof of concept (PoC) based on a distributed message queue, and future research directions of MOHA. MTC applications may have relatively low I/O requirements per task. However, a very large number of tasks must be processed efficiently, with potentially heavy file-based communication between tasks. Therefore, MTC applications can exhibit a different pattern of data-intensive workloads from existing Hadoop applications, which are typically based on relatively large data block sizes. Through an effective convergence of MTC and Big Data technologies, we introduce a new MOHA framework that can support large-scale scientific applications along with the Hadoop ecosystem, which is evolving into a multi-application platform.

Task failure resilience technique for improving the performance of MapReduce in Hadoop

  • Kavitha, C;Anita, X
    • ETRI Journal / v.42 no.5 / pp.748-760 / 2020
  • MapReduce is a framework that can process huge datasets in parallel and distributed computing environments. However, a single machine failure during the runtime of MapReduce tasks can increase completion time by 50%. MapReduce handles task failures by restarting the failed task and re-computing all input data from scratch, regardless of how much data had already been processed. To solve this issue, we need the computed key-value pairs to persist in a storage system to avoid re-computing them during the restarting process. In this paper, the task failure resilience (TFR) technique is proposed, which allows the execution of a failed task to continue from the point it was interrupted without having to redo all the work. Amazon ElastiCache for Redis is used as a non-volatile cache for the key-value pairs. We measured the performance of TFR by running different Hadoop benchmarking suites. TFR was implemented using the Hadoop software framework, and the experimental results showed significant performance improvements when compared with the performance of the default Hadoop implementation.
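The resume-from-interruption idea in this abstract can be illustrated with a minimal sketch. It is not the authors' TFR implementation: the function name `run_map_task`, the `fail_at` parameter, and the plain dictionary standing in for the external key-value cache are all assumptions made for illustration. Each processed record's output pairs and the next record offset are persisted, so a restarted task skips the records that were already handled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, int]


def run_map_task(
    task_id: str,
    records: List[str],
    map_fn: Callable[[str], List[Pair]],
    store: Dict[str, dict],
    fail_at: Optional[int] = None,
) -> List[Pair]:
    """Run a map task, checkpointing progress in `store` (hypothetical cache).

    `store` stands in for an external key-value cache that survives the task.
    `fail_at` simulates a crash just before the record at that index.
    """
    # Load the checkpoint if one exists, otherwise start from record 0.
    state = store.setdefault(task_id, {"next": 0, "pairs": []})
    for i in range(state["next"], len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure before record {i}")
        # Persist the computed pairs, then advance the resume offset.
        state["pairs"].extend(map_fn(records[i]))
        state["next"] = i + 1
    return state["pairs"]
```

A real system would need the pair write and the offset update to be atomic (or idempotent); this sketch ignores that concern.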

Implementation and Performance Analysis of Hadoop MapReduce over Lustre Filesystem (러스터 파일 시스템 기반 하둡 맵리듀스 실행 환경 구현 및 성능 분석)

  • Kwak, Jae-Hyuck;Kim, Sangwan;Huh, Taesang;Hwang, Soonwook
    • KIISE Transactions on Computing Practices / v.21 no.8 / pp.561-566 / 2015
  • Hadoop is becoming widely adopted in scientific and commercial areas as an open-source distributed data processing framework. Recently, attempts have been made to apply high-performance computing technologies to Hadoop for real-time processing and analysis of data. In this paper, we extend the Hadoop Filesystem library to support Lustre, a popular high-performance parallel distributed filesystem, and implement a Hadoop MapReduce execution environment over the Lustre filesystem. We analyze Hadoop MapReduce over Lustre using Hadoop's standard benchmark tools. We find that Hadoop MapReduce over Lustre performs 2 to 13 times better than a typical Hadoop MapReduce execution.

Implementation of Video Processing Framework using Hadoop-based cloud computing (Hadoop 기반 클라우드 컴퓨팅을 이용한 영상 처리 프레임워크 구현)

  • Ryu, Chungmo;Lee, Daecheol;Jang, Minwook;Kim, Cheolgi
    • Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.139-142 / 2013
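  • Recently, cloud-related research on collecting information from, and processing, large-scale video data has been active. However, most cloud research using open-source software combines components at the program level rather than at the library level. As a result, the performance problems caused by the inefficiency of such simple combination are rarely addressed. In this paper, focusing on resolving this inefficiency, we combine FFmpeg and Hadoop at the library level to build a video cloud environment with better performance than existing ones. We use JNI (Java Native Interface) to combine FFmpeg, a C-based video processing library, with Hadoop, a Java-based cloud environment. In the detailed implementation, we extend HDFS (Hadoop Distributed File System) so that Hadoop MapReduce can access video files directly through FFmpeg. This prevents the performance degradation caused by unnecessary work arising from the different file access methods of FFmpeg and Hadoop. In addition, for application scalability, we split the video being processed into GOP (Group of Pictures) units, the smallest unit of video processing, and distribute them to the nodes of the cloud. As a result, the total processing time is shorter than that of an existing video processing cloud that combines Hadoop and FFmpeg at the program level, and GOP-level video processing ensures stability and application scalability for video-based work.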
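The GOP-based distribution mentioned in this abstract can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: it models a video as a list of frame-type letters, starts a new GOP at every I-frame, and hands GOPs to nodes round-robin. The function names, the frame-letter representation, and the round-robin policy are all hypothetical.

```python
from typing import Dict, List


def split_into_gops(frame_types: List[str]) -> List[List[int]]:
    """Group frame indices into GOPs; each I-frame opens a new GOP."""
    gops: List[List[int]] = []
    for idx, ftype in enumerate(frame_types):
        # Start a new group on an I-frame, or if no group exists yet.
        if ftype == "I" or not gops:
            gops.append([])
        gops[-1].append(idx)
    return gops


def assign_round_robin(
    gops: List[List[int]], nodes: List[str]
) -> Dict[str, List[List[int]]]:
    """Distribute whole GOPs across nodes in round-robin order (assumed policy)."""
    assignment: Dict[str, List[List[int]]] = {node: [] for node in nodes}
    for i, gop in enumerate(gops):
        assignment[nodes[i % len(nodes)]].append(gop)
    return assignment
```

Because each GOP begins with a self-contained I-frame, a node can decode its GOPs without frames held by other nodes, which is why the GOP is a natural unit for splitting.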

Task Assignment Policy for Hadoop Considering Availability of Nodes (노드의 가용성을 고려한 하둡 태스크 할당 정책)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.05a / pp.103-105 / 2017
  • Hadoop MapReduce is a processing framework in which users' jobs can be efficiently processed in a parallel and distributed manner on a Hadoop cluster. MapReduce task schedulers select target nodes and assign users' tasks to them. Previous schedulers cannot fully utilize the resources of a Hadoop cluster because they do not consider the dynamic characteristics of the cluster that arise from the availability of its nodes. To increase the utilization of the Hadoop cluster, this paper proposes a novel task assignment policy for MapReduce that efficiently assigns a job's tasks to a dynamic cluster by considering the availability of each node.

Scaling of Hadoop Cluster for Cost-Effective Processing of MapReduce Applications (비용 효율적 맵리듀스 처리를 위한 클러스터 규모 설정)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences / v.15 no.1 / pp.107-114 / 2020
  • This paper studies a method for estimating the scale of a Hadoop cluster needed to process big data in a cost-effective manner. In the case of medical institutions, demand for cloud-based big data analysis is increasing now that medical records can be stored outside the hospital. This paper first analyzes the Amazon EMR framework, one of the popular cloud-based big data frameworks. It then presents an efficiency model for scaling the Hadoop cluster to execute a MapReduce application more cost-effectively. It also analyzes the factors that influence the execution of a MapReduce application by performing several experiments under various conditions. The cost efficiency of big data analysis can be increased by setting the scale of the cluster to the one with the most efficient processing time relative to the operational cost.

Advanced Resource Management with Access Control for Multitenant Hadoop

  • Won, Heesun;Nguyen, Minh Chau;Gil, Myeong-Seon;Moon, Yang-Sae
    • Journal of Communications and Networks / v.17 no.6 / pp.592-601 / 2015
  • Multitenancy has gained growing importance with the development and evolution of cloud computing technology. In a multitenant environment, multiple tenants with different demands can share a variety of computing resources (e.g., CPU, memory, storage, network, and data) within a single system, while each tenant remains logically isolated. This multitenancy concept offers highly efficient and cost-effective systems, without wasting computing resources, to enterprises requiring similar environments for data processing and management. In this paper, we propose a novel approach supporting multitenancy features for Apache Hadoop, a large-scale distributed system commonly used for processing big data. We first analyze the Hadoop framework, focusing on "yet another resource negotiator (YARN)", which is responsible for managing resources, application runtime, and access control in the latest version of Hadoop. We then define the problems in supporting multitenancy and formally derive the requirements to solve these problems. Based on these requirements, we design the details of multitenant Hadoop. We also present experimental results to validate the data access control and to evaluate the performance enhancement of multitenant Hadoop.