Title/Summary/Keyword: HADOOP

Measuring Hadoop Optimality by Lorenz Curve (로렌츠 커브를 이용한 하둡 플랫폼의 최적화 지수)

  • Kim, Woo-Cheol; Baek, Changryong
    • The Korean Journal of Applied Statistics / v.27 no.2 / pp.249-261 / 2014
  • Ever-increasing "big data" can only be processed effectively by parallel computing. Parallel computing refers to a high-performance computational method that achieves effectiveness by dividing a big query into smaller subtasks and aggregating the subtask results into an output. However, it is well known that parallel computing does not automatically achieve scalability, that is, a linear improvement in performance as more computers are added, because it requires very careful assignment of tasks to each node and timely collection of results. Hadoop is one of the most successful platforms for attaining scalability. In this paper, we propose a measure of Hadoop optimization that utilizes a Lorenz curve as a proxy for the inequality of hardware resource usage. Our proposed index takes into account the intrinsic overheads of Hadoop systems such as CPU, disk I/O, and network. It therefore also indicates whether, and to what extent, a given Hadoop system can be improved. The proposed method is illustrated with experimental data and substantiated by Monte Carlo simulations.
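
The core of such an index can be sketched in a few lines. The snippet below is a minimal sketch, not the paper's actual index (which additionally folds in the CPU, disk I/O, and network overheads): it builds a Lorenz curve from per-node load figures and derives the corresponding Gini-style inequality number, using hypothetical node times.

```python
import numpy as np

def lorenz_curve(loads):
    """Cumulative share of total load carried by the k least-loaded nodes."""
    x = np.sort(np.asarray(loads, dtype=float))
    cum = np.cumsum(x) / x.sum()
    return np.insert(cum, 0, 0.0)  # the curve starts at (0, 0)

def gini(loads):
    """Twice the area between the Lorenz curve and the line of equality."""
    curve = lorenz_curve(loads)
    n = len(curve) - 1
    area = np.trapz(curve, dx=1.0 / n)  # area under the Lorenz curve
    return 1.0 - 2.0 * area

# Hypothetical per-node task completion times in a 6-node cluster:
node_times = [42.0, 45.0, 44.0, 43.0, 90.0, 41.0]  # one straggler node
print(gini(node_times))  # closer to 0 means a better-balanced cluster
```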

A GPU-enabled Face Detection System in the Hadoop Platform Considering Big Data for Images (이미지 빅데이터를 고려한 하둡 플랫폼 환경에서 GPU 기반의 얼굴 검출 시스템)

  • Bae, Yuseok; Park, Jongyoul
    • KIISE Transactions on Computing Practices / v.22 no.1 / pp.20-25 / 2016
  • With the advent of the digital big data era, the Hadoop platform has become widely used in various fields. However, the Hadoop MapReduce framework suffers from growth in the name node's main memory usage and in the number of map tasks when it processes a large number of small files. In addition, a method for running C++-based tasks in the MapReduce framework is needed in order to exploit GPUs, which support hardware-based data parallelism, within the framework. Therefore, in this paper we present a face detection system that generates a sequence file for images in order to handle image big data on the Hadoop platform, and that runs GPU-based face detection tasks in the MapReduce framework using Hadoop Pipes. We demonstrate a performance increase of around 6.8-fold compared to a single CPU process.
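
The idea behind the sequence file is worth making concrete: many small images are packed into one large key/value file, so the name node tracks a single file instead of thousands. The sketch below imitates that packing with a simple length-prefixed container; it illustrates the concept only and is not Hadoop's actual SequenceFile on-disk format (which is normally written through the Java SequenceFile.Writer API).

```python
import os
import struct

def pack_images(image_dir, out_path):
    """Pack many small image files into one length-prefixed key/value file."""
    with open(out_path, "wb") as out:
        for name in sorted(os.listdir(image_dir)):
            with open(os.path.join(image_dir, name), "rb") as img:
                data = img.read()
            key = name.encode("utf-8")
            out.write(struct.pack(">II", len(key), len(data)))  # record header
            out.write(key)
            out.write(data)

def unpack_images(path):
    """Yield (filename, bytes) records back out of the container."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            klen, vlen = struct.unpack(">II", header)
            yield f.read(klen).decode("utf-8"), f.read(vlen)
```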

An Efficient Implementation of Mobile Raspberry Pi Hadoop Clusters for Robust and Augmented Computing Performance

  • Srinivasan, Kathiravan; Chang, Chuan-Yu; Huang, Chao-Hsi; Chang, Min-Hao; Sharma, Anant; Ankur, Avinash
    • Journal of Information Processing Systems / v.14 no.4 / pp.989-1009 / 2018
  • Rapid advances in science and technology, along with the exponential development of smart mobile devices, workstations, supercomputers, smart gadgets, and network servers, have been witnessed over the past few years. The sudden increase in the Internet population and the manifold growth in Internet speeds have occasioned the generation of an enormous amount of data, now termed 'big data'. In this scenario, storing data on local servers or a personal computer becomes an issue, which can be resolved by utilizing cloud computing, and several cloud computing service providers are now available to address big data needs. This paper establishes a framework that builds Hadoop clusters on the new single-board computer (SBC) Mobile Raspberry Pi; the clusters offer facilities for storage as well as computing. Regular data centers require large amounts of energy for operation, need cooling equipment, and occupy prime real estate. Both the energy consumption and the physical space constraints can be addressed by employing Mobile Raspberry Pi Hadoop clusters, which provide a cost-effective, low-power, high-speed solution with micro-data-center support for big data. Hadoop provides the required modules for the distributed processing of big data by deploying the map-reduce programming approach. In this work, the performance of SBC clusters and a single computer was compared. The experimental data show that the SBC clusters outperform a single computer by around 20%, and that the cluster's processing speed for large volumes of data can be enhanced further by increasing the number of SBC nodes. Data storage is accomplished with the Hadoop Distributed File System (HDFS), which offers more flexibility and greater scalability than a single computer system.
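
As a concrete instance of the map-reduce programming approach such a cluster runs, below is the canonical word-count pair of Hadoop Streaming scripts in Python. This is a minimal sketch; the paper's own benchmark workload is not specified, and the script names are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
# so equal words arrive adjacently and one running counter suffices
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current and current is not None:
        print(f"{current}\t{count}")
        count = 0
    current = word
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Both scripts would be submitted through the hadoop-streaming jar shipped with Hadoop (with -mapper mapper.py and -reducer reducer.py), which is the usual way to run non-Java tasks on such a cluster.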

Lambda Architecture Used Apache Kudu and Impala (Apache Kudu와 Impala를 활용한 Lambda Architecture 설계)

  • Hwang, Yun-Young; Lee, Pil-Won; Shin, Yong-Tae
    • KIPS Transactions on Computer and Communication Systems / v.9 no.9 / pp.207-212 / 2020
  • The amount of data has increased significantly due to advances in technology, and various big data processing platforms are emerging to handle it. Among them, the most widely used platform is Hadoop, developed by the Apache Software Foundation, and Hadoop is also used in the IoT field. However, the existing Hadoop-based environment for collecting and analyzing IoT sensor data suffers from the small-file problem of HDFS, Hadoop's core storage project, which overloads the name node, and it does not allow imported data to be updated or deleted. This paper designs a Lambda Architecture using Apache Kudu and Impala. The proposed architecture classifies IoT sensor data into cold data and hot data, stores each in storage suited to its characteristics, and combines the batch views created by the batch layer with the real-time views generated through Apache Kudu and Impala, thereby solving the problems of the existing Hadoop-based environment and shortening the time it takes users to reach the analyzed data.
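
A minimal sketch of the hot/cold split described above, assuming a purely age-based cutoff (the paper does not state its exact classification rule, and the 24-hour window below is hypothetical):

```python
import time

HOT_WINDOW_SECONDS = 24 * 3600  # hypothetical cutoff: last 24 h counts as hot

def route_reading(reading, now=None):
    """Classify one IoT sensor reading as 'hot' or 'cold' by its age.

    `reading` is a dict such as {"sensor_id": ..., "ts": epoch_seconds,
    "value": ...}. In the proposed architecture, hot data would land in
    Kudu (mutable and queryable through Impala) and cold data in HDFS
    for batch processing.
    """
    now = time.time() if now is None else now
    return "hot" if now - reading["ts"] <= HOT_WINDOW_SECONDS else "cold"

print(route_reading({"sensor_id": 7, "ts": time.time() - 3600, "value": 21.4}))
```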

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg; Kang, Woon-hak; Lee, Sang-won
    • Journal of KIISE / v.43 no.7 / pp.820-826 / 2016
  • There is a growing requirement for big data processing that extracts valuable information from large amounts of data. The Hadoop system employs the MapReduce framework to process big data, but MapReduce has limitations such as inflexible and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop systems, was developed by a Korean development group. External merge sort is one of the most heavily used algorithms in Tajo for query processing, and its performance in Tajo is influenced by two parameters: sort buffer size and fanout. In this paper, we analyze the performance of external merge sort in Tajo under various sort buffer sizes and fanouts. In addition, we identify two major causes of the performance differences: CPU cache misses, which increase as the sort buffer size grows, and the number of merge passes, which is determined by the fanout.
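
The interplay of the two parameters is easy to see in a toy external merge sort. The sketch below is plain Python over temporary text files, not Tajo's implementation: it cuts the input into sorted runs of buffer_size records, then merges fanout runs at a time, so a larger fanout means fewer merge passes.

```python
import heapq
import tempfile

def external_merge_sort(numbers, buffer_size=4, fanout=2):
    """Toy external merge sort parameterized like Tajo's (sort buffer, fanout)."""
    def write_run(sorted_records):
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{r}\n" for r in sorted_records)
        f.seek(0)
        return f

    def read_run(f):
        return (int(line) for line in f)

    # Phase 1: sorted initial runs of at most buffer_size records each.
    runs = [write_run(sorted(numbers[i:i + buffer_size]))
            for i in range(0, len(numbers), buffer_size)]

    # Phase 2: merge up to `fanout` runs at a time until one run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fanout):
            group = runs[i:i + fanout]
            merged.append(write_run(heapq.merge(*map(read_run, group))))
            for f in group:
                f.close()
        runs = merged

    return [int(line) for line in runs[0]]

print(external_merge_sort([9, 3, 7, 1, 8, 2, 6, 5, 4], buffer_size=3, fanout=2))
```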

Development of Application to Deal with Large Data Using Hadoop for 3D Printer (하둡을 이용한 3D 프린터용 대용량 데이터 처리 응용 개발)

  • Lee, Kang Eun; Kim, Sungsuk
    • KIPS Transactions on Software and Data Engineering / v.9 no.1 / pp.11-16 / 2020
  • 3D printing is an emerging technology that is attracting a lot of attention. To print a 3D object, a 3D model is first generated and then converted into G-code, the 3D printer's operation language. A facet, which is a small triangle, represents a small surface patch of the 3D model. Depending on the height or precision of the 3D model, the number of facets becomes very large, so the conversion from 3D model to G-code takes longer. Apache Hadoop is a software framework that supports distributed processing of large data sets, and its range of applications keeps widening. In this paper, Hadoop is used to perform the conversion in a time-efficient way. A two-phase distributed algorithm is developed first: all facets are sorted by their lowest Z values, divided into N parts, and converted on several nodes independently. The algorithm is implemented in four steps: preprocessing, Map, Shuffling, and Reduce in Hadoop. Finally, for the performance evaluation, Hadoop systems are set up and test 3D models are converted while the height or precision is varied.
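
Phase 1 of such a scheme might look like the sketch below (a hypothetical helper, not the authors' code): facets are ordered by their lowest Z value and cut into N contiguous height bands, so each node can slice its band independently.

```python
def partition_facets(facets, n_parts):
    """Sort facets by lowest Z and split them into n_parts contiguous bands.

    Each facet is a vertex triple ((x1, y1, z1), (x2, y2, z2), (x3, y3, z3)).
    """
    ordered = sorted(facets, key=lambda tri: min(v[2] for v in tri))
    size, rem = divmod(len(ordered), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)  # spread the remainder
        parts.append(ordered[start:end])
        start = end
    return parts
```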

Design and Implementation of Vehicle Route Tracking System using Hadoop-Based Bigdata Image Processing (하둡 기반 빅데이터 영상 처리를 통한 차량 이동경로 추적 시스템의 설계 및 구현)

  • Yang, Seongeun; Choi, Changyeol; Choi, Hwangkyu
    • Journal of Digital Contents Society / v.14 no.4 / pp.447-454 / 2013
  • As the number of surveillance CCTVs increases every year, big data image processing of CCTV image data has become a hot issue. In this paper, we propose a Hadoop-based big data image processing technique that recognizes vehicle numbers from a large volume of automatic number plate images taken by CCTVs. We also implement a vehicle route tracking system that displays the moving path of a searched vehicle on Google Maps together with the related information. To evaluate the performance, we compare and analyze the vehicle number recognition times over a large set of CCTV image data in Hadoop and in a single-PC environment.
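
In MapReduce terms, the tracking step can be pictured as the sketch below (a hypothetical record layout, not the authors' code): the map side turns each recognized plate into a (plate, sighting) pair, and the reduce side orders each plate's sightings by time into the route to draw on the map.

```python
from collections import defaultdict

def map_sighting(record):
    """Map step: one recognition result -> (plate, (timestamp, lat, lon))."""
    plate, ts, lat, lon = record  # stands in for the plate recognizer output
    return plate, (ts, lat, lon)

def reduce_routes(pairs):
    """Reduce step: group sightings per plate and order them by time."""
    routes = defaultdict(list)
    for plate, sighting in pairs:
        routes[plate].append(sighting)
    # A time-ordered (ts, lat, lon) list per plate is what the system
    # would plot as a moving path on Google Maps.
    return {plate: sorted(points) for plate, points in routes.items()}
```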

Scaling of Hadoop Cluster for Cost-Effective Processing of MapReduce Applications (비용 효율적 맵리듀스 처리를 위한 클러스터 규모 설정)

  • Ryu, Woo-Seok
    • The Journal of the Korea Institute of Electronic Communication Sciences / v.15 no.1 / pp.107-114 / 2020
  • This paper studies a method for estimating the scale of a Hadoop cluster that processes big data in a cost-effective manner. In the case of medical institutions, demand for cloud-based big data analysis is increasing now that medical records can be stored outside the hospital. This paper first analyzes the Amazon EMR framework, one of the popular cloud-based big data frameworks, and then presents an efficiency model for scaling the Hadoop cluster so that a MapReduce application executes more cost-effectively. The paper also analyzes the factors that influence the execution of the MapReduce application through several experiments under various conditions. The cost efficiency of big data analysis can be increased by choosing the cluster scale whose processing time is most efficient relative to the operational cost.
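
The underlying trade-off can be sketched in a few lines: adding nodes shortens the runtime sub-linearly, so the cost per job (nodes x hourly price x hours) reaches a minimum at some cluster size. The runtimes and per-node price below are hypothetical, not the paper's measurements.

```python
def most_cost_effective(runtimes_hours, node_price_per_hour=0.25):
    """Return the cluster size with the lowest cost per job, plus all costs."""
    cost = {n: n * node_price_per_hour * t for n, t in runtimes_hours.items()}
    return min(cost, key=cost.get), cost

# Runtime shrinks sub-linearly with cluster size (hypothetical figures):
measured = {2: 4.0, 4: 2.3, 8: 1.4, 16: 1.1}
print(most_cost_effective(measured))  # -> (2, {...}): 2 nodes is cheapest here
```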

Development of G-code generating software for 3D printer in Hadoop (Hadoop에서 3D 프린팅용 G-code 생성 소프트웨어 개발)

  • Lee, Kyuyoung; Nam, Kiwon; Kim, Gunyoung; Kim, Sungsuk; Yang, Sun-Ok
    • Annual Conference of KIPS / 2017.04a / pp.78-80 / 2017
  • To print with a 3D printer, the 3D model data must first be converted into G-code. A 3D model is usually stored in the STL file format, which contains the coordinate data of the facets, typically triangles. As the 3D model becomes larger or more precise, the number of facets grows very large, and consequently the conversion from 3D model to G-code takes longer. In this paper, we develop the conversion software on Hadoop, which is in wide use. In Hadoop, a master node and several data nodes carry out the work in a Map-Reduce fashion, and these nodes can share the Hadoop Distributed File System (HDFS), which allows the work to be performed efficiently. Accordingly, this paper adapts a previously developed distributed algorithm to exploit these capabilities of the system and implements it.
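
Since the input is binary STL, the preprocessing step has to pull the facet coordinates out of that format. A minimal reader is sketched below; it follows the standard binary STL layout (80-byte header, a uint32 facet count, then 50 bytes per facet) and is not the authors' code.

```python
import struct

def read_binary_stl(path):
    """Read facets from a binary STL file as ((x,y,z), (x,y,z), (x,y,z)) triples."""
    facets = []
    with open(path, "rb") as f:
        f.read(80)                                   # skip the header
        (count,) = struct.unpack("<I", f.read(4))    # number of facets
        for _ in range(count):
            v = struct.unpack("<12f", f.read(48))    # normal + 3 vertices
            f.read(2)                                # skip attribute bytes
            facets.append((v[3:6], v[6:9], v[9:12]))
    return facets
```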

A Study on Big Data Platform Based on Hadoop for the Applications in Ship and Offshore Industry (조선 해양 산업에서의 응용을 위한 하둡 기반의 빅데이터 플랫폼 연구)

  • Kim, Seong-Hoon; Roh, Myung-Il; Kim, Ki-Su
    • Korean Journal of Computational Design and Engineering / v.21 no.3 / pp.334-340 / 2016
  • As information technology (IT) develops constantly, big data is becoming important in various industries, including the ship and offshore industry, where a great deal of data is generated. However, it is difficult to apply big data to the ship and offshore industry because there is no generalized platform for such applications. Therefore, this study presents a big data platform for the ship and offshore industry based on Hadoop, one of the most popular big data technologies. The presented platform incorporates existing shipyard data and makes it possible to manage and process that data. To check its applicability, the platform is applied to estimating the weight of offshore plant topsides. The result shows that the platform can be an effective alternative for exploiting big data in the ship and offshore industry.
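
The paper does not spell out its estimation method, so the sketch below is only one plausible way such a platform could be used: an ordinary least-squares fit of topside weight against features drawn from historical shipyard records, with entirely hypothetical features and figures.

```python
import numpy as np

# Hypothetical history: (deck area in m^2, module count) -> weight in tonnes.
X = np.array([[3200, 5], [4100, 6], [5200, 8], [6100, 9]], dtype=float)
y = np.array([9800, 12100, 15600, 17900], dtype=float)

A = np.column_stack([X, np.ones(len(X))])      # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares

new_topside = np.array([4800.0, 7.0, 1.0])     # area, modules, intercept
print(float(new_topside @ coef))               # estimated weight in tonnes
```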