• Title/Summary/Keyword: Big data, Hadoop

Search Result 200, Processing Time 0.042 seconds

Rhipe Platform for Big Data Processing and Analysis (빅데이터 처리 및 분석을 위한 Rhipe 플랫폼)

  • Jung, Byung Ho;Shin, Ji Eun;Lim, Dong Hoon
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.7
    • /
    • pp.1171-1185
    • /
    • 2014
  • Rhipe that integrates R and Hadoop environment, made it possible to process and analyze massive amounts of data using a distributed processing environment. In this paper, we implemented multiple regression analysis using Rhipe with various data sizes of actual data and simulated data. Experimental results for comparing the computing speeds of pseudo-distributed and fully-distributed modes for configuring Hadoop cluster, showed fully-distributed mode was more fast than pseudo-distributed mode and computing speeds of fully-distributed mode were faster as the number of data nodes increases. We also compared the performance of our Rhipe with stats and biglm packages available on bigmemory. The results showed that our Rhipe was more fast than other packages owing to paralleling processing with increasing the number of map tasks as the size of data increases.

A Survey on the Performance Comparison of Map Reduce Technologies and the Architectural Improvement of Spark

  • Raghavendra, GS;Manasa, Bezwada;Vasavi, M.
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.5
    • /
    • pp.121-126
    • /
    • 2022
  • Hadoop and Apache Spark are Apache Software Foundation open source projects, and both of them are premier large data analytic tools. Hadoop has led the big data industry for five years. The processing velocity of the Spark can be significantly different, up to 100 times quicker. However, the amount of data handled varies: Hadoop Map Reduce can process data sets that are far bigger than Spark. This article compares the performance of both spark and map and discusses the advantages and disadvantages of both above-noted technologies.

On Implementing a Learning Environment for Big Data Processing using Raspberry Pi (라즈베리파이를 이용한 빅 데이터 처리 학습 환경 구축)

  • Hwang, Boram;Kim, Seonggyu
    • Journal of Digital Convergence
    • /
    • v.14 no.4
    • /
    • pp.251-258
    • /
    • 2016
  • Big data processing is a broad term for processing data sets so large or complex that traditional data processing applications are inadequate. Widespread use of smart devices results in a huge impact on the way we process data. Many organizations are contemplating how to incorporate or integrate those devices into their enterprise data systems. We have proposed a way to process big data by way of integrating Raspberry Pi into a Hadoop cluster as a computational grid. We have then shown the efficiency through several experiments and the ease of scaling of the proposed system.

A Parallel HDFS and MapReduce Functions for Emotion Analysis (감성분석을 위한 병렬적 HDFS와 맵리듀스 함수)

  • Back, BongHyun;Ryoo, Yun-Kyoo
    • Journal of the Korea society of information convergence
    • /
    • v.7 no.2
    • /
    • pp.49-57
    • /
    • 2014
  • Recently, opinion mining is introduced to extract useful information from SNS data and to evaluate the true intention of users. Opinion mining are required several efficient techniques to collect and analyze a large amount of SNS data and extract meaningful data from them. Therefore in this paper, we propose a parallel HDFS(Hadoop Distributed File System) and emotion functions based on Mapreduce to extract some emotional information of users from various unstructured big data on social networks. The experiment results have verified that the proposed system and functions perform faster than O(n) for data gathering time and loading time, and maintain stable load balancing for memory and CPU resources.

  • PDF

A Licence Plate Recognition System using Hadoop (하둡을 이용한 번호판 인식 시스템)

  • Park, Jin-Woo;Park, Ho-Hyun
    • Journal of IKEEE
    • /
    • v.21 no.2
    • /
    • pp.142-145
    • /
    • 2017
  • Currently, a trend in image processing is high-quality and high-resolution. The size and amount of image data are increasing exponentially because of the development of information and communication technology. Thus, license plate recognition with a single processor cannot handle the increasing data. This paper proposes a number plate recognition system using a distributed processing framework, Hadoop. Using SequenceFile format in Hadoop, each mapper performs a license plate recognition with a number of image data in a data block Experimental results show that license plate recognition performance with 16 data nodes accomplishes speedup of maximum 14.7 times comparing with one data node. In large dataset, the recognition performance is robust even if the number of data nodes increases gradually.

Learning algorithms for big data logistic regression on RHIPE platform (RHIPE 플랫폼에서 빅데이터 로지스틱 회귀를 위한 학습 알고리즘)

  • Jung, Byung Ho;Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society
    • /
    • v.27 no.4
    • /
    • pp.911-923
    • /
    • 2016
  • Machine learning becomes increasingly important in the big data era. Logistic regression is a type of classification in machine leaning, and has been widely used in various fields, including medicine, economics, marketing, and social sciences. Rhipe that integrates R and Hadoop environment, has not been discussed by many researchers owing to the difficulty of its installation and MapReduce implementation. In this paper, we present the MapReduce implementation of Gradient Descent algorithm and Newton-Raphson algorithm for logistic regression using Rhipe. The Newton-Raphson algorithm does not require a learning rate, while Gradient Descent algorithm needs to manually pick a learning rate. We choose the learning rate by performing the mixed procedure of grid search and binary search for processing big data efficiently. In the performance study, our Newton-Raphson algorithm outpeforms Gradient Descent algorithm in all the tested data.

MapReduce-Based Partitioner Big Data Analysis Scheme for Processing Rate of Log Analysis (로그 분석 처리율 향상을 위한 맵리듀스 기반 분할 빅데이터 분석 기법)

  • Lee, Hyeopgeon;Kim, Young-Woon;Park, Jiyong;Lee, Jin-Woo
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.11 no.5
    • /
    • pp.593-600
    • /
    • 2018
  • Owing to the advancement of Internet and smart devices, access to various media such as social media became easy; thus, a large amount of big data is being produced. Particularly, the companies that provide various Internet services are analyzing the big data by using the MapReduce-based big data analysis techniques to investigate the customer preferences and patterns and strengthen the security. However, with MapReduce, when the big data is analyzed by defining the number of reducer objects generated in the reduce stage as one, the processing rate of big data analysis decreases. Therefore, in this paper, a MapReduce-based split big data analysis method is proposed to improve the log analysis processing rate. The proposed method separates the reducer partitioning stage and the analysis result combining stage and improves the big data processing rate by decreasing the bottleneck phenomenon by generating the number of reducer objects dynamically.

The Efficient Method of Parallel Genetic Algorithm using MapReduce of Big Data (빅 데이터의 MapReduce를 이용한 효율적인 병렬 유전자 알고리즘 기법)

  • Hong, Sung-Sam;Han, Myung-Mook
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.23 no.5
    • /
    • pp.385-391
    • /
    • 2013
  • Big Data is data of big size which is not processed, collected, stored, searched, analyzed by the existing database management system. The parallel genetic algorithm using the Hadoop for BigData technology is easily realized by implementing GA(Genetic Algorithm) using MapReduce in the Hadoop Distribution System. The previous study that the genetic algorithm using MapReduce is proposed suitable transforming for the GA by MapReduce. However, they did not show good performance because of frequently occurring data input and output. In this paper, we proposed the MRPGA(MapReduce Parallel Genetic Algorithm) using improvement Map and Reduce process and the parallel processing characteristic of MapReduce. The optimal solution can be found by using the topology, migration of parallel genetic algorithm and local search algorithm. The convergence speed of the proposal method is 1.5 times faster than that of the existing MapReduce SGA, and is the optimal solution can be found quickly by the number of sub-generation iteration. In addition, the MRPGA is able to improve the processing and analysis performance of Big Data technology.

A Novel Node Management in Hadoop Cluster by using DNA

  • Balaraju. J;PVRD. Prasada Rao
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.9
    • /
    • pp.134-140
    • /
    • 2023
  • The distributed system is playing a vital role in storing and processing big data and data generation is speedily increasing from various sources every second. Hadoop has a scalable, and efficient distributed system supporting commodity hardware by combining different networks in the topographical locality. Node support in the Hadoop cluster is rapidly increasing in different versions which are facing difficulty to manage clusters. Hadoop does not provide Node management, adding and deletion node futures. Node identification in a cluster completely depends on DHCP servers which managing IP addresses, hostname based on the physical address (MAC) address of each Node. There is a scope to the hacker to theft the data using IP or Hostname and creating a disturbance in a distributed system by adding a malicious node, assigning duplicate IP. This paper proposing novel node management for the distributed system using DNA hiding and generating a unique key using a unique physical address (MAC) of each node and hostname. The proposed mechanism is providing better node management for the Hadoop cluster providing adding and deletion node mechanism by using limited computations and providing better node security from hackers. The main target of this paper is to propose an algorithm to implement Node information hiding in DNA sequences to increase and provide security to the node from hackers.

Analysis of Factors for Korean Women's Cancer Screening through Hadoop-Based Public Medical Information Big Data Analysis (Hadoop기반의 공개의료정보 빅 데이터 분석을 통한 한국여성암 검진 요인분석 서비스)

  • Park, Min-hee;Cho, Young-bok;Kim, So Young;Park, Jong-bae;Park, Jong-hyock
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.22 no.10
    • /
    • pp.1277-1286
    • /
    • 2018
  • In this paper, we provide flexible scalability of computing resources in cloud environment and Apache Hadoop based cloud environment for analysis of public medical information big data. In fact, it includes the ability to quickly and flexibly extend storage, memory, and other resources in a situation where log data accumulates or grows over time. In addition, when real-time analysis of accumulated unstructured log data is required, the system adopts Hadoop-based analysis module to overcome the processing limit of existing analysis tools. Therefore, it provides a function to perform parallel distributed processing of a large amount of log data quickly and reliably. Perform frequency analysis and chi-square test for big data analysis. In addition, multivariate logistic regression analysis of significance level 0.05 and multivariate logistic regression analysis of meaningful variables (p<0.05) were performed. Multivariate logistic regression analysis was performed for each model 3.