• Title/Summary/Keyword: MapReduce

Big Numeric Data Classification Using Grid-based Bayesian Inference in the MapReduce Framework

  • Kim, Young Joon;Lee, Keon Myung
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.14 no.4
    • /
    • pp.313-321
    • /
    • 2014
  • In the current era of data-intensive services, handling big data is a crucial issue that affects almost every discipline and industry. In this study, we propose a classification method for large volumes of numeric data, implemented in the MapReduce distributed programming framework. The proposed method partitions the data space into a grid structure and then models the probability distributions of the classes for the grid cells by collecting sufficient statistics using distributed MapReduce tasks. The class labeling of new data is achieved by k-nearest-neighbor classification based on Bayesian inference.
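
A minimal sketch of the general idea, not the authors' implementation: the mapper emits one count per (grid cell, class) pair, the reducer accumulates these sufficient statistics, and a new point is labeled by a Bayesian vote over its k nearest grid cells. The grid resolution, feature range, and toy data are assumptions for illustration.

```python
# Illustrative sketch: grid-based class statistics collected map/reduce style,
# then Bayesian k-nearest-cell labeling of a new point.
from collections import defaultdict

GRID = 4                               # assumed cells per dimension
def cell_of(x, lo=0.0, hi=1.0):
    return tuple(min(GRID - 1, int((v - lo) / (hi - lo) * GRID)) for v in x)

def map_phase(records):                # record = (features, class_label)
    for x, y in records:
        yield (cell_of(x), y), 1       # emit one count per (cell, class)

def reduce_phase(pairs):
    counts = defaultdict(int)
    for key, v in pairs:
        counts[key] += v               # sufficient statistics per grid cell
    return counts

def classify(x, counts, k=3):
    # k nearest grid cells by squared distance, then a class-count vote
    centers = {}
    for (cell, label), c in counts.items():
        centers.setdefault(cell, defaultdict(int))[label] += c
    target = cell_of(x)
    nearest = sorted(centers,
                     key=lambda c: sum((a - b) ** 2 for a, b in zip(c, target)))[:k]
    votes = defaultdict(int)
    for cell in nearest:
        for label, c in centers[cell].items():
            votes[label] += c
    return max(votes, key=votes.get)

data = [((0.1, 0.2), "A"), ((0.15, 0.25), "A"), ((0.8, 0.9), "B")]
stats = reduce_phase(map_phase(data))
print(classify((0.12, 0.22), stats))   # -> "A"
```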

Task Assignment Policy for Hadoop Considering Availability of Nodes (노드의 가용성을 고려한 하둡 태스크 할당 정책)

  • Ryu, Wooseok
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2017.05a
    • /
    • pp.103-105
    • /
    • 2017
  • Hadoop MapReduce is a processing framework in which users' jobs can be processed efficiently, in a parallel and distributed way, on a Hadoop cluster. MapReduce task schedulers select target nodes and assign users' tasks to them. Previous schedulers cannot fully utilize the resources of the Hadoop cluster because they do not consider the cluster's dynamic characteristics arising from node availability. To increase the utilization of the Hadoop cluster, this paper proposes a novel task assignment policy for MapReduce that efficiently assigns a job's tasks to a dynamic cluster by considering the availability of each node.
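
A hypothetical sketch of availability-aware assignment, not the policy proposed in the paper: tasks are greedily placed on the node whose current load is smallest relative to its availability. The node names and availability scores are illustrative.

```python
# Hypothetical availability-weighted task assignment (illustration only).
def assign_tasks(tasks, nodes):
    """nodes: dict node_name -> availability in [0, 1]."""
    load = {n: 0 for n in nodes}
    plan = {}
    for t in tasks:
        # pick the node whose load is smallest relative to its availability
        best = min(nodes, key=lambda n: (load[n] + 1) / max(nodes[n], 1e-9))
        plan[t] = best
        load[best] += 1
    return plan

nodes = {"node1": 0.9, "node2": 0.5, "node3": 0.2}
print(assign_tasks([f"task{i}" for i in range(8)], nodes))
```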

Preprocessor of Scientific Experimental Data for MapReduce based Data Analysis (MapReduce 기반 데이터분석을 위한 과학실험데이터 전처리기)

  • Kang, Yun-Hee;Kang, Kyung-woo;Kung, Sang-wang;Jang, Haeng-Jin
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.11a
    • /
    • pp.118-120
    • /
    • 2012
  • This paper addresses a preprocessing step for analyzing climate simulation results with the MapReduce framework. To this end, we design a scientific-data filter that extracts selected variables from the climate simulation result data sets, transforms the data, and stores the transformed data in HDFS. The data stored through this filter are then aggregated into yearly statistics in a distributed, parallel manner by a Hadoop MapReduce application.
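
A toy sketch of the described pipeline under an assumed record layout (the real data are climate simulation outputs stored in HDFS): a filter extracts one variable per record, and map/reduce-style steps aggregate yearly statistics.

```python
# Illustrative preprocessing filter + yearly aggregation (assumed layout).
from collections import defaultdict

raw = [
    {"date": "2001-01-15", "temp": 3.1, "humidity": 0.7},
    {"date": "2001-07-02", "temp": 24.5, "humidity": 0.5},
    {"date": "2002-07-09", "temp": 25.0, "humidity": 0.6},
]

def filter_phase(records, variable="temp"):
    # preprocessing filter: keep only (year, selected variable)
    for r in records:
        yield r["date"][:4], r[variable]

def map_phase(pairs):
    for year, value in pairs:
        yield year, (value, 1)          # partial sums for the mean

def reduce_phase(pairs):
    acc = defaultdict(lambda: [0.0, 0])
    for year, (s, c) in pairs:
        acc[year][0] += s
        acc[year][1] += c
    return {year: s / c for year, (s, c) in acc.items()}

print(reduce_phase(map_phase(filter_phase(raw))))   # yearly mean temperature
```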

Distributed Incremental Approximate Frequent Itemset Mining Using MapReduce

  • Mohsin Shaikh;Irfan Ali Tunio;Syed Muhammad Shehram Shah;Fareesa Khan Sohu;Abdul Aziz;Ahmad Ali
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.5
    • /
    • pp.207-211
    • /
    • 2023
  • Traditional data mining methods typically assume that the data are small, centralized, memory-resident, and static. This assumption is no longer acceptable, because datasets are growing very fast and becoming huge over time, so there is a rapidly growing need to manage data with efficient mining algorithms. In such a scenario it is inevitable to carry out data mining in a distributed environment, and Frequent Itemset Mining (FIM) is no exception. Thus the need for an efficient incremental mining algorithm arises. We propose Distributed Incremental Approximate Frequent Itemset Mining (DIAFIM), an incremental FIM algorithm that works in the distributed, parallel MapReduce environment. The key contribution of this research is devising an incremental mining algorithm that works in the distributed, parallel MapReduce environment.
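
A toy sketch of incremental itemset counting in a map/reduce style, not the DIAFIM algorithm itself: mappers emit candidate itemsets per transaction, the reducer sums the counts, and counts from a new batch are merged into the previously reduced counts. The maximum itemset size and support threshold are assumptions.

```python
# Illustrative incremental frequent-itemset counting (not DIAFIM itself).
from collections import Counter
from itertools import combinations

def map_phase(transactions, max_size=2):
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                yield itemset, 1

def reduce_phase(pairs, base=None):
    counts = Counter(base or {})
    for itemset, c in pairs:
        counts[itemset] += c            # merging makes the count incremental
    return counts

old = reduce_phase(map_phase([["a", "b"], ["a", "c"]]))
new = reduce_phase(map_phase([["a", "b", "c"]]), base=old)   # new batch only
frequent = {s: c for s, c in new.items() if c >= 2}          # assumed support
print(frequent)
```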

High-Speed Self-Organizing Map for Document Clustering

  • Rojanavasu, Ponthap;Pinngern, Ouen
    • Institute of Control, Robotics and Systems Conference Proceedings (제어로봇시스템학회 학술대회논문집)
    • /
    • 2003.10a
    • /
    • pp.1056-1059
    • /
    • 2003
  • The Self-Organizing Map (SOM) is an unsupervised neural network that provides cluster analysis of high-dimensional input data. The output of the SOM is represented as a map that helps us explore the data. The weak point of the conventional SOM is that training takes a long time when the map is large: finding the winning node costs O(MN), where M and N are the numbers of nodes along the width and height of the map. This paper presents a new method that reduces the computing time by creating a new map. Each node in the new map is the centroid of a group of nodes in the original map. After creating the new map, we find its winning node, and then search for the winning node of the original map only among the nodes represented by the winning node of the new map. This method is called the "High-Speed Self-Organizing Map" (HS-SOM). Our experiments use HS-SOM to cluster documents and compare it with the conventional SOM. The results show that HS-SOM reduces computing time by 30%-50% compared with the conventional SOM.
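
A rough sketch of the two-level winner search, assuming a flattened list of map nodes and a fixed grouping (the real HS-SOM works on a 2-D map whose coarse nodes are centroids of node groups): the coarse map is searched first, then only the members of the winning group are compared in the original map.

```python
# Illustrative two-level winner search (assumed flat node list and grouping).
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((16, 8))                 # 16 map nodes, 8-dim weights
groups = np.array_split(np.arange(16), 4)      # 4 groups of original nodes
coarse = np.stack([original[g].mean(axis=0) for g in groups])   # centroid map

def winner(x):
    g = int(np.argmin(((coarse - x) ** 2).sum(axis=1)))          # coarse search
    members = groups[g]
    local = int(np.argmin(((original[members] - x) ** 2).sum(axis=1)))
    return members[local]                                        # fine search

x = rng.random(8)
print(winner(x))       # index of the winning node in the original map
```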

Analysis of big data using Rhipe (Rhipe를 활용한 빅데이터 처리 및 분석)

  • Ko, Youngjun;Kim, Jinseog
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.5
    • /
    • pp.975-987
    • /
    • 2013
  • The Hadoop system was developed by the Apache Foundation based on Google's GFS and MapReduce technologies. Many modern systems for managing and processing big data have been developed on top of Hadoop because it was designed for scalability and distributed computing. The R software is considered a well-suited analytic tool for Hadoop-based systems because R interoperates with other languages and offers many libraries for complex analyses. We introduce Rhipe, an R package that makes MapReduce programming easy under the Hadoop system, and implement a MapReduce program using Rhipe, in particular for multiple regression. In addition, we compare the computing speed of our program with that of other packages for processing large data (ff and bigmemory). The simulation results show that our program is faster than ff and bigmemory as the size of the data increases.
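
A minimal NumPy sketch of MapReduce-style multiple regression via the normal equations, not Rhipe/R code: each mapper emits its block's sufficient statistics X'X and X'y, and the reducer sums them and solves for the coefficients. The data split and dimensions are illustrative.

```python
# Illustrative MapReduce-style regression via sufficient statistics.
import numpy as np

def map_phase(block_X, block_y):
    # each mapper emits the sufficient statistics of its block
    return block_X.T @ block_X, block_X.T @ block_y

def reduce_phase(partials):
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)        # regression coefficients

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.random((100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(0, 0.01, 100)
blocks = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]
print(reduce_phase([map_phase(bx, by) for bx, by in blocks]))  # ~[1, 2, -3]
```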

Efficient Computation of Data Cubes Using MapReduce (맵리듀스를 사용한 데이터 큐브의 효율적인 계산 기법)

  • Lee, Ki Yong;Park, Sojeong;Park, Eunju;Park, Jinkyung;Choi, Yeunjung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.3 no.11
    • /
    • pp.479-486
    • /
    • 2014
  • MapReduce is a programming model used for processing a large amount of data in parallel. For analyzing a large amount of data, the data cube is widely used; it is an operator that computes group-bys for all possible combinations of the given dimension attributes. When the number of dimension attributes is n, the data cube computes $2^n$ group-bys. In this paper, we propose an efficient method for computing data cubes using MapReduce. The proposed method partitions the $2^n$ group-bys into $_nC_{\lceil n/2 \rceil}$ batches and computes those batches in stages using $\lceil n/2 \rceil$ MapReduce jobs. Compared to the existing methods, the proposed method significantly reduces the amount of intermediate data generated by the mappers, so the cost of sorting and transferring that intermediate data is reduced significantly. Consequently, the total processing time for computing a data cube is reduced. Through experiments, we show the efficiency of the proposed method over the existing methods.
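
A toy in-memory sketch of what a data cube computes (not the authors' batch scheduling): it enumerates all $2^n$ group-bys over the dimension attributes and aggregates a measure for each, which in MapReduce would correspond to mappers emitting group keys and reducers summing. The dimensions, rows, and measure are illustrative.

```python
# Illustrative enumeration of the 2^n group-bys of a data cube.
from collections import defaultdict
from itertools import combinations

dims = ["region", "product", "year"]                 # n = 3 -> 2^3 group-bys
rows = [
    {"region": "EU", "product": "A", "year": 2014, "sales": 10},
    {"region": "EU", "product": "B", "year": 2014, "sales": 5},
    {"region": "US", "product": "A", "year": 2013, "sales": 7},
]

cube = {}
for k in range(len(dims) + 1):
    for group_by in combinations(dims, k):           # one of the 2^n group-bys
        agg = defaultdict(int)
        for r in rows:
            key = tuple(r[d] for d in group_by)
            agg[key] += r["sales"]                    # mapper emit + reducer sum
        cube[group_by] = dict(agg)

print(cube[("region",)])      # {('EU',): 15, ('US',): 7}
```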

Implement of MapReduce-based Big Data Processing Scheme for Reducing Big Data Processing Delay Time and Store Data (빅데이터 처리시간 감소와 저장 효율성이 향상을 위한 맵리듀스 기반 빅데이터 처리 기법 구현)

  • Lee, Hyeopgeon;Kim, Young-Woon;Kim, Ki-Young
    • Journal of the Korea Convergence Society
    • /
    • v.9 no.10
    • /
    • pp.13-19
    • /
    • 2018
  • MapReduce, the essential core technology of Hadoop, is most commonly used to process big data stored on the Hadoop distributed file system. However, existing MapReduce-based big data processing techniques divide and store files in blocks predefined by the Hadoop distributed file system, thus wasting huge infrastructure resources. Therefore, in this paper, we propose an efficient MapReduce-based big data processing scheme. The proposed method improves the storage efficiency of a big data infrastructure environment by converting and compressing the data to be processed, in advance, into a data format suitable for processing by MapReduce. In addition, the proposed method addresses the data processing delay that arises when the implementation focuses only on storage efficiency.
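
A loose sketch of the general idea under assumed formats, not the paper's scheme: verbose records are converted into a compact delimited layout and compressed before the processing pass reads them back.

```python
# Illustrative pre-conversion and compression before a processing pass.
import gzip

records = [{"user": "u1", "action": "view", "ms": 120},
           {"user": "u2", "action": "click", "ms": 45}]

with gzip.open("events.tsv.gz", "wt") as f:          # pre-converted, compressed
    for r in records:
        f.write(f"{r['user']}\t{r['action']}\t{r['ms']}\n")

total = 0
with gzip.open("events.tsv.gz", "rt") as f:          # processing pass
    for line in f:
        _, _, ms = line.rstrip("\n").split("\t")
        total += int(ms)
print(total)    # 165
```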

An Efficient Data Replacement Algorithm for Performance Optimization of MapReduce in Non-dedicated Distributed Computing Environments (비-전용 분산 컴퓨팅 환경에서 맵-리듀스 처리 성능 최적화를 위한 효율적인 데이터 재배치 알고리즘)

  • Ryu, Eunkyung;Son, Ingook;Park, Junho;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.9
    • /
    • pp.20-27
    • /
    • 2013
  • In recent years, with the growth of social media and the development of mobile devices, the amount of data has increased significantly. MapReduce is an emerging programming model that processes large amounts of data. However, because MapReduce places the data evenly on the assumption of a dedicated distributed computing environment, it is not well suited to non-dedicated distributed computing environments. Data replacement algorithms have been proposed for optimizing the performance of MapReduce in non-dedicated distributed computing environments. However, they spend much time on data replacement and cause network load through unnecessary data transmission. In this paper, we propose an efficient data replacement algorithm for optimizing the performance of MapReduce in non-dedicated distributed computing environments. The proposed scheme computes the ratio of data blocks on the nodes based on a node availability model and reduces the network load by transmitting data blocks with the existing data placement taken into account. Our experimental results show that the proposed scheme outperforms the existing scheme.
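
A hypothetical sketch of availability-proportional block placement, not the paper's replacement algorithm: each node receives a share of the data blocks proportional to its availability. The node names and availability values are illustrative.

```python
# Hypothetical availability-proportional block placement (illustration only).
def place_blocks(num_blocks, availability):
    total = sum(availability.values())
    quota = {n: round(num_blocks * a / total) for n, a in availability.items()}
    placement = {}
    node_iter = sorted(quota, key=quota.get, reverse=True)
    b = 0
    for n in node_iter:
        for _ in range(quota[n]):
            if b < num_blocks:
                placement[f"block{b}"] = n
                b += 1
    while b < num_blocks:                 # leftover blocks from rounding
        placement[f"block{b}"] = node_iter[0]
        b += 1
    return placement

print(place_blocks(10, {"node1": 0.9, "node2": 0.6, "node3": 0.3}))
```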

Improving Join Performance for SPARQL Query Processing in the Clouds (클라우드에서 SPARQL 질의 처리를 위한 조인 성능 향상)

  • Choi, Gyu-Jin;Son, Yun-Hee;Lee, Kyu-Chul
    • Journal of KIISE
    • /
    • v.43 no.6
    • /
    • pp.700-709
    • /
    • 2016
  • Recently, with the rapid growth of LOD (Linked Open Data), existing methods based on a single machine have shown limited performance. Existing solutions use distributed frameworks such as MapReduce to improve performance. However, processing SPARQL queries with the MapReduce framework involves multiple MapReduce jobs and therefore incurs additional costs; the problem of processing unnecessary data also arises. In this study, we propose a method that reduces the number of MapReduce jobs during SPARQL query processing, together with bitmap-based join indexes that minimize the cost of processing unnecessary data.
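
A toy in-memory sketch of a bitmap-style join index over RDF triples, not the paper's MapReduce implementation: one bit per subject marks whether it matches a triple pattern, and the join of two patterns is a bitwise AND, so subjects that cannot join are never scanned.

```python
# Illustrative bitmap join index over RDF triples (toy, in memory).
triples = [
    ("s1", "worksAt", "o1"), ("s1", "livesIn", "o2"),
    ("s2", "worksAt", "o1"), ("s3", "livesIn", "o2"),
]

subjects = sorted({s for s, _, _ in triples})

def bitmap(predicate):
    # one bit per subject: does this subject appear with the predicate?
    return [any(s == subj and p == predicate for s, p, _ in triples)
            for subj in subjects]

works, lives = bitmap("worksAt"), bitmap("livesIn")
joined = [subj for subj, w, l in zip(subjects, works, lives) if w and l]
print(joined)    # subjects matching both triple patterns: ['s1']
```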