• Title/Summary/Keyword: Hadoop Cluster

A Study on the Data Collection Methods based Hadoop Distributed Environment (하둡 분산 환경 기반의 데이터 수집 기법 연구)

  • Jin, Go-Whan
    • Journal of the Korea Convergence Society / v.7 no.5 / pp.1-6 / 2016
  • Many studies have recently been carried out to develop big data utilization and analysis technology. Government agencies and companies are increasingly adopting Hadoop as a processing platform for analyzing big data, and along with this growing interest in big data processing and analysis, data collection technology has become a major issue as well. However, compared with research on data analysis techniques, research on collection technology remains insufficient. Therefore, in this paper, we build a Hadoop cluster as a big data analysis platform and collect structured data from relational databases through Apache Sqoop. In addition, we provide a system that uses Apache Flume to collect unstructured data, such as sensor data, the data files of web applications, and streaming log files. The data collected through this convergence can be used as source material for big data analysis.
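
The collection pipeline described above is normally driven from the Sqoop command line and a Flume agent configuration rather than from application code. As a minimal, hedged sketch of how the two paths might be scripted together, the following Python wrapper invokes `sqoop import` and starts a `flume-ng` agent; the JDBC URL, table name, HDFS paths, and agent/config names are hypothetical examples, not values from the paper.

```python
# Minimal sketch of scripting the two collection paths (Sqoop for structured
# data, Flume for unstructured logs). All host names, table names, and HDFS
# paths below are hypothetical examples.
import subprocess

def import_structured_data():
    """Pull structured rows from an RDBMS into HDFS via Apache Sqoop."""
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical JDBC URL
        "--table", "orders",                        # hypothetical source table
        "--username", "etl",
        "--password-file", "/user/etl/.pw",
        "--target-dir", "/data/raw/orders",         # HDFS destination
        "-m", "4",                                  # number of parallel map tasks
    ], check=True)

def start_log_collection():
    """Start a Flume agent that streams web/application log files into HDFS."""
    subprocess.run([
        "flume-ng", "agent",
        "--conf", "conf",
        "--conf-file", "conf/log-agent.conf",       # source/channel/sink defined here
        "--name", "a1",
    ], check=True)

if __name__ == "__main__":
    import_structured_data()
    start_log_collection()
```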

A Scalable OWL Horst Lite Ontology Reasoning Approach based on Distributed Cluster Memories (분산 클러스터 메모리 기반 대용량 OWL Horst Lite 온톨로지 추론 기법)

  • Kim, Je-Min; Park, Young-Tack
    • Journal of KIISE / v.42 no.3 / pp.307-319 / 2015
  • Current ontology studies use the Hadoop distributed storage framework to perform map-reduce algorithm-based reasoning for scalable ontologies. In this paper, however, we propose a novel approach for scalable Web Ontology Language (OWL) Horst Lite ontology reasoning, based on distributed cluster memories. Rule-based reasoning, which is frequently used for scalable ontologies, iteratively executes triple-format ontology rules until no new data is inferred. Therefore, when scalable ontology reasoning is performed on computer hard drives, the ontology reasoner suffers from performance limitations. In order to overcome this drawback, we propose an approach that loads the ontologies into distributed cluster memories using Spark (a memory-based distributed computing framework), which then executes the ontology reasoning. In order to implement an appropriate OWL Horst Lite ontology reasoning system on Spark, our method divides the scalable ontologies into blocks, loads each block into the cluster nodes, and subsequently handles the data in the distributed memories. We used the Lehigh University Benchmark, which is used to evaluate ontology inference and search speed, to experimentally evaluate the methods suggested in this paper, which we applied to LUBM8000 (1.1 billion triples, 155 gigabytes). When compared with WebPIE, a representative MapReduce algorithm-based scalable ontology reasoner, the proposed approach showed a throughput improvement of 320% (62k/s) over WebPIE (19k/s).
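
The reasoner described above covers the full OWL Horst Lite rule set; the following PySpark fragment is only a minimal sketch of its core iterate-to-fixpoint pattern, shown with a single rule (rdfs subClassOf transitivity) standing in for the complete rule set. The input path and triple encoding are assumptions, not taken from the paper.

```python
# Minimal PySpark sketch of rule-based reasoning as an iterate-to-fixpoint loop
# over in-memory RDDs; only subClassOf transitivity is shown here.
from pyspark import SparkContext

SUBCLASS = "rdfs:subClassOf"

sc = SparkContext(appName="rule-fixpoint-sketch")
triples = sc.textFile("hdfs:///ontology/triples.nt") \
            .map(lambda line: tuple(line.split()[:3]))   # (subject, predicate, object)

sub = triples.filter(lambda t: t[1] == SUBCLASS)
while True:
    before = sub.count()
    # (A subClassOf B) and (B subClassOf C)  =>  (A subClassOf C)
    left = sub.map(lambda t: (t[2], t[0]))        # key by superclass B -> A
    right = sub.map(lambda t: (t[0], t[2]))       # key by subclass  B -> C
    derived = left.join(right).map(lambda kv: (kv[1][0], SUBCLASS, kv[1][1]))
    sub = sub.union(derived).distinct().cache()
    if sub.count() == before:                     # fixpoint: nothing new inferred
        break

print("closure size:", sub.count())
```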

Performance Analysis of Distributed Parallel Processing Schemes for Large Data in Cloud Computing (클라우드 컴퓨팅에서의 대규모 데이터를 위한 분산 병렬 처리 기법의 성능분석)

  • Hong, Seung-Tae; Chang, Jae-Woo
    • Proceedings of the Korean Association of Geographic Information Studies Conference / 2010.09a / pp.111-118 / 2010
  • Recently, research on cloud computing, which provides IT resources as services over the Internet, has been actively conducted in the IT field. Meanwhile, to provide efficient cloud computing, research on distributed data storage and distributed parallel processing techniques for storing and managing massive amounts of data across numerous servers is essential. To this end, this paper examines representative distributed parallel processing techniques and compares and analyzes them. Finally, a Hadoop-based cluster is built and used to evaluate the performance of distributed parallel processing techniques for large-scale data.
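
The paper itself is a comparative performance study; as an illustration of the kind of distributed parallel workload such a Hadoop cluster benchmark typically runs, here is a hedged sketch of a Hadoop Streaming word count in Python. The script layout and mode switch are assumptions, not taken from the paper; such a script would normally be launched through the Hadoop Streaming jar with its `-mapper` and `-reducer` options.

```python
# Classic Hadoop Streaming word count, used here only as an example workload.
# The reducer relies on Hadoop's sort phase grouping identical keys together.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```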

Design of a Large-Scale Qualitative Spatial Reasoner Based on Hadoop Clusters (하둡 클러스터 기반의 대용량 정성 공간 추론기의 설계)

  • Kim, Jonghwan; Kim, Jonghoon; Kim, Incheol
    • Proceedings of the Korea Information Processing Society Conference / 2015.10a / pp.1316-1319 / 2015
  • This paper proposes a large-scale qualitative spatial reasoner that efficiently infers topological relations between spatial objects using a Hadoop cluster system, a large-scale distributed parallel computing environment. Considering the sequential and iterative nature of the reasoning work, the proposed spatial reasoner was developed on the in-memory Apache Spark framework, which minimizes disk I/O between jobs. Accordingly, the reasoner converts the large volume of spatial knowledge to be reasoned over into Spark's distributed data set forms, PairRDD and RDD, and implements the reasoning jobs as data operations on them. In addition, the reasoner greatly improves the performance of spatial reasoning by effectively reducing the composition table required for transitive relation reasoning, which accounts for much of the reasoning time. Performance experiments with a large spatial knowledge base confirmed the high performance of the proposed qualitative spatial reasoner.
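
As a hedged sketch of the PairRDD-based composition step mentioned above, the following PySpark fragment composes pairwise spatial relations through a small composition table via a join. The toy relations and table entries are placeholders, not the reduced composition table the reasoner actually uses.

```python
# Composing spatial relations through a composition table in the PairRDD style:
# if A r1 B and B r2 C, the table gives the relation inferred between A and C.
# The tiny table and facts below are placeholders for illustration only.
from pyspark import SparkContext

composition_table = {
    ("inside", "inside"): "inside",
    ("inside", "equal"): "inside",
    ("equal", "inside"): "inside",
}

sc = SparkContext(appName="spatial-compose-sketch")
facts = sc.parallelize([
    ("A", "inside", "B"),
    ("B", "inside", "C"),
    ("C", "equal", "D"),
])

by_object = facts.map(lambda t: (t[2], (t[0], t[1])))   # key: B, value: (A, r1)
by_subject = facts.map(lambda t: (t[0], (t[2], t[1])))  # key: B, value: (C, r2)

derived = (by_object.join(by_subject)
           .map(lambda kv: (kv[1][0][0], kv[1][0][1], kv[1][1][1], kv[1][1][0]))
           .filter(lambda x: (x[1], x[2]) in composition_table)
           .map(lambda x: (x[0], composition_table[(x[1], x[2])], x[3])))

print(derived.collect())   # e.g. [('A', 'inside', 'C'), ('B', 'inside', 'D')]
```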

Implement of Job Processing Using GPU for Hadoop Environment (하둡 환경에서 GPU를 사용한 Job 처리 방법)

  • Hong, Seok-min; Yoo, Yeon-jun; Lee, Hyeop Geon; Kim, Young Woon
    • Proceedings of the Korea Information Processing Society Conference / 2022.11a / pp.77-79 / 2022
  • As IT technology advances, the volume of data worldwide increases every year. Companies that use big data platforms want even faster big data processing. Accordingly, this paper proposes a job processing method that uses GPUs in a Hadoop environment. The proposed method configures separate CPU and GPU clusters and assigns jobs, classified into three sizes, to the appropriate cluster for processing. In future work, actual implementation and performance evaluation are needed to verify the proposed method in practice.
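
Since the abstract only outlines the method, the following is a purely hypothetical sketch of the routing rule it describes: jobs are classified into three size classes and dispatched to either the CPU or the GPU cluster. The thresholds and the class-to-cluster mapping are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the dispatch rule: classify each job into one of three
# size classes and route it to the CPU or GPU cluster. Thresholds and the
# class-to-cluster mapping are assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    input_bytes: int

def classify(job: Job) -> str:
    if job.input_bytes < 1 << 30:        # < 1 GiB   -> "small"
        return "small"
    if job.input_bytes < 100 << 30:      # < 100 GiB -> "medium"
        return "medium"
    return "large"

def dispatch(job: Job) -> str:
    """Return the cluster a job should run on (assumed mapping)."""
    cluster = {"small": "cpu-cluster", "medium": "gpu-cluster", "large": "gpu-cluster"}
    return cluster[classify(job)]

if __name__ == "__main__":
    print(dispatch(Job("log-aggregation", 512 << 20)))   # -> cpu-cluster
    print(dispatch(Job("model-training", 2 << 40)))      # -> gpu-cluster
```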

Big Data Management Scheme using Property Information based on Cluster Group in adopt to Hadoop Environment (하둡 환경에 적합한 클러스터 그룹 기반 속성 정보를 이용한 빅 데이터 관리 기법)

  • Han, Kun-Hee; Jeong, Yoon-Su
    • Journal of Digital Convergence / v.13 no.9 / pp.235-242 / 2015
  • The growth of social network technology has increased interest in big data services and their development. However, it is not easy to find and extract data that is stored on distributed servers rather than on a central server. In this paper, we propose a big data management technique that minimizes the time required to retrieve desired information from the content servers and the management server that provide big data services. The proposed method classifies data into groups according to the type, features, and characteristics of the big data, links the data within each group, and applies the resulting attribute information to a hash chain. Furthermore, the time at which data stored on the distributed servers is generated and extracted is recorded in the multi-attribute information attached to the data, improving the processing speed of the data index information used for data classification. Experimental results show that the average data seek time improved by an average of 14.6% with respect to the number of cluster groups, and the data processing time was reduced by an average of 13% with respect to the number of keywords.
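
One plausible reading of the hash-chain step described above, offered only as a hedged illustration: within a cluster group, each record's attribute information is hashed together with the previous record's hash, so group membership and ordering can be checked cheaply. All field names and the encoding are illustrative assumptions, not the authors' scheme.

```python
# Illustrative hash chain over attribute information within one cluster group.
# Field names and the JSON encoding are assumptions for the sketch.
import hashlib
import json

def chain_attributes(records):
    """Attach a hash-chain value to each record's attribute information."""
    prev_hash = "0" * 64                      # genesis value for the group
    chained = []
    for rec in records:
        attrs = {
            "type": rec["type"],
            "feature": rec["feature"],
            "created_at": rec["created_at"],  # generation/extraction time
            "prev": prev_hash,
        }
        digest = hashlib.sha256(json.dumps(attrs, sort_keys=True).encode()).hexdigest()
        chained.append({**rec, "attr_hash": digest})
        prev_hash = digest
    return chained

if __name__ == "__main__":
    group = [
        {"type": "log", "feature": "web", "created_at": "2015-01-01T00:00:00"},
        {"type": "log", "feature": "web", "created_at": "2015-01-01T00:05:00"},
    ]
    for rec in chain_attributes(group):
        print(rec["attr_hash"][:16], rec["created_at"])
```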

Optimization and Performance Analysis of Cloud Computing Platform for Distributed Processing of Big Data (대용량 데이터의 분산 처리를 위한 클라우드 컴퓨팅 환경 최적화 및 성능평가)

  • Hong, Seung-Tae; Shin, Young-Sung; Chang, Jae-Woo
    • Spatial Information Research / v.19 no.4 / pp.55-71 / 2011
  • Recently, interest in cloud computing, which provides IT resources in service form, has been increasing in the IT field. As a result, much research has been done on distributed data processing that stores and manages large amounts of data across many servers. Meanwhile, in order to effectively utilize spatial data, which is increasing rapidly day by day with the growth of GIS technology, distributed processing of spatial data using cloud computing is essential. Therefore, in this paper, we review representative distributed data processing techniques and analyze the optimization requirements for improving the performance of distributed processing of large amounts of data. In addition, we use Hadoop to evaluate the performance of the distributed data processing techniques with respect to these optimization requirements.

Comparing Energy Efficiency of MPI and MapReduce on ARM based Cluster (ARM 클러스터에서 에너지 효율 향상을 위한 MPI와 MapReduce 모델 비교)

  • Maqbool, Jahanzeb; Rizki, Permata Nur; Oh, Sangyoon
    • Proceedings of the Korean Society of Computer Information Conference / 2014.01a / pp.9-13 / 2014
  • The performance of large-scale software applications has been increasing automatically for the last few decades under the influence of Moore's law: the number of transistors on a microprocessor roughly doubled every eighteen months. However, on-chip transistor limitations and heating issues led to the emergence of multicore processors. Energy-efficient ARM-based System-on-Chip (SoC) processors are being considered for future high performance computing systems. In this paper, we present a case study of two widely used parallel programming models, MPI and MapReduce, on a distributed-memory cluster of ARM SoC development boards. The case study application, the Black-Scholes option pricing equation, was parallelized and evaluated in terms of power consumption and throughput. The results show that the Hadoop implementation has lower instantaneous power consumption than the MPI implementation, but MPI outperforms the Hadoop implementation by a factor of 1.46 in terms of the ratio of total power consumption to execution time.
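
For reference, the MapReduce side of such a case study can be expressed as a map-only Hadoop Streaming job, since each option is priced independently; the following Python sketch implements the standard Black-Scholes call price. The whitespace-separated input format is an assumption, and this is not the authors' benchmark code.

```python
# Map-only Hadoop Streaming sketch: each mapper prices its share of European
# call options with Black-Scholes. Input lines are assumed to hold "S K T r sigma".
import sys
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        S, K, T, r, sigma = map(float, line.split())
        print(f"{black_scholes_call(S, K, T, r, sigma):.6f}")
```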

Pre-arrangement Based Task Scheduling Scheme for Reducing MapReduce Job Processing Time (MapReduce 작업처리시간 단축을 위한 선 정렬 기반 태스크 스케줄링 기법)

  • Park, Jung Hyo; Kim, Jun Sang; Kim, Chang Hyeon; Lee, Won Joo; Jeon, Chang Ho
    • Journal of the Korea Society of Computer and Information / v.18 no.11 / pp.23-30 / 2013
  • In this paper, we propose a pre-arrangement based task scheduling scheme to reduce MapReduce job processing time. If a task and the data it processes are not located on the same node, the data must be transmitted to the node where the task is allocated. In that case, the job processing time increases owing to the data transmission time. To avoid this, we schedule tasks in two steps. In the first step, tasks are sorted so that tasks with high data locality come first. In the second step, tasks are exchanged to improve their data locality, based on the location information of the data. In the performance evaluation, we compare the proposed method implemented on Hadoop with default Hadoop on a small Hadoop cluster, in terms of job processing time and the number of tasks allocated to nodes that do not hold the data they process. The results show that the proposed method lowers job processing time by around 18%. We also confirm that the number of tasks allocated to nodes without their data decreases by around 25%.
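
As a hedged illustration of the two-step idea described above (sort tasks by data locality, then exchange assignments using the data's location information), the following simplified Python sketch assigns one task per node slot and then performs pairwise exchanges; the data structures and locality scoring are illustrative assumptions, not the authors' Hadoop modification.

```python
# Simplified two-step sketch: (1) place tasks, considering the hardest-to-place
# (fewest replica locations) first, (2) swap assignments when an exchange lets
# both tasks run on nodes that already hold their data.
# Assumes one free slot per node and len(tasks) <= len(nodes).
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    data_nodes: set          # nodes holding this task's input block
    assigned_node: str = ""

def is_local(task: Task) -> bool:
    return task.assigned_node in task.data_nodes

def schedule(tasks, nodes):
    # Step 1: sort by how constrained each task is, then place greedily.
    tasks = sorted(tasks, key=lambda t: len(t.data_nodes))
    slots = list(nodes)
    for task in tasks:
        local = [n for n in slots if n in task.data_nodes]
        task.assigned_node = (local or slots)[0]
        slots.remove(task.assigned_node)

    # Step 2: pairwise exchange when swapping makes a non-local task local
    # without breaking the other task's locality.
    for a in tasks:
        for b in tasks:
            if not is_local(a) and b.assigned_node in a.data_nodes \
               and a.assigned_node in b.data_nodes:
                a.assigned_node, b.assigned_node = b.assigned_node, a.assigned_node
    return tasks

if __name__ == "__main__":
    ts = schedule([Task("t1", {"n1"}), Task("t2", {"n1", "n2"})], ["n1", "n2"])
    print([(t.task_id, t.assigned_node, is_local(t)) for t in ts])
```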

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik; Jang, Miyoung; Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • As distributed computing platforms such as Hadoop MapReduce have been developed, it has become necessary to efficiently perform, in distributed computing environments, the conventional query processing techniques that used to run on a single machine. In particular, studies have been conducted on similarity join query processing in distributed computing environments, where a similarity join retrieves all data pairs with high similarity between two given data sets. However, the existing similarity join query processing schemes for distributed computing environments suffer from a skewed computing load among clusters because they consider only the data transmission cost. In this paper, we propose a Matrix-based Load-balancing Algorithm for efficient similarity join query processing in distributed computing environments. To balance the load across clusters uniformly, the proposed algorithm estimates the expected computing cost using a matrix and generates partitions based on the estimated cost. In addition, it reduces the computing load by filtering out data that are not used in query processing in each cluster. Finally, our performance evaluation shows that the proposed algorithm outperforms the existing scheme in query processing performance.
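
The following Python fragment is a simplified, hedged reading of the matrix-based idea described above: estimate the cost of each cell of a bucket-by-bucket matrix as the number of candidate pairs it would compare, filter out cells that cannot contain similar pairs, and greedily assign the remaining cells to reducers so the estimated loads stay balanced. The length-based bucketing and the filter rule are illustrative stand-ins for the paper's scheme.

```python
# Estimate per-cell join cost with a bucket-by-bucket matrix, filter cells that
# cannot hold similar pairs, then greedily balance cells across reducers.
# The length-based bucketing and gap filter are illustrative assumptions.
from collections import Counter
import heapq

def bucket(record: str) -> int:
    return len(record) // 5            # assumed signature: length bucket of width 5

def build_cost_matrix(set_r, set_s):
    r_hist, s_hist = Counter(map(bucket, set_r)), Counter(map(bucket, set_s))
    # cost of cell (i, j) ~ number of candidate pairs it would compare
    return {(i, j): r_hist[i] * s_hist[j] for i in r_hist for j in s_hist}

def assign_cells(cost_matrix, num_reducers, max_bucket_gap=1):
    # Filter: records in distant length buckets cannot be highly similar (assumed rule).
    cells = {c: w for c, w in cost_matrix.items() if abs(c[0] - c[1]) <= max_bucket_gap}
    heap = [(0, r) for r in range(num_reducers)]     # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for cell, weight in sorted(cells.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)          # least-loaded reducer first
        assignment[cell] = reducer
        heapq.heappush(heap, (load + weight, reducer))
    return assignment

if __name__ == "__main__":
    R = ["apple", "banana", "cherry", "fig"]
    S = ["apples", "grape", "kiwi", "melon"]
    print(assign_cells(build_cost_matrix(R, S), num_reducers=2))
```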