• Title/Summary/Keyword: Hadoop framework

Sequential Pattern Mining with Optimization Calling MapReduce Function on MapReduce Framework (맵리듀스 프레임웍 상에서 맵리듀스 함수 호출을 최적화하는 순차 패턴 마이닝 기법)

  • Kim, Jin-Hyun; Shim, Kyu-Seok
    • The KIPS Transactions: Part D, v.18D no.2, pp.81-88, 2011
  • Sequential pattern mining, which finds the frequent patterns appearing in a given set of sequences, is an important data mining problem with broad applications. For example, sequential pattern mining can discover web access patterns, customers' purchase patterns, and DNA sequences related to specific diseases. In this paper, we develop sequential pattern mining algorithms using the MapReduce framework. Our algorithms distribute the input data across several machines and find frequent sequential patterns in parallel. With synthetic data sets, we conducted a comprehensive performance study while varying several parameters. Our experimental results show that our algorithms achieve linear speedup as the number of machines increases.
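
The map/reduce decomposition the abstract describes can be illustrated with a minimal Hadoop Streaming-style sketch (not the authors' optimized algorithm): the mapper emits length-2 candidate subsequences and the reducer sums their support. The input format, the candidate length, and the MIN_SUPPORT threshold are all assumptions for illustration.

```python
#!/usr/bin/env python3
# mapper.py -- emit each length-2 candidate subsequence once per input sequence.
# Assumed input: one whitespace-separated sequence of items per line.
import sys
from itertools import combinations

for line in sys.stdin:
    items = line.split()
    # combinations() preserves order, so (a, b) means "a occurs before b";
    # the set ensures each sequence supports a candidate at most once.
    for a, b in set(combinations(items, 2)):
        print(f"{a},{b}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the support of each candidate over the sorted mapper output
# and keep those meeting an (assumed) minimum support threshold.
import sys

MIN_SUPPORT = 2
current, count = None, 0
for line in sys.stdin:
    key, _, val = line.rstrip("\n").partition("\t")
    if key != current:
        if current is not None and count >= MIN_SUPPORT:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(val)
if current is not None and count >= MIN_SUPPORT:
    print(f"{current}\t{count}")
```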

Big Data Preprocessing for Predicting Box Office Success (영화 흥행 실적 예측을 위한 빅데이터 전처리)

  • Jun, Hee-Gook; Hyun, Geun-Soo; Lim, Kyung-Bin; Lee, Woo-Hyun; Kim, Hyoung-Joo
    • KIISE Transactions on Computing Practices, v.20 no.12, pp.615-622, 2014
  • The Korean film market has rapidly grown to an international scale, creating a need for decision-making based on more precise and appropriate analytical methods. Today's highly advanced information environment produces an overwhelming amount of data in real time, and this data must be properly handled and analyzed in order to extract useful information. In particular, the preprocessing of large data sets, which is the most time-consuming step, should be completed in a reasonable amount of time. In this paper, we investigate a big data preprocessing method for predicting movie box office success. We analyze the characteristics of the movie data to design specialized preprocessing methods and implement them on the Hadoop MapReduce framework. The experimental results show that preprocessing methods built on big data techniques are more effective than existing methods.
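
As a rough illustration of map-side preprocessing of movie records (the abstract does not give the paper's field layout or cleaning rules, so the CSV schema below is hypothetical), a Hadoop Streaming mapper might drop malformed rows and normalize fields like this:

```python
#!/usr/bin/env python3
# clean_mapper.py -- hedged sketch of a map-side cleaning pass over raw movie records.
# Assumed input: CSV lines of the form "movie_id,title,release_date,screens,audience".
import sys
import csv

for row in csv.reader(sys.stdin):
    if len(row) != 5:
        continue  # drop structurally malformed records
    movie_id, title, release_date, screens, audience = row
    try:
        screens, audience = int(screens), int(audience)
    except ValueError:
        continue  # drop rows with non-numeric counts
    title = title.strip()
    if len(release_date) == 10:  # keep ISO-style YYYY-MM-DD dates only
        print(f"{movie_id}\t{title}|{release_date}|{screens}|{audience}")
```

Because each record is cleaned independently, this pass parallelizes trivially across mappers, which is the property that makes MapReduce attractive for the time-consuming preprocessing step.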

Comparison analysis of big data integration models (빅데이터 통합모형 비교분석)

  • Jung, Byung Ho; Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society, v.28 no.4, pp.755-768, 2017
  • As big data becomes the core of the fourth industrial revolution, big-data-based processing and analysis capabilities are expected to influence companies' future competitiveness. Although RHadoop and RHIPE, which integrate R with the Hadoop environment, have each been studied separately, few researchers have compared them. In this paper, we construct big data platforms based on RHadoop and RHIPE that are applicable to large-scale data, and implement machine learning algorithms such as multiple regression and logistic regression on the MapReduce framework. Using these implementations, we study performance and scalability for various sample sizes of actual and simulated data. The experiments demonstrate that both RHadoop and RHIPE scale well and efficiently process large data sets on commodity hardware, and that RHIPE is faster than RHadoop on almost all of the data.
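
The standard MapReduce formulation of multiple regression that such platforms implement can be sketched as follows. This is the generic sufficient-statistics decomposition, not the paper's RHadoop or RHIPE code, and the toy data and block size are assumptions.

```python
# Hedged sketch: MapReduce-style multiple regression via the normal equations.
# Each map task reduces its data block to the sufficient statistics X'X and X'y;
# the reduce step sums them and solves once for the OLS coefficients.
import numpy as np

def map_block(X_block, y_block):
    # Per-block statistics; their size depends only on the feature count.
    return X_block.T @ X_block, X_block.T @ y_block

def reduce_stats(stats):
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    return np.linalg.solve(xtx, xty)  # OLS coefficients

# Toy run: split one data set into "distributed" blocks of 250 rows.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=1000)
blocks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
beta = reduce_stats([map_block(xb, yb) for xb, yb in blocks])
print(beta)  # approximately [1.0, 2.0, -0.5]
```

Because each block collapses to a small (p x p, p) pair of matrices, shuffle volume is independent of the number of rows, which is why this decomposition scales well on MapReduce regardless of whether RHadoop or RHIPE drives it.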

SPARQL Query Processing in Distributed In-Memory System (분산 메모리 시스템에서의 SPARQL 질의 처리)

  • Jagvaral, Batselem; Lee, Wangon; Kim, Kang-Pil; Park, Young-Tack
    • Journal of KIISE, v.42 no.9, pp.1109-1116, 2015
  • In this paper, we propose a query-processing approach that uses Spark functional programming and a distributed memory system to address the computational overhead of SPARQL. In the semantic web, RDF ontology data is produced at large scale, and the main challenge is to query and manipulate such large ontologies with high throughput. Most existing studies on SPARQL have focused on deploying the Hadoop MapReduce framework, and although Hadoop MapReduce-based approaches have shown promising results, they achieve low throughput because of the underlying distributed file operations. Therefore, to speed up query processing, we suggest query-processing methods based on memory caching in a distributed memory system. Our approach also integrates a clause unification method that propagates bindings between clauses, exploiting Spark's join, map, and filter operations along with caching. In our experiments, we achieved a high level of performance relative to other approaches; in particular, our performance was nearly similar to that of Sempala, which has been considered the fastest query-processing system.
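
A minimal PySpark sketch of the cached-join style of evaluation described above (illustrative, not the authors' system): each triple pattern becomes a filter, the shared variable drives a join, and the triple set is cached in distributed memory. The sample triples and predicate names are assumptions.

```python
# Query analogue: SELECT ?s ?city WHERE { ?s :worksAt ?org . ?org :locatedIn ?city }
from pyspark import SparkContext

sc = SparkContext(appName="sparql-join-sketch")
triples = sc.parallelize([
    ("alice",  "worksAt",   "acme"),
    ("bob",    "worksAt",   "globex"),
    ("acme",   "locatedIn", "seoul"),
    ("globex", "locatedIn", "busan"),
]).cache()  # keep the RDF data in cluster memory instead of re-reading HDFS

# Each triple pattern is a filter; both sides are keyed by the join variable ?org.
works_at   = triples.filter(lambda t: t[1] == "worksAt").map(lambda t: (t[2], t[0]))
located_in = triples.filter(lambda t: t[1] == "locatedIn").map(lambda t: (t[0], t[2]))

# The join on ?org yields the (?s, ?city) bindings.
answers = works_at.join(located_in).map(lambda kv: kv[1])
print(answers.collect())  # [('alice', 'seoul'), ('bob', 'busan')]
```

Caching the triples RDD is what separates this from HDFS-backed MapReduce joins: repeated patterns re-read memory rather than disk, which is the throughput gap the abstract points to.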

MapReduce-based Localized Linear Regression for Electricity Price Forecasting (전기 가격 예측을 위한 맵리듀스 기반의 로컬 단위 선형회귀 모델)

  • Han, Jinju; Lee, Ingyu; On, Byung-Won
    • The Transactions of the Korean Institute of Electrical Engineers P, v.67 no.4, pp.183-190, 2018
  • Predicting electricity prices accurately is an important task in the electricity trading market. Various approaches have been proposed for the electricity price forecasting problem, and linear regression-based approaches are known to be the best. However, the use of such methods is limited by low accuracy and performance. With traditional linear regression, it is not practical to find a nonlinear regression model that explains the training data well: if the training data is complex (i.e., few individual records and many features), it is difficult to find a polynomial function with n terms that fits the training data. On the other hand, when a linear regression model is used to approximate a nonlinear one, the model's accuracy drops considerably because it does not reflect the characteristics of the training data. To cope with this problem, we propose a new electricity price forecasting method that divides the entire dataset into multiple splits and finds the best linear regression model for each, each model being optimal within its own split. To improve performance, we further recast the proposed localized linear regression method as map and reduce steps, the parallel-processing model for data stored in the Hadoop Distributed File System. Our experimental results show that the proposed model outperforms the existing linear regression model: its accuracy is improved by 45% and it runs 5 times faster than the existing linear regression-based model.
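
A minimal sketch of the localized idea: fit one ordinary least-squares model per split in the map step, then route predictions to the matching local model. The splitting criterion (hour of day), the single feature, and the toy data are assumptions; the abstract does not give the paper's actual split strategy.

```python
# Hedged sketch of localized linear regression in map/reduce style.
import numpy as np

def map_fit(partition_id, X, y):
    # Each map task fits an OLS model to its own split of the data.
    X1 = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return partition_id, beta

def predict(models, partition_of, x):
    # At query time, a point is routed to its split's local model.
    beta = models[partition_of(x)]
    return beta[0] + beta[1:] @ x

# Toy data with two price regimes; hour-of-day ranges stand in for the splits.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 24, size=(400, 1))
price = np.where(hours < 12, 30 + 2 * hours, 80 - 1.5 * hours).ravel()
price += rng.normal(0, 1, 400)

partition_of = lambda x: 0 if x[0] < 12 else 1
splits = {0: hours[:, 0] < 12, 1: hours[:, 0] >= 12}
models = dict(map_fit(pid, hours[mask], price[mask]) for pid, mask in splits.items())
print(predict(models, partition_of, np.array([6.0])))   # ~42
print(predict(models, partition_of, np.array([18.0])))  # ~53
```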

A JobTracker Fault-tolerant Mechanism for MapReduce Framework (MapReduce 프레임워크를 위한 JobTracker 결함허용 메커니즘)

  • Hwang, Byung-Hyun; Park, Kie-Jin
    • Proceedings of the Korean Information Science Society Conference, 2010.06a, pp.317-318, 2010
  • To provide cloud computing services, it is essential to build an IT infrastructure capable of the distributed data storage and parallel processing that cloud computing requires. To this end, research on HDFS (Hadoop Distributed File System), one of the major distributed file systems, and on the MapReduce framework for parallel data processing has been attracting attention. However, the JobTracker node that anchors the MapReduce framework is a single point of failure (SPoF): if the JobTracker node fails during a job, the entire job fails. To solve this problem, this paper proposes a fault-tolerance mechanism that can cope with JobTracker node failures in the MapReduce framework.
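
The abstract does not detail the proposed mechanism, but a heartbeat-based active/standby failover, one common way to remove such a SPoF, can be sketched as follows; the intervals, thresholds, and checkpoint contents are all assumptions, not the paper's design.

```python
# Hedged sketch: a standby JobTracker promotes itself when the active node
# misses consecutive heartbeats, resuming from the last shipped job state.
import time

HEARTBEAT_INTERVAL = 2.0   # seconds between expected heartbeats (assumed)
MISS_THRESHOLD = 3         # missed beats before declaring the active node dead (assumed)

class StandbyJobTracker:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.checkpoint = None
        self.active = False

    def on_heartbeat(self, job_state):
        # The active JobTracker ships job metadata with each beat, so the
        # standby can resume scheduling from the last checkpointed state.
        self.last_heartbeat = time.monotonic()
        self.checkpoint = job_state

    def monitor_once(self):
        missed = (time.monotonic() - self.last_heartbeat) / HEARTBEAT_INTERVAL
        if missed >= MISS_THRESHOLD and not self.active:
            self.active = True  # take over: rebind the RPC endpoint, resume jobs
            print("JobTracker failover: standby promoted, resuming from checkpoint")
```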

The MapReduce framework for Large-scale Data Analysis: Overview and Research Trends (대규모 데이터 분석을 위한 MapReduce 기술의 연구 동향)

  • Lee, K.H.; Park, W.J.; Cho, K.S.; Ryu, W.
    • Electronics and Telecommunications Trends, v.28 no.6, pp.156-166, 2013
  • MapReduce is recognized as an effective tool for the parallel processing of large volumes of data in diverse formats. In particular, Hadoop, the open-source implementation of MapReduce, is widely used across many fields and has drawn much attention as the most representative big data solution to date. However, along with the advantages of its structural characteristics, MapReduce has several constraints and drawbacks, and much research and system refinement aimed at improving it has been carried out in academia and industry alike. This article introduces the characteristics of the MapReduce framework for large-scale data analysis and recent research on improving it, and offers an outlook on what future large-scale data processing may look like.

Kerberos Authentication Deployment Policy of US in Big data Environment (빅데이터 환경에서 미국 커버로스 인증 적용 정책)

  • Hong, Jinkeun
    • Journal of Digital Convergence, v.11 no.11, pp.435-441, 2013
  • This paper reviews the Kerberos security authentication scheme and its deployment policy for big data services. It analyzes the security-technology problems of Hadoop-framework-based big data service environments. It also considers the issues that arise when applying the Kerberos security authentication system, analyzing deployment policy around the main problems that occur in commercial business. Regarding related Kerberos policies applied in the US, it surveys topics such as cross-platform interoperability support, automated Kerberos setup, integration issues, OTP authentication, SSO, and identity management.

A Scalable OWL Horst Lite Ontology Reasoning Approach based on Distributed Cluster Memories (분산 클러스터 메모리 기반 대용량 OWL Horst Lite 온톨로지 추론 기법)

  • Kim, Je-Min; Park, Young-Tack
    • Journal of KIISE, v.42 no.3, pp.307-319, 2015
  • Current ontology studies use the Hadoop distributed storage framework to perform MapReduce algorithm-based reasoning over scalable ontologies. In this paper, however, we propose a novel approach to scalable Web Ontology Language (OWL) Horst Lite ontology reasoning based on distributed cluster memories. Rule-based reasoning, which is frequently used for scalable ontologies, iteratively executes triple-format ontology rules until no new inferred data is produced. Therefore, when scalable ontology reasoning is performed on hard drives, the reasoner suffers from performance limitations. To overcome this drawback, we propose an approach that loads the ontologies into distributed cluster memories using Spark, a memory-based distributed computing framework, which then executes the reasoning. To implement an appropriate OWL Horst Lite reasoning system on Spark, our method divides the scalable ontologies into blocks, loads each block into the cluster nodes, and subsequently handles the data in the distributed memories. We used the Lehigh University Benchmark, which evaluates ontology inference and search speed, to experimentally evaluate the proposed methods on LUBM8000 (1.1 billion triples, 155 gigabytes). Compared with WebPIE, a representative MapReduce algorithm-based scalable ontology reasoner, the proposed approach showed a throughput improvement of 320% (62k/s versus WebPIE's 19k/s).
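
A hedged PySpark sketch of the iterate-until-fixpoint reasoning style described above, showing a single OWL Horst-style rule (transitivity of rdfs:subClassOf) rather than the full rule set; the sample data and the rule choice are assumptions.

```python
# Rule-based reasoning to a fixpoint over in-memory RDDs: keep firing the rule
# (x subClassOf y), (y subClassOf z) => (x subClassOf z) until nothing new appears.
from pyspark import SparkContext

sc = SparkContext(appName="horst-fixpoint-sketch")
sub = sc.parallelize([("A", "B"), ("B", "C"), ("C", "D")]).cache()  # (sub, super)

while True:
    derived = (sub.map(lambda p: (p[1], p[0]))    # key by the shared variable y
                  .join(sub)                       # (y, (x, z))
                  .map(lambda kv: (kv[1][0], kv[1][1]))
                  .subtract(sub))                  # keep only genuinely new pairs
    if derived.isEmpty():
        break  # fixpoint reached: no rule fires any more
    sub = sub.union(derived).distinct().cache()

print(sorted(sub.collect()))  # includes derived pairs such as ('A', 'D')
```

Each iteration works entirely on cached RDDs, which is the contrast with disk-bound MapReduce reasoners: the iterative rule applications never round-trip through a distributed file system.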

Real time predictive analytic system design and implementation using Bigdata-log (빅데이터 로그를 이용한 실시간 예측분석시스템 설계 및 구현)

  • Lee, Sang-jun; Lee, Dong-hoon
    • Journal of the Korea Institute of Information Security & Cryptology, v.25 no.6, pp.1399-1410, 2015
  • Gartner insists that companies must understand the coming era of data competition and substantially change their survival paradigms for it. As successful business cases of predictive analytics based on statistical algorithms have come to light, leading enterprises are shifting from after-the-fact responses based on past data analysis to preemptive countermeasures based on predictive analysis. This trend is influencing security analysis and log analysis, and cases applying big data analysis frameworks to large-scale log analysis and to intelligent, long-term security analysis are being reported one after another. However, a Hadoop-based big data platform cannot accommodate all the functions and techniques required of a big data log analysis system, so standalone big data log analysis products are still being offered on the market. This paper proposes a framework for such standalone big data log analysis systems that is equipped with both real-time and non-real-time predictive analysis engines and can preemptively cope with cyber attacks.