• Title/Summary/Keyword: MapReduce

An Algorithms for Tournament-based Big Data Analysis (토너먼트 기반의 빅데이터 분석 알고리즘)

  • Lee, Hyunjin
    • Journal of Digital Contents Society, v.16 no.4, pp.545-553, 2015
  • While all data has value in itself, most data collected in the real world is random and unstructured. To extract useful information from such data, transformation and analysis algorithms are needed; data mining serves this purpose. Today there is a need not only for a variety of data mining techniques but also for the computational capacity and fast analysis times that huge volumes of data demand. Hadoop is commonly used to store such volumes, and the MapReduce framework is the usual way to analyze data stored in Hadoop. In this paper, we develop a tournament-based MapReduce method that makes it efficient to port an algorithm developed on a single machine to the MapReduce framework. The proposed method can host many analysis algorithms, and we demonstrate its usefulness by applying it to two frequently used data mining algorithms: k-means clustering and k-nearest neighbor classification.
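
The abstract describes the tournament scheme only at a high level; the sketch below is a minimal single-machine illustration of the idea for k-nearest-neighbor search, assuming mappers that return local top-k candidates which are then merged pairwise, tournament style, so each merge keeps only k survivors. All function names and data are hypothetical.

```python
import heapq
import math

def map_partition(partition, query, k):
    """Mapper: return the k nearest points of one data partition."""
    return heapq.nsmallest(k, partition, key=lambda p: math.dist(p, query))

def tournament_merge(candidates, query, k):
    """Merge partial results pairwise, keeping only k survivors per match."""
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            pool = candidates[i] + (candidates[i + 1] if i + 1 < len(candidates) else [])
            merged.append(heapq.nsmallest(k, pool, key=lambda p: math.dist(p, query)))
        candidates = merged
    return candidates[0]

# Usage: four "mapper" partitions, k = 3
partitions = [[(1, 2), (5, 5)], [(0, 1), (9, 9)], [(2, 2), (8, 1)], [(4, 0), (3, 3)]]
locals_ = [map_partition(p, query=(0, 0), k=3) for p in partitions]
print(tournament_merge(locals_, query=(0, 0), k=3))  # the 3 points nearest (0, 0)
```

Because keeping the k best of a union is associative, the pairwise merges can run in any order, which is what makes the tournament shape fit a MapReduce combiner/reducer tree.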

An Efficient MapReduce-based Skyline Query Processing Method with Two-level Grid Blocks (2-계층 그리드 블록을 이용한 효과적인 맵리듀스 기반 스카이라인 질의 처리 기법)

  • Ryu, Hyeongcheol;Jung, Sungwon
    • Journal of KIISE, v.44 no.6, pp.613-620, 2017
  • Skyline queries are used extensively to solve various problems, such as decision-making, because they find data that meet a variety of user criteria. Recent research has focused on processing skyline queries with the MapReduce framework for large databases, mainly by applying existing index structures to MapReduce. In a skyline, data closer to the origin dominate more of the space, but existing index structures do not reflect this characteristic. In this paper, we propose a grid-block structure that groups grid cells to match the characteristics of a skyline, and a two-level grid-block structure that remains useful even when no data lie close to the origin. We also propose an efficient skyline-query algorithm that uses the two-level grid-block structure.
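
As a rough illustration of the ingredients the abstract relies on, the sketch below shows a standard dominance test, a naive in-memory skyline (the kind of computation a reducer would run on surviving points), and a simplified grid-cell pruning rule; the cell layout and tie handling on cell boundaries are simplifying assumptions, not the paper's exact structure.

```python
def dominates(p, q):
    """p dominates q if p <= q in every dimension and < in at least one (minimization)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive in-memory skyline; a reducer would run this on points that survive pruning."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def cell_pruned(cell, occupied):
    """A grid cell (min_corner, max_corner) can be skipped entirely if some non-empty
    cell's max corner is <= this cell's min corner in every dimension."""
    return any(all(mx <= mn for mx, mn in zip(o_max, cell[0]))
               for (_, o_max) in occupied)

print(skyline([(1, 4), (2, 2), (3, 1), (4, 4)]))         # (4, 4) is dominated
occupied = [((0, 0), (1, 1))]                            # a non-empty cell near the origin
print(cell_pruned(((2, 2), (3, 3)), occupied))           # True: the whole cell is dominated
```

The pruning rule is why cells near the origin matter: one occupied cell there can eliminate whole regions of the grid before any point-level comparison runs.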

Structural Change Detection Technique for RDF Data in MapReduce (맵리듀스에서의 구조적 RDF 데이터 변경 탐지 기법)

  • Lee, Taewhi;Im, Dong-Hyuk
    • KIPS Transactions on Software and Data Engineering, v.3 no.8, pp.293-298, 2014
  • Detecting and understanding changes between RDF datasets is crucial for evolution, synchronization, and versioning systems on the web of data. However, current research on change detection remains unsatisfactory: it neither scales to large RDF datasets nor produces accurate RDF deltas. In this paper, we propose a scalable and effective change detection method based on MapReduce, a framework widely used to process and analyze large volumes of data. In particular, we focus on structure-based change detection, which adopts a strategy for comparing blank nodes in RDF data. Our method is composed of two MapReduce jobs. The first job partitions the triples containing blank nodes by grouping triples with the same blank node ID, and then computes the incoming path to each blank node. The second job partitions the triples by path and matches blank nodes with the Hungarian method. Experiments show that our approach is more accurate and effective than the previous approach.
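
The matching step of the second job can be illustrated with SciPy's linear_sum_assignment, an actual implementation of the Hungarian method; the cost function used here (size of the symmetric difference of each blank node's triple sets) and the data layout are illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_blank_nodes(old_nodes, new_nodes):
    """old_nodes/new_nodes: dict blank-node-id -> set of (predicate, object) pairs.
    Cost of pairing two blank nodes = size of the symmetric difference of their sets."""
    old_ids, new_ids = list(old_nodes), list(new_nodes)
    cost = np.array([[len(old_nodes[o] ^ new_nodes[n]) for n in new_ids]
                     for o in old_ids])
    rows, cols = linear_sum_assignment(cost)    # Hungarian method: min-cost matching
    return [(old_ids[r], new_ids[c]) for r, c in zip(rows, cols)]

old = {"_:b1": {("ex:name", '"Ann"')},
       "_:b2": {("ex:name", '"Bob"'), ("ex:age", "30")}}
new = {"_:x1": {("ex:name", '"Bob"'), ("ex:age", "30")},
       "_:x2": {("ex:name", '"Ann"')}}
print(match_blank_nodes(old, new))   # pairs _:b1 with _:x2 and _:b2 with _:x1
```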

Learning algorithms for big data logistic regression on RHIPE platform (RHIPE 플랫폼에서 빅데이터 로지스틱 회귀를 위한 학습 알고리즘)

  • Jung, Byung Ho;Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society, v.27 no.4, pp.911-923, 2016
  • Machine learning becomes increasingly important in the big data era. Logistic regression, a classification method in machine learning, has been widely used in various fields, including medicine, economics, marketing, and the social sciences. RHIPE, which integrates the R and Hadoop environments, has received little attention from researchers owing to the difficulty of its installation and of MapReduce implementation. In this paper, we present MapReduce implementations of the gradient descent and Newton-Raphson algorithms for logistic regression using RHIPE. The Newton-Raphson algorithm does not require a learning rate, whereas gradient descent needs one to be picked manually; we choose the learning rate with a mixed procedure of grid search and binary search to process big data efficiently. In the performance study, our Newton-Raphson algorithm outperforms gradient descent on all tested datasets.
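
The paper's implementation is in R on RHIPE; the plain-Python sketch below only illustrates the MapReduce decomposition, assuming each mapper computes a partial gradient and Hessian over its data split, a reducer sums them, and the driver applies the Newton-Raphson update (which, as the abstract notes, needs no learning rate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_stats(X, y, w):
    """Mapper: partial gradient and Hessian of the log-likelihood on one data split."""
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)                     # partial gradient (ascent direction)
    hess = -(X.T * (p * (1 - p))) @ X        # partial Hessian (negative definite)
    return grad, hess

def newton_step(splits, w):
    """Reduce + driver: sum the partial statistics, then solve the Newton system."""
    stats = [map_stats(X, y, w) for X, y in splits]   # "map" phase
    grad = sum(g for g, _ in stats)                   # "reduce": elementwise sums
    hess = sum(h for _, h in stats)
    return w - np.linalg.solve(hess, grad)            # Newton-Raphson update

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = (rng.random(200) < sigmoid(X @ np.array([0.5, -1.0, 2.0]))).astype(float)
splits = [(X[:100], y[:100]), (X[100:], y[100:])]     # two "mapper" inputs
w = np.zeros(3)
for _ in range(10):
    w = newton_step(splits, w)
print(w)   # approaches the true coefficients; no learning rate required
```

Only the per-split sums cross the network, which is what makes both algorithms fit MapReduce: the reducer sees d-dimensional gradients and d-by-d Hessians, never the raw data.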

The Design of Blog Network Analysis System using Map/Reduce Programming Model (Map/Reduce를 이용한 블로그 연결망 분석 시스템 설계)

  • Joe, In-Whee;Park, Jae-Kyun
    • The Journal of Korean Institute of Communications and Information Sciences, v.35 no.9B, pp.1259-1265, 2010
  • Recently, online social networks have been growing along with the development of the Internet. The most representative service is the blog, a type of personal web site usually maintained by an individual with regular entries of commentary. Blogs are related to each other, and in this paper this set of relations is called the blog network. In a blog network, posts in one blog can diffuse to other blogs. Analyzing information diffusion in the blog world is a useful research issue, applicable to predicting information diffusion, anomaly detection, marketing, and revitalizing the blog world. Existing studies on network analysis do not consider the passage of time, and they can only measure a node's network activity by its number of direct connections. As one solution, this paper suggests a new method of measuring blog network activity using a logistic curve model and the cosine similarity of keywords, implemented with the Map/Reduce programming model.
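
The two measures named in the abstract can be sketched as follows; the term-frequency weighting and the logistic-curve parameterization are assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity of two keyword lists via term-frequency vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def logistic_curve(t, cap, rate, midpoint):
    """Logistic model of cumulative diffusion: activity saturates at `cap`."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))

print(cosine_similarity(["mapreduce", "hadoop", "blog"], ["hadoop", "blog", "network"]))
print([round(logistic_curve(t, cap=100, rate=1.2, midpoint=5), 1) for t in range(0, 10, 2)])
```

In a Map/Reduce setting, mappers would emit per-blog keyword counts and reducers would compute the pairwise similarities, with the logistic curve fitted to each post's cumulative spread over time.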

An Efficient Clustering Method based on Multi Centroid Set using MapReduce (맵리듀스를 이용한 다중 중심점 집합 기반의 효율적인 클러스터링 방법)

  • Kang, Sungmin;Lee, Seokjoo;Min, Jun-ki
    • KIISE Transactions on Computing Practices, v.21 no.7, pp.494-499, 2015
  • As the size of data increases, it becomes important to identify its properties by analyzing big data. In this paper, we propose an efficient k-means-based clustering technique, called MCSK-Means (Multi Centroid Set k-Means), using the distributed parallel processing framework MapReduce. A problem with the k-means algorithm is that the accuracy of clustering depends on the randomly created initial centroids. To alleviate this problem, the MCSK-Means algorithm reduces the dependency on initial centroids by using m sets, each consisting of k centroids. In addition, we apply agglomerative hierarchical clustering to derive the final k centroids from the centroids of the m centroid sets produced by the clustering phase. We implemented MCSK-Means on the MapReduce framework to process big data efficiently.
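
A minimal sketch of the merging step described above: assuming a clustering phase has already produced m centroid sets of k centroids each, the m*k centroids are collapsed back to k with agglomerative hierarchical clustering (SciPy's linkage/fcluster here; the paper's own linkage criterion is not specified, so average linkage is an assumption).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def merge_centroid_sets(centroid_sets, k):
    """Agglomerative step of an MCSK-Means-style method: collapse the m*k
    centroids from m runs into k final centroids by hierarchical clustering."""
    pool = np.vstack(centroid_sets)                        # shape (m*k, d)
    labels = fcluster(linkage(pool, method="average"), t=k, criterion="maxclust")
    return np.array([pool[labels == c].mean(axis=0) for c in range(1, k + 1)])

# Three "centroid sets" (m = 3, k = 2) that a clustering phase might produce
sets_ = [np.array([[0.1, 0.0], [5.0, 5.1]]),
         np.array([[0.0, 0.2], [4.9, 5.0]]),
         np.array([[0.2, 0.1], [5.1, 4.9]])]
print(merge_centroid_sets(sets_, k=2))   # ~[0, 0] and ~[5, 5]
```

Averaging over m independent initializations is what damps the sensitivity to any single random starting set.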

A Parallel Approach for Accurate and High Performance Gridding of 3D Point Data (3D 점 데이터 그리딩을 위한 고성능 병렬처리 기법)

  • Lee, Changseop;Rizki, Permata Nur Miftahur;Lee, Heezin;Oh, Sangyoon
    • KIPS Transactions on Computer and Communication Systems, v.3 no.8, pp.251-260, 2014
  • 3D point data is utilized in various industry domains for its highly accurate representation of an object's surface, and it is used extensively in geography for terrain scanning and analysis. Generally, 3D point data must be converted by gridding, which produces a regularly spaced array of z values from irregularly spaced xyz data, but interpolating the grid coordinates requires long processing times and high resource costs. Kriging interpolation is attractive for gridding because it is more accurate than other methods; however, it has not been used frequently because its processing is complex and slow. In this paper, we present a parallel gridding algorithm that incorporates Kriging, together with a grid data structure that fits the algorithm to the MapReduce paradigm. Experiments were conducted on 1.6 and 4.3 billion points from airborne LiDAR files using the proposed MapReduce structure, and the results show that total execution time decreases by more than a factor of three compared with the conventional sequential program on three heterogeneous clusters.
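
Full Kriging is too long to reproduce here; the sketch below shows only the MapReduce-friendly decomposition the paper builds on, with mappers binning points into grid tiles and each reducer interpolating its tile from local points. Inverse-distance weighting stands in for the paper's Kriging interpolator, and the tiling scheme is a simplifying assumption.

```python
from collections import defaultdict

def map_to_tiles(points, tile_size):
    """'Map' phase: key each (x, y, z) point by the grid tile that contains it."""
    tiles = defaultdict(list)
    for x, y, z in points:
        tiles[(int(x // tile_size), int(y // tile_size))].append((x, y, z))
    return tiles

def interpolate_tile(points, gx, gy, power=2.0):
    """'Reduce' phase: estimate z at grid node (gx, gy) from the tile's points.
    Inverse-distance weighting stands in for the paper's Kriging interpolator."""
    num = den = 0.0
    for x, y, z in points:
        d2 = (x - gx) ** 2 + (y - gy) ** 2
        if d2 == 0:
            return z                              # grid node coincides with a sample
        w = 1.0 / d2 ** (power / 2)
        num, den = num + w * z, den + w
    return num / den

tiles = map_to_tiles([(0.2, 0.3, 10.0), (0.8, 0.9, 14.0), (3.5, 3.5, 2.0)], tile_size=2)
print(interpolate_tile(tiles[(0, 0)], gx=0.5, gy=0.5))
```

Because each tile is interpolated independently, the expensive per-neighborhood solve (the Kriging system in the paper's case) parallelizes across reducers.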

2D Grid Map Compensation using an ICP Algorithm (ICP 알고리즘을 이용한 2차원 격자지도 보정)

  • Lee, Dong-Ju;Hwang, Yu-Seop;Yun, Yeol-Min;Lee, Jang-Myung
    • Journal of Institute of Control, Robotics and Systems, v.20 no.11, pp.1170-1174, 2014
  • This paper suggests using the ICP (Iterative Closest Point) algorithm to compensate a two-dimensional grid map. ICP is a typical algorithm for matching distance data. When a two-dimensional map is built from laser scanner values, warping and distortion occur because of differences between the sensed distances and the true ones. The ICP algorithm is used to reduce these line errors. The proposed method is validated through experiments that match a measured two-dimensional map against a reference two-dimensional map.
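
A minimal 2D ICP iteration, for illustration: nearest-neighbor correspondences via a k-d tree followed by the standard SVD (Kabsch) rigid alignment. This is the textbook algorithm the abstract names, not the authors' exact implementation; the test data is synthetic.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source, target):
    """One ICP iteration: match each source point to its nearest target point,
    then compute the best-fit rotation R and translation t (Kabsch/SVD)."""
    _, idx = cKDTree(target).query(source)
    matched = target[idx]
    src_c, tgt_c = source.mean(axis=0), matched.mean(axis=0)
    H = (source - src_c).T @ (matched - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt_c - R @ src_c
    return source @ R.T + t

target = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.deg2rad(5)
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
source = target @ rot.T + [0.1, -0.05]        # warped copy of the reference map
for _ in range(20):
    source = icp_step(source, target)
print(np.abs(source - target).max())          # near zero after convergence
```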

Reduction of Inter-MAP Handoff Rate Based on 2-Layers in Hierarchical Mobile IPv6 (계층적 모바일 IP 네트워크에서 2 계층에 기반한 Inter-MAP Handoff Rate의 감소기법)

  • Jeong, Jong-Pil;Chung, Min-Young;Choo, Hyun-Seung
    • Proceedings of the Korea Information Processing Society Conference, 2008.05a, pp.999-1002, 2008
  • Many schemes to reduce the inter-MAP handoff delay in Hierarchical Mobile IPv6 have been proposed, but the previous schemes waste relatively large amounts of network resources to decrease the path-rerouting delay. In this paper, we propose a 2-layered MAP concept in which seamless inter-MAP handoff can be supported regardless of path-rerouting time. As a result, both the waste of wired resources and the inter-MAP handoff rate can be reduced. Performance analysis and simulation show that the inter-MAP handoff rate for non-real-time traffic is only about 1/3 of the conventional result. These advantages come without increasing the total handoff rate or requiring additional MAPs.
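
A toy model of why a second MAP layer lowers the inter-MAP handoff rate: a mobile node triggers an inter-MAP handoff only when it crosses a MAP-domain boundary, and anchoring non-real-time traffic at an upper-layer MAP that spans several lower-layer domains makes such crossings rarer. The random walk and domain sizes below are purely illustrative, not the paper's analytical model.

```python
import random

def inter_map_handoffs(path, cells_per_map):
    """Count boundary crossings: consecutive cells that map to different MAPs."""
    domains = [cell // cells_per_map for cell in path]
    return sum(1 for a, b in zip(domains, domains[1:]) if a != b)

random.seed(1)
cell, path = 50, [50]
for _ in range(10_000):                        # 1-D random walk over cells
    cell = max(0, cell + random.choice([-1, 1]))
    path.append(cell)

lower = inter_map_handoffs(path, cells_per_map=4)    # single-layer MAP domains
upper = inter_map_handoffs(path, cells_per_map=12)   # upper MAP spans 3 lower domains
print(lower, upper, upper / lower)             # upper-layer rate is roughly a third
```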

SPARQL Query Processing in Distributed In-Memory System (분산 메모리 시스템에서의 SPARQL 질의 처리)

  • Jagvaral, Batselem;Lee, Wangon;Kim, Kang-Pil;Park, Young-Tack
    • Journal of KIISE, v.42 no.9, pp.1109-1116, 2015
  • In this paper, we propose a query processing approach that uses Spark functional programming on a distributed in-memory system to address the computational overhead of SPARQL. In the semantic web, RDF ontology data is produced at large scale, and a central challenge is to query and manipulate such large ontologies with high throughput. Most existing studies on SPARQL have focused on deploying the Hadoop MapReduce framework, and although these approaches have shown promising results, they achieve low throughput due to the underlying distributed file operations. To speed up query processing, we therefore suggest query-processing methods based on memory caching in a distributed memory system. Our approach also integrates a clause unification method that propagates bindings between clauses by exploiting Spark's join, map, and filter methods along with caching. In our experiments, we achieved high performance relative to other approaches; in particular, our performance was nearly equal to that of Sempala, which has been considered the fastest query processing system.
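
A PySpark sketch of the pattern the abstract describes, assuming PySpark is installed: the triple RDD stays cached in memory, each triple pattern is evaluated with filter/map, and shared variables are unified with join. The toy graph and query are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "sparql-sketch")

# Tiny RDF graph as (subject, predicate, object) triples, cached in memory
triples = sc.parallelize([
    ("alice", "knows", "bob"),
    ("bob",   "knows", "carol"),
    ("alice", "age",   "34"),
]).cache()

# SPARQL-like query: SELECT ?x ?z WHERE { ?x knows ?y . ?y knows ?z }
p1 = triples.filter(lambda t: t[1] == "knows").map(lambda t: (t[2], t[0]))  # keyed by ?y
p2 = triples.filter(lambda t: t[1] == "knows").map(lambda t: (t[0], t[2]))  # keyed by ?y
answers = p1.join(p2).map(lambda kv: (kv[1][0], kv[1][1]))                  # (?x, ?z)

print(answers.collect())   # [('alice', 'carol')]
sc.stop()
```

Caching the triple RDD is the key contrast with Hadoop MapReduce: each clause re-reads the graph from memory rather than from distributed files, which is where the throughput gain the abstract reports comes from.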