• Title/Summary/Keyword: map-reduce

K Nearest Neighbor Joins for Big Data Processing based on Spark (Spark 기반 빅데이터 처리를 위한 K-최근접 이웃 연결)

  • JIAQI, JI;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.9 / pp.1731-1737 / 2017
  • K Nearest Neighbor Join (KNN Join) is a simple yet effective method in machine learning. It has been widely used on small datasets in the past. As the volume of data increases, it becomes infeasible to run this model in a real application on a single machine because of memory and time restrictions. MapReduce, a popular batch processing model that runs on a cluster with a large number of computers, is now widely used for large-scale data processing. Hadoop is a framework that implements MapReduce, but its performance can be further improved by a newer framework named Spark. In the present study, we provide a KNN Join implementation based on Spark. With the advantage of Spark's in-memory computation, it should be faster and more effective than Hadoop. In our experiments, we study the influence of different factors on running time and demonstrate the robustness and efficiency of our approach.
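
To make the KNN Join idea concrete, here is a minimal single-process sketch in Python. It is not the paper's Spark implementation: it assumes a simple block-based scheme in which the dataset S is split into blocks, a map step finds local top-k candidates per block, and a reduce step merges them into the global top-k. The names `map_phase`, `reduce_phase`, and `knn_join` are illustrative.

```python
import heapq
import math
from collections import defaultdict


def map_phase(r_points, s_block, k):
    """Emit (r_id, local top-k candidates) for one block of S."""
    for r_id, r in r_points.items():
        candidates = heapq.nsmallest(
            k, ((math.dist(r, s), s_id) for s_id, s in s_block.items())
        )
        yield r_id, candidates


def reduce_phase(candidate_lists, k):
    """Merge per-block candidates into the global k nearest neighbours."""
    merged = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    return [s_id for _, s_id in merged]


def knn_join(r_points, s_blocks, k):
    """For every point in R, return the ids of its k nearest points in S."""
    grouped = defaultdict(list)  # stands in for the shuffle step
    for block in s_blocks:
        for r_id, candidates in map_phase(r_points, block, k):
            grouped[r_id].append(candidates)
    return {r_id: reduce_phase(lists, k) for r_id, lists in grouped.items()}
```

Because each block only needs to report its own k best candidates, the blocks can be processed independently, which is what makes the join amenable to a cluster.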

A Development Study of The VPT for the improvement of Hadoop performance (하둡 성능 향상을 위한 VPT 개발 연구)

  • Yang, Ill Deung;Kim, Seong Ryeol
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.9 / pp.2029-2036 / 2015
  • Hadoop MR (MapReduce) uses a partition function to pass the outputs of mappers to reducers. The partition function determines the target reducer by calculating a hash value from the key and taking it modulo the number of reducers. The legacy partition function does not divide the job effectively because it is highly sensitive to the key distribution. If the job is not divided effectively, the total processing time of the job can increase because some reducers need more time to process their share. This paper proposes the VPT (Virtual Partition Table) and tests it on skewed data in which some keys predominate. Applying the VPT improved processing time by three seconds on average, and we expect the improvement to grow as the data volume increases.
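
The sensitivity of hash partitioning to key distribution can be illustrated with a short Python sketch. Hadoop's default `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below mirrors that formula but uses `zlib.crc32` as a stand-in hash, which is an assumption made only to keep the example deterministic. It does not implement the VPT, whose details are not given in the abstract.

```python
import zlib
from collections import Counter


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer as hash(key) mod reducer count (crc32 is a stand-in hash)."""
    h = zlib.crc32(key.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_reducers


def reducer_load(keys, num_reducers):
    """Count how many records each reducer would receive."""
    load = Counter(hash_partition(k, num_reducers) for k in keys)
    return [load.get(r, 0) for r in range(num_reducers)]


# A skewed workload: one hot key accounts for 90 of 100 records.
skewed_keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]
print(reducer_load(skewed_keys, 4))
```

Every record with the same key lands on the same reducer, so the reducer that receives the hot key handles at least 90 records regardless of how many reducers are configured.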

Big Data Processing and Performance Improvement for Ship Trajectory using MapReduce Technique

  • Kim, Kwang-Il;Kim, Joo-Sung
    • Journal of the Korea Society of Computer and Information / v.24 no.10 / pp.65-70 / 2019
  • Recently, ship trajectory data consisting of ship position, speed, course, and so on can be obtained from the Automatic Identification System (AIS) device with which all ships should be equipped. More than 2 GB of such data is gathered every day at a crowded sea port and used to analyze ship traffic statistics and patterns. In this study, we propose a method to process ship trajectory data efficiently with distributed computing resources using the MapReduce algorithm. In the data preprocessing phase, ship dynamic and static data are integrated into a target dataset, and ship trajectories that are not of interest are filtered out. In the mapping phase, we convert each ship's position to a Geohash code and assign the Geohash and the ship's MMSI as key and value. In the reducing phase, key-value pairs are sorted by key and the ship traffic in each grid cell is counted. To evaluate the proposed method, we implemented it and compared its performance with that of the IALA Waterway Risk Assessment Program (IWRAP). The data processing performance improved by a factor of 1 to 4 over the existing ship trajectory analysis program.
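
The map and reduce phases described above can be sketched in plain Python as follows. The Geohash encoder is the standard bit-interleaving algorithm; the record layout `(mmsi, lat, lon)`, the default precision, and the choice to count distinct MMSIs per cell are assumptions for illustration, since the abstract does not specify them.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a Geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | int(bit)
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def map_position(record, precision=6):
    """Map step: emit (geohash cell, MMSI) for one AIS position record."""
    mmsi, lat, lon = record
    return geohash_encode(lat, lon, precision), mmsi


def count_traffic(records, precision=6):
    """Reduce step: group by cell and count distinct ships in each cell."""
    cells = defaultdict(set)
    for record in records:
        cell, mmsi = map_position(record, precision)
        cells[cell].add(mmsi)
    return {cell: len(ships) for cell, ships in sorted(cells.items())}
```

A shorter Geohash (lower `precision`) gives larger grid cells, so the precision parameter effectively sets the resolution of the traffic density grid.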

MapReduce-based Localized Linear Regression for Electricity Price Forecasting (전기 가격 예측을 위한 맵리듀스 기반의 로컬 단위 선형회귀 모델)

  • Han, Jinju;Lee, Ingyu;On, Byung-Won
    • The Transactions of the Korean Institute of Electrical Engineers P / v.67 no.4 / pp.183-190 / 2018
  • Predicting electricity prices accurately is an important task in the electricity trading market. To address the electricity price forecasting problem, various approaches have been proposed so far, and linear regression-based approaches are known to be the best. However, the use of such linear regression-based methods is limited by low accuracy and performance. In traditional linear regression methods, it is not practical to find a nonlinear regression model that explains the training data well. If the training data is complex (i.e., small-sized individual data and large-sized features), it is difficult to find a polynomial function with n terms as a model that fits the training data. On the other hand, when a linear regression model is used to approximate a nonlinear regression model, the accuracy of the model drops considerably because it does not accurately reflect the characteristics of the training data. To cope with this problem, we propose a new electricity price forecasting method that divides the entire dataset into multiple split datasets and finds the best linear regression model for each, so that each model is optimal for its own dataset. Meanwhile, to improve the performance of the proposed method, we adapt the proposed localized linear regression method to the map-and-reduce style, a framework for parallel processing of data stored in a Hadoop distributed file system. Our experimental results show that the proposed model outperforms the existing linear regression model. Specifically, the accuracy of the proposed method is improved by 45% and it runs 5 times faster than the existing linear regression-based model.
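
The idea of fitting one linear model per split can be sketched in Python as below. This is a simplified illustration, not the paper's method: it assumes a single feature, equal-width splits along that feature, and ordinary least squares per split. The names `fit_line` and `localized_fit` are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def localized_fit(points, num_splits):
    """Assign each point to a split (map), then fit one line per split (reduce)."""
    lo = min(x for x, _ in points)
    hi = max(x for x, _ in points)
    width = (hi - lo) / num_splits or 1.0  # avoid zero width when all x are equal
    groups = defaultdict(list)
    for x, y in points:
        split_id = min(int((x - lo) / width), num_splits - 1)
        groups[split_id].append((x, y))
    return {split_id: fit_line(pts) for split_id, pts in groups.items()}
```

For a V-shaped series such as y = |x|, a single global line fits poorly, while two local lines (one per half) fit each side exactly, which is the intuition behind localizing the regression.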

Framework for Efficient Web Page Prediction using Deep Learning

  • Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.25 no.12 / pp.165-172 / 2020
  • Recently, due to the exponential growth of access information on the web, the importance of predicting a user's next web page has been increasing. One of the methods that can be used for predicting a user's next web page is deep learning. To predict the next web page, web logs are analyzed through data preprocessing, and then the user's next web page is predicted from the output of the analyzed web logs using a deep learning algorithm. In this paper, we propose a framework for web page prediction that includes methods for web log preprocessing followed by deep learning techniques for web prediction. To speed up the preprocessing of large web logs, a Hadoop-based MapReduce programming model is used. In addition, we present a web prediction system that applies an efficient deep learning technique to the output of web log preprocessing for training and prediction. Through experiments, we show the performance improvement of our proposed method over traditional methods. We also show the accuracy of our prediction.

Compression Conversion and Storing of Large RDF datasets based on MapReduce (맵리듀스 기반 대량 RDF 데이터셋 압축 변환 및 저장 방법)

  • Kim, InA;Lee, Kyong-Ha;Lee, Kyu-Chul
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.487-494 / 2022
  • With the recent demand for data-driven analysis, the size of the knowledge graph, which is the data to be analyzed, has gradually increased, reaching about 82 billion edges when extracted from the web as a knowledge graph. Many knowledge graphs are represented in the Resource Description Framework (RDF), a W3C standard for representing metadata about web resources. Because of the characteristics of RDF, existing RDF stores suffer from processing time overhead when converting and storing large amounts of RDF data. To resolve these limitations, in this paper we propose a method that uses MapReduce to compress and convert large amounts of RDF data into integer IDs and then vertically partitions and stores them. Our proposed method demonstrated a performance improvement of up to 25.2 times compared to RDF-3X and up to 3.7 times compared to H2RDF+.
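
The two steps named above, converting RDF terms to integer IDs and vertically partitioning the result, can be sketched in Python. This is a single-process illustration under assumptions: IDs are assigned sequentially in order of first appearance, and vertical partitioning is taken to mean one (subject, object) table per predicate. The paper's MapReduce job layout is not reproduced here.

```python
from collections import defaultdict


def encode_triples(triples):
    """Replace each RDF term with a compact integer ID (dictionary encoding)."""
    ids = {}

    def term_id(term):
        # Assign the next free ID the first time a term is seen.
        return ids.setdefault(term, len(ids) + 1)

    encoded = [(term_id(s), term_id(p), term_id(o)) for s, p, o in triples]
    return ids, encoded


def vertical_partition(encoded):
    """Group encoded triples into one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in encoded:
        tables[p].append((s, o))
    return dict(tables)


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:age", "30"),
]
ids, encoded = encode_triples(triples)
tables = vertical_partition(encoded)
```

Long IRIs are stored once in the dictionary and referenced by integers elsewhere, which is where the compression comes from; the per-predicate tables then let a query touch only the predicates it mentions.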

Conversion of Large RDF Data using Hash-based ID Mapping Tables with MapReduce Jobs (맵리듀스 잡을 사용한 해시 ID 매핑 테이블 기반 대량 RDF 데이터 변환 방법)

  • Kim, InA;Lee, Kyu-Chul
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.236-239 / 2021
  • With the growth of AI technology, the scale of Knowledge Graphs continues to expand. Knowledge Graphs are mainly expressed as RDF representations that consist of connected triples. Many RDF stores compress and transform RDF triples into condensed IDs. However, transforming RDF triples at large scale incurs high processing time and memory overhead because a large ID mapping table must be searched. In this paper, we propose a method of converting RDF triples using hash-based ID mapping tables with MapReduce, a software framework for parallel, distributed processing. Our proposed method not only transforms RDF triples into integer-based IDs, but also improves the conversion speed and memory overhead. In our experiment with the proposed method on LUBM, the size of the dataset was reduced by a factor of about 3.8 and the conversion took about 106 seconds.

A Comparative Study on the Noise Exposed Population for Noise Map Types (소음지도 형태에 따른 소음노출인구 비교 연구)

  • Park, In Sun;Park, Jae Sik;Park, Sang Kyu
    • Transactions of the Korean Society for Noise and Vibration Engineering / v.23 no.2 / pp.99-104 / 2013
  • Assessment of the noise-exposed population checks the environmental noise level and its social influence in order to reduce risks such as annoyance and disturbance caused by environmental noise. This method also suggests priority noise abatement policies and action plans by accurately identifying the areas where noise harms human health. Recently, noise maps, which can predict noise over wide areas, have been used for the assessment of the noise-exposed population, departing from methods that use existing measures. In particular, noise countermeasures can be considered more effectively by using assessment methods of the noise-exposed population for specific noise levels, areas, and building types, which are the main input factors in noise maps. In this study, assessment methods of the noise-exposed population using a 2-dimensional noise map are compared with those using a 3-dimensional noise map.

Adaptive Self Organizing Feature Map (적응적 자기 조직화 형상지도)

  • Lee, Hyung-Jun;Kim, Soon-Hyob
    • The Journal of the Acoustical Society of Korea / v.13 no.6 / pp.83-90 / 1994
  • In this paper, we propose a new learning algorithm, ASOFM (Adaptive Self Organizing Feature Map), to overcome the defects of Kohonen's Self Organizing Feature Map. Kohonen's algorithm can become trapped in local minima depending on the initial weights. The proposed algorithm uses an objective function that evaluates the state of the network during learning and adjusts the learning rate adaptively according to that evaluation. As a result, the state of the network is guaranteed to converge to the global minimum, and the algorithm is capable of generalized learning through this adaptation. The learning time of our algorithm is reduced to about 30% of that of Kohonen's.

Reduction of GPS Latency Using RTK GPS/GNSS Correction and Map Matching in a Car Navigation System

  • Kim, Hyo Joong;Lee, Won Hee;Yu, Ki Yun
    • Journal of Korean Society for Geospatial Information Science / v.24 no.2 / pp.37-46 / 2016
  • The difference between the definition time of GPS (Global Positioning System) position data and the actual display time of car positions on a map can reduce the accuracy of car positions displayed in a PND (Portable Navigation Device)-type CNS (Car Navigation System). Because of this time difference, the position of the car displayed on the map is not its current position, so an improved method to fix this problem is required. A method that uses predicted future positions to compensate for the delay caused by processing and displaying the received GPS signals is expected to mitigate the problem. Therefore, in this study, an analysis was conducted to correct the late processing of map positions by map matching, using a Kalman filter with only GPS position data and an RRF (Road Reduction Filter) technique in a light-weight CNS. The effects on routing services are examined by analyzing differences decomposed into along-the-road and across-the-road components relative to the direction of the advancing car. The results indicate that it is possible to improve the positional accuracy in the along-the-road direction of a light-weight CNS device that uses only GPS position data by applying a Kalman filter and RRF.
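
The idea of compensating display latency with a predicted position can be sketched with a small Kalman filter in Python. This is a simplified illustration under assumptions, not the study's implementation: it tracks a one-dimensional along-the-road position with a constant-velocity model, uses position-only measurements, and picks the noise parameters `q` and `r` arbitrarily. The RRF step is not modeled.

```python
def kalman_step(x, v, P, z, dt, q=0.1, r=4.0):
    """One predict/update cycle; P is the 2x2 covariance as (p00, p01, p10, p11)."""
    p00, p01, p10, p11 = P
    # Predict with the constant-velocity model F = [[1, dt], [0, 1]].
    x_pred = x + v * dt
    pp00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
    pp01 = p01 + dt * p11
    pp10 = p10 + dt * p11
    pp11 = p11 + q
    # Update with a position-only measurement z (H = [1, 0]).
    s = pp00 + r
    k0, k1 = pp00 / s, pp10 / s
    residual = z - x_pred
    x_new = x_pred + k0 * residual
    v_new = v + k1 * residual
    P_new = ((1 - k0) * pp00, (1 - k0) * pp01,
             pp10 - k1 * pp00, pp11 - k1 * pp01)
    return x_new, v_new, P_new


def track(measurements, dt):
    """Filter a sequence of position fixes and return the final (position, velocity)."""
    x, v, P = measurements[0], 0.0, (100.0, 0.0, 0.0, 100.0)
    for z in measurements[1:]:
        x, v, P = kalman_step(x, v, P, z, dt)
    return x, v


def compensate_latency(x, v, latency):
    """Extrapolate the filtered position forward by the known display latency."""
    return x + v * latency
```

The filter supplies a velocity estimate even though only positions are measured, and that estimate is what allows the displayed position to be pushed forward by the processing delay.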