• Title/Summary/Keyword: Hadoop System


MapReduce-Based Partitioner Big Data Analysis Scheme for Processing Rate of Log Analysis (로그 분석 처리율 향상을 위한 맵리듀스 기반 분할 빅데이터 분석 기법)

  • Lee, Hyeopgeon;Kim, Young-Woon;Park, Jiyong;Lee, Jin-Woo
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.5 / pp.593-600 / 2018
  • Owing to the advancement of the Internet and smart devices, access to various media such as social media has become easy, and a large amount of big data is being produced. In particular, companies that provide Internet services analyze this big data with MapReduce-based techniques to investigate customer preferences and patterns and to strengthen security. However, in MapReduce, when the number of reducer objects generated in the reduce stage is fixed at one, the processing rate of the analysis decreases. This paper therefore proposes a MapReduce-based split big data analysis method to improve the log analysis processing rate. The proposed method separates the reducer partitioning stage from the analysis result combining stage and generates the number of reducer objects dynamically, which reduces the bottleneck and improves the big data processing rate.
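
The split between a partitioning stage and a combining stage described in this abstract can be illustrated with a minimal in-process sketch. This is not the authors' implementation and does not use the Hadoop API. The function names and the sizing rule in `choose_num_reducers` are assumptions made for illustration, since the abstract does not state how the reducer count is chosen.

```python
from collections import Counter, defaultdict
import zlib


def choose_num_reducers(num_records, records_per_reducer=2):
    # Hypothetical sizing rule (not from the paper): one reducer per
    # `records_per_reducer` mapped records, and never fewer than one.
    return max(1, -(-num_records // records_per_reducer))


def partition(pairs, num_reducers):
    # Partitioning stage: route each key to exactly one reducer bucket.
    buckets = defaultdict(list)
    for key, value in pairs:
        # crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index].append((key, value))
    return buckets


def reduce_partition(pairs):
    # Each reducer sums the values of the keys it received.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def analyze(log_lines, records_per_reducer=2):
    # Map stage: emit (first field of the log line, 1); skip blank lines.
    mapped = [(line.split()[0], 1) for line in log_lines if line.strip()]
    num_reducers = choose_num_reducers(len(mapped), records_per_reducer)
    partials = [reduce_partition(bucket)
                for bucket in partition(mapped, num_reducers).values()]
    # Combining stage: merge the partial results of all reducers.
    combined = Counter()
    for partial in partials:
        combined.update(partial)
    return num_reducers, dict(combined)


if __name__ == "__main__":
    logs = ["10.0.0.1 GET /a", "10.0.0.2 GET /b", "10.0.0.1 GET /c"]
    print(analyze(logs))
```

Because every key is routed to a single bucket, the combining stage only has to merge disjoint partial counts, which is what lets the reducer count vary without changing the result.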

Handling Streaming Data by Using Open Source Framework Storm in IoT Environment (오픈소스 프레임워크 Storm을 활용한 IoT 환경 스트리밍 데이터 처리)

  • Kang, Yunhee
    • KIPS Transactions on Software and Data Engineering / v.5 no.7 / pp.313-318 / 2016
  • To utilize sensory data, it is necessary to design an architecture for processing and handling data generated from sensors in an IoT environment. In such an environment, a thing connects to the Internet and enables a device to communicate efficiently with diverse sensors. Hadoop and Twister, which are based on MapReduce, are good at handling data in batch processing, but they are limited in processing stream data from a sensor in motion. Traditional streaming data processing has mainly relied on message queuing systems based on MoM (message-oriented middleware), which have maintainability and scalability problems because the programmer must consider the details of complex messaging flows. In this paper, an architecture is designed to handle aggregated sensory data. The designed software architecture is used to operate an application on the open source framework Storm. The application is conceptually used to transform, in a pipe-filter style, streaming data aggregated via a sensor gateway.
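
The pipe-filter style mentioned at the end of this abstract can be sketched with plain Python generators. This is an illustration of the style only, not the Storm API and not the paper's application. The `sensor_id,value` input format and the filter names are assumptions.

```python
def parse(lines):
    # Filter 1: split "sensor_id,value" text into a (sensor_id, raw) pair.
    for line in lines:
        sensor_id, _, raw = line.partition(",")
        yield sensor_id.strip(), raw.strip()


def drop_invalid(readings):
    # Filter 2: keep only readings whose value parses as a number.
    for sensor_id, raw in readings:
        try:
            yield sensor_id, float(raw)
        except ValueError:
            continue


def to_fahrenheit(readings):
    # Filter 3: transform each Celsius reading to Fahrenheit.
    for sensor_id, celsius in readings:
        yield sensor_id, celsius * 9 / 5 + 32


def pipeline(source, *filters):
    # Pipe: feed the output stream of each filter into the next one.
    stream = source
    for stage in filters:
        stream = stage(stream)
    return stream


if __name__ == "__main__":
    gateway = ["t1, 20", "t2, bad", "t3, 100"]
    for reading in pipeline(gateway, parse, drop_invalid, to_fahrenheit):
        print(reading)
```

Each filter consumes and produces a stream lazily, so readings flow through one at a time instead of being collected into a batch first.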

Framework for Efficient Web Page Prediction using Deep Learning

  • Kim, Kyung-Chang
    • Journal of the Korea Society of Computer and Information / v.25 no.12 / pp.165-172 / 2020
  • Recently, due to the exponential growth of access information on the web, the importance of predicting a user's next web page has been increasing. One method that can be used for this prediction is deep learning. To predict the next web page, web logs are analyzed through data preprocessing, and a deep learning algorithm then predicts the user's next web page from the output of the analyzed web logs. In this paper, we propose a framework for web page prediction that includes methods for web log preprocessing followed by deep learning techniques for web prediction. To speed up the preprocessing of large web logs, a Hadoop-based MapReduce programming model is used. In addition, we present a web prediction system that applies an efficient deep learning technique to the output of web log preprocessing for training and prediction. Through experiments, we show the performance improvement of our proposed method over traditional methods. We also show the accuracy of our prediction.

A Design of the Small File Grouping System Based on Naive Bayesian Classifier Model (나이브 베이지안 분류기 모델 기반의 소용량 파일 그룹화 시스템 설계)

  • Kim, Min-Jae;Kim, Kyung-Tae;Youn, Hee-Young
    • Proceedings of the Korean Society of Computer Information Conference / 2014.07a / pp.221-222 / 2014
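  • With the rapid growth of the web, interest in platform technologies that can process large volumes of data effectively is increasing. In particular, HDFS has attracted attention as an ideal distributed file system and was developed to process large files. However, small files account for a high proportion of real-world file collections, and a large number of small files is a critical cause of HDFS performance degradation. When many small files are stored in HDFS, the memory consumption of the NameNode increases, and because a large number of small files requires many DataNodes and NameNodes, processing takes relatively long. Therefore, in this paper, we design a file grouping system that applies the naive Bayesian classifier algorithm to improve the storage and access efficiency of small files in HDFS.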


Research of Performance Interference Control Technique for Heterogeneous Services in Bigdata Platform (빅데이터 플랫폼에서 이종 서비스간 성능 간섭 현상 제어에 관한 연구)

  • Jin, Kisung;Lee, Sangmin;Kim, Youngkyun
    • KIISE Transactions on Computing Practices / v.22 no.6 / pp.284-289 / 2016
  • In the Hadoop-based Big Data analysis model, data movement between the legacy system and the analysis system is difficult to avoid. To overcome this problem, a unified Big Data file system is introduced so that a single platform can support both the legacy service and the analysis service. However, avoiding the performance degradation caused by interference between the two services remains a major challenge. To solve this problem, we first performed a real-life simulation and observed resource utilization, workload characteristics, and the I/O balance level. Based on this analysis, we propose two solutions, one at the system level and one at the technical level. At the system level, we divide the I/O path into a legacy I/O path and an analysis I/O path. At the technical level, we introduce an aggressive prefetch method for the analysis service, which requires sequential reads. We also present experimental results that show an outstanding performance gain compared with the previous system.

Performance Comparison of Spatial Split Algorithms for Spatial Data Analysis on Spark (Spark 기반 공간 분석에서 공간 분할의 성능 비교)

  • Yang, Pyoung Woo;Yoo, Ki Hyun;Nam, Kwang Woo
    • Journal of Korean Society for Geospatial Information Science / v.25 no.1 / pp.29-36 / 2017
  • In this paper, we implement a spatial big data analysis prototype based on Spark, which is an in-memory system, and compare the performance of spatial split algorithms on it. In cluster computing environments, big data is divided into blocks of a certain size in order to balance the computing load. Existing research showed that, for Hadoop-based spatial big data systems, spatial splitting is more effective than general sequential splitting. A Hadoop-based spatial data system stores raw data as it is in spatially divided blocks. In the proposed Spark-based spatial analysis system, by contrast, spatial data is converted into an in-memory data structure and stored in a spatial block for search efficiency. Therefore, we propose an in-memory spatial big data prototype and a spatial split block storage method. We also compare the performance of existing spatial split algorithms in the proposed prototype and present an appropriate spatial split strategy for the Spark-based big data system. In the experiment, we compared the query execution times of the spatial split algorithms and confirmed that the BSP algorithm shows the best performance.
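
As a rough illustration of spatial splitting, the sketch below assumes that BSP here refers to binary space partitioning, which the abstract does not spell out. It recursively halves a set of 2D points at the median of the wider axis until every block holds at most `capacity` points. The function name and the median rule are assumptions and do not come from the paper or from Spark.

```python
def bsp_split(points, capacity):
    """Split (x, y) points into spatial blocks of at most `capacity` points."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(points) <= capacity:
        return [list(points)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Cut across the axis with the larger extent (assumed heuristic).
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    # Recurse on the two halves; each half is strictly smaller than the input.
    return bsp_split(ordered[:mid], capacity) + bsp_split(ordered[mid:], capacity)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 5), (10, 5)]
    for block in bsp_split(pts, capacity=2):
        print(block)
```

A sequential split would instead cut the input in arrival order, so nearby points could land in different blocks. Splitting by coordinates keeps spatial neighbors together, which is the property a spatial query benefits from.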

Learning System for Big Data Analysis based on the Raspberry Pi Board (라즈베리파이 보드 기반의 빅데이터 분석을 위한 학습 시스템)

  • Kim, Young-Geun;Jo, Min-Hui;Kim, Won-Jung
    • The Journal of the Korea institute of electronic communication sciences / v.11 no.4 / pp.433-440 / 2016
  • To construct a system for big data processing, one needs either to configure nodes by connecting multiple computers with network equipment or to establish a cloud environment through virtual hosts on a single computer. However, constructing a big data analysis system is subject to many restrictions, including complex system configuration and cost. These constraints are becoming a major obstacle to training professional manpower in big data, which is emerging as one of the most important factors in national competitiveness. To support such training, this paper proposes a Raspberry Pi board based educational big data processing system that enables practical training at an affordable price.

A Design of Satisfaction Analysis System For Content Using Opinion Mining of Online Review Data (온라인 리뷰 데이터의 오피니언마이닝을 통한 콘텐츠 만족도 분석 시스템 설계)

  • Kim, MoonJi;Song, EunJeong;Kim, YoonHee
    • Journal of Internet Computing and Services / v.17 no.3 / pp.107-113 / 2016
  • Following the recent growth in the use of social networks, a vast number of varied online reviews is being created. These reviews, which provide feedback on content, are used as sources of valuable information by both content users and content providers. With the increasing importance of online reviews, studies on opinion mining, which analyzes online reviews to extract the writer's opinions, evaluations, attitudes, and emotions, have been increasing. However, previous sentiment analysis techniques for opinion mining focus only on classifying reviews as positive or negative and do not analyze in detail the user's satisfaction or the grounds for the sentiment. In addition, previous designs applied to only one content domain, such as products or movies, and could not be applied to content from a different domain. This paper suggests a sentiment analysis technique that can analyze the satisfaction expressed in online reviews in detail and extract detailed information on the satisfaction level. The proposed technique can analyze not only a single content domain but also a variety of content from different domains. In addition, we design a system based on Hadoop to process vast amounts of data quickly and efficiently. Through the proposed system, both users and content providers will be able to receive feedback information more clearly and in more detail. Consequently, potential users can make effective decisions, and content providers can quickly apply users' responses when developing marketing strategy, in contrast to older survey-based methods. Moreover, the system is expected to be of practical use in various fields that require user comments.

SPARQL Query Processing System over Scalable Triple Data using SparkSQL Framework (SparQLing : SparkSQL 기반 대용량 트리플 데이터를 위한 SPARQL 질의 시스템 구축)

  • Jeon, MyungJoong;Hong, JinYoung;Park, YoungTack
    • Journal of KIISE / v.43 no.4 / pp.450-459 / 2016
  • RDFS data grows in scale every year; hence, SPARQL processing needs to change to keep queries fast. SPARQL query processing has been studied using scalable distributed processing frameworks. Current studies indicate that a query engine based on a scalable distributed processing framework such as Hadoop (MapReduce) is not suitable for real-time processing because of its repetitive tasks; in addition, it is difficult to construct a query engine on an in-memory distributed query engine, because the low-level distributed structure must be considered. In this paper, we propose a method to construct a query engine that improves query processing speed on massive triple data. The query engine processes SPARQL queries using SparkSQL, an in-memory distributed query processing framework. SparkSQL is a high-level distributed query engine that supports existing SQL statements. To process a SPARQL query, the Algebra Tree generated using Jena must be translated into a Spark Algebra Tree for use in the Spark system, from which the system generates the SparkSQL query. Furthermore, we propose a triple property table design based on DataFrame for more efficient query processing in the Spark system. Finally, we verify the validity of the approach through a comparative evaluation against a query engine built on an existing distributed processing framework.
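
The idea of answering a SPARQL pattern with SQL over a property table can be sketched as follows. This is a simplified illustration, not the paper's translator. It uses Python's built-in `sqlite3` in place of SparkSQL so that it runs without a cluster, it skips the Jena algebra tree entirely, and it handles only triple patterns that share one subject variable. The table layout (one column per predicate) and the function `bgp_to_sql` are assumptions.

```python
import sqlite3


def bgp_to_sql(table, patterns):
    """Translate (predicate, object) patterns on one shared subject to SQL.

    An object starting with '?' is a variable to return; anything else is a
    constant to match. Names are checked because they are spliced into SQL.
    """
    if not table.isidentifier():
        raise ValueError("invalid table name")
    select_cols, where, params = ["subject"], [], []
    for predicate, obj in patterns:
        if not predicate.isidentifier():
            raise ValueError("invalid predicate name")
        if obj.startswith("?"):
            if not obj[1:].isidentifier():
                raise ValueError("invalid variable name")
            select_cols.append(f"{predicate} AS {obj[1:]}")
            # A variable only binds when the subject has this property.
            where.append(f"{predicate} IS NOT NULL")
        else:
            where.append(f"{predicate} = ?")
            params.append(obj)
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (subject TEXT, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                     [("p1", "Kim", "Seoul"), ("p2", "Lee", "Busan")])
    # Roughly: SELECT ?s ?n WHERE { ?s :city "Seoul" . ?s :name ?n }
    sql, params = bgp_to_sql("person", [("city", "Seoul"), ("name", "?n")])
    print(sql)
    print(conn.execute(sql, params).fetchall())
```

With one column per predicate, patterns that share a subject are resolved by a single row scan rather than by self-joins on a three-column triple table, which is the usual motivation for a property table.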

Development of Smart Healthcare Wear System for Acquiring Vital Signs and Monitoring Personal Health (생체신호 습득과 건강 모니터링을 위한 스마트 헬스케어 의복 개발)

  • Joo, Moon-Il;Ko, Dong-Hee;Kim, Hee-Cheol
    • Journal of Korea Multimedia Society / v.19 no.5 / pp.808-817 / 2016
  • Recently, wearable computing technology with bio-sensors has developed rapidly and is used in areas such as personal health, care-giving for senior citizens who live alone, and sports activities. In particular, wearable computing equipment that measures vital signs by means of digital yarns and bio-sensors is notable. Wearable computing devices help users monitor and manage their health in daily life through customized healthcare services. In this paper, we suggest a system for monitoring and analyzing vital signs using smart healthcare clothing with bio-sensors. The vital signs continuously acquired from the clothing are unstructured data, and their volume is large enough to be treated as big data. The vital signs are stored in the Hadoop Distributed File System (HDFS), and a data warehouse for analyzing them can be built on HDFS. We provide a health monitoring system based on the vital signs acquired by the bio-sensors in the smart healthcare clothing. We implemented a big data platform that provides a health monitoring service to visualize and monitor clinical information and the physical activities performed by users.