• Title/Summary/Keyword: Apache Spark Streaming

Search Result 7, Processing Time 0.019 seconds

Distributed Moving Objects Management System for a Smart Black Box

  • Lee, Hyunbyung;Song, Seokil
    • International Journal of Contents
    • /
    • v.14 no.1
    • /
    • pp.28-33
    • /
    • 2018
  • In this paper, we design and implement a distributed, moving objects management system for processing locations and sensor data from smart black boxes. The proposed system is designed and implemented based on Apache Kafka, Apache Spark & Spark Streaming, Hbase, HDFS. Apache Kafka is used to collect the data from smart black boxes and queries from users. Received location data from smart black boxes and queries from users becomes input of Apache Spark Streaming. Apache Spark Streaming preprocesses the input data for indexing. Recent location data and indexes are stored in-memory managed by Apache Spark. Old data and indexes are flushed into HBase later. We perform experiments to show the throughput of the index manager. Finally, we describe the implementation detail in Scala function level.

Real-Time Stock Price Prediction using Apache Spark (Apache Spark를 활용한 실시간 주가 예측)

  • Dong-Jin Shin;Seung-Yeon Hwang;Jeong-Joon Kim
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.23 no.4
    • /
    • pp.79-84
    • /
    • 2023
  • Apache Spark, which provides the fastest processing speed among recent distributed and parallel processing technologies, provides real-time functions and machine learning functions. Although official documentation guides for these functions are provided, a method for fusion of functions to predict a specific value in real time is not provided. Therefore, in this paper, we conducted a study to predict the value of data in real time by fusion of these functions. The overall configuration is collected by downloading stock price data provided by the Python programming language. And it creates a model of regression analysis through the machine learning function, and predicts the adjusted closing price among the stock price data in real time by fusing the real-time streaming function with the machine learning function.

An Abnormal Worker Movement Detection System Based on Data Stream Processing and Hierarchical Clustering

  • Duong, Dat Van Anh;Lan, Doi Thi;Yoon, Seokhoon
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.14 no.4
    • /
    • pp.88-95
    • /
    • 2022
  • Detecting anomalies in human movement is an important task in industrial applications, such as monitoring industrial disasters or accidents and recognizing unauthorized factory intruders. In this paper, we propose an abnormal worker movement detection system based on data stream processing and hierarchical clustering. In the proposed system, Apache Spark is used for streaming the location data of people. A hierarchical clustering-based anomalous trajectory detection algorithm is designed for detecting anomalies in human movement. The algorithm is integrated into Apache Spark for detecting anomalies from location data. Specifically, the location information is streamed to Apache Spark using the message queuing telemetry transport protocol. Then, Apache Spark processes and stores location data in a data frame. When there is a request from a client, the processed data in the data frame is taken and put into the proposed algorithm for detecting anomalies. A real mobility trace of people is used to evaluate the proposed system. The obtained results show that the system has high performance and can be used for a wide range of industrial applications.

Processing large-scale data with Apache Spark (Apache Spark를 활용한 대용량 데이터의 처리)

  • Ko, Seyoon;Won, Joong-Ho
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.6
    • /
    • pp.1077-1094
    • /
    • 2016
  • Apache Spark is a fast and general-purpose cluster computing package. It provides a new abstraction named resilient distributed dataset, which is capable of support for fault tolerance while keeping data in memory. This type of abstraction results in a significant speedup compared to legacy large-scale data framework, MapReduce. In particular, Spark framework is suitable for iterative machine learning applications such as logistic regression and K-means clustering, and interactive data querying. Spark also supports high level libraries for various applications such as machine learning, streaming data processing, database querying and graph data mining thanks to its versatility. In this work, we introduce the concept and programming model of Spark as well as show some implementations of simple statistical computing applications. We also review the machine learning package MLlib, and the R language interface SparkR.

Performance Optimization Strategies for Fully Utilizing Apache Spark (아파치 스파크 활용 극대화를 위한 성능 최적화 기법)

  • Myung, Rohyoung;Yu, Heonchang;Choi, Sukyong
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.7 no.1
    • /
    • pp.9-18
    • /
    • 2018
  • Enhancing performance of big data analytics in distributed environment has been issued because most of the big data related applications such as machine learning techniques and streaming services generally utilize distributed computing frameworks. Thus, optimizing performance of those applications at Spark has been actively researched. Since optimizing performance of the applications at distributed environment is challenging because it not only needs optimizing the applications themselves but also requires tuning of the distributed system configuration parameters. Although prior researches made a huge effort to improve execution performance, most of them only focused on one of three performance optimization aspect: application design, system tuning, hardware utilization. Thus, they couldn't handle an orchestration of those aspects. In this paper, we deeply analyze and model the application processing procedure of the Spark. Through the analyzed results, we propose performance optimization schemes for each step of the procedure: inner stage and outer stage. We also propose appropriate partitioning mechanism by analyzing relationship between partitioning parallelism and performance of the applications. We applied those three performance optimization schemes to WordCount, Pagerank, and Kmeans which are basic big data analytics and found nearly 50% performance improvement when all of those schemes are applied.

Squall: A Real-time Big Data Processing Framework based on TMO Model for Real-time Events and Micro-batch Processing (Squall: 실시간 이벤트와 마이크로-배치의 동시 처리 지원을 위한 TMO 모델 기반의 실시간 빅데이터 처리 프레임워크)

  • Son, Jae Gi;Kim, Jung Guk
    • Journal of KIISE
    • /
    • v.44 no.1
    • /
    • pp.84-94
    • /
    • 2017
  • Recently, the importance of velocity, one of the characteristics of big data (5V: Volume, Variety, Velocity, Veracity, and Value), has been emphasized in the data processing, which has led to several studies on the real-time stream processing, a technology for quick and accurate processing and analyses of big data. In this paper, we propose a Squall framework using Time-triggered Message-triggered Object (TMO) technology, a model that is widely used for processing real-time big data. Moreover, we provide a description of Squall framework and its operations under a single node. TMO is an object model that supports the non-regular real-time processing method for certain conditions as well as regular periodic processing for certain amount of time. A Squall framework can support the real-time event stream of big data and micro-batch processing with outstanding performances, as compared to Apache storm and Spark Streaming. However, additional development for processing real-time stream under multiple nodes that is common under most frameworks is needed. In conclusion, the advantages of a TMO model can overcome the drawbacks of Apache storm or Spark Streaming in the processing of real-time big data. The TMO model has potential as a useful model in real-time big data processing.

Combined time bound optimization of control, communication, and data processing for FSO-based 6G UAV aerial networks

  • Seo, Seungwoo;Ko, Da-Eun;Chung, Jong-Moon
    • ETRI Journal
    • /
    • v.42 no.5
    • /
    • pp.700-711
    • /
    • 2020
  • Because of the rapid increase of mobile traffic, flexible broadband supportive unmanned aerial vehicle (UAV)-based 6G mobile networks using free space optical (FSO) links have been recently proposed. Considering the advancements made in UAVs, big data processing, and artificial intelligence precision control technologies, the formation of an additional wireless network based on UAV aerial platforms to assist the existing fixed base stations of the mobile radio access network is considered a highly viable option in the near future. In this paper, a combined time bound optimization scheme is proposed that can adaptively satisfy the control and communication time constraints as well as the processing time constraints in FSO-based 6G UAV aerial networks. The proposed scheme controls the relation between the number of data flows, input data rate, number of worker nodes considering the time bounds, and the errors that occur during communication and data processing. The simulation results show that the proposed scheme is very effective in satisfying the time constraints for UAV control and radio access network services, even when errors in communication and data processing may occur.