• Title/Summary/Keyword: Spark Streaming

Search Result 19, Processing Time 0.038 seconds

Techniques to Guarantee Real-Time Fault Recovery in Spark Streaming Based Cloud System (Spark Streaming 기반 클라우드 시스템에서 실시간 고장 복구를 지원하기 위한 기법들)

  • Kim, Jungho;Park, Daedong;Kim, Sangwook;Moon, Yongshik;Hong, Seongsoo
    • Journal of KIISE
    • /
    • v.44 no.5
    • /
    • pp.460-468
    • /
    • 2017
  • In a real-time cloud environment, the data analysis framework plays a pivotal role. Spark Streaming meets most real-time requirements among existing frameworks. However, the framework does not meet the second scale real-time fault recovery requirement. Spark Streaming fault recovery time increases in proportion to the transformation history length called lineage. This is because it recovers the last state data based on the cumulative lineage recorded during normal operation. Therefore, fault recovery time is not bounded within a limited time. In addition, it is impossible to achieve a second-scale fault recovery time because it costs tens of seconds to read initial state data from fault-tolerant storage. In this paper, we propose two techniques to solve the problems mentioned above. We apply the proposed techniques to Spark Streaming 1.6.2. Experimental results show that the fault recovery time is bounded and the average fault recovery time is reduced by up to 41.57%.

mproved Spark Streaming Scheduling Mechanism for Real-time Fault Recovery (장애 복구 응답성 향상을 위한 Spark Streaming 스케줄링 개선 메커니즘)

  • Hwang, Yongha;Noh, Soonhyun
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2018.07a
    • /
    • pp.3-6
    • /
    • 2018
  • 최근 방대한 양의 스트림 데이터가 생산되면서 이를 실시간으로 처리하기 위한 프레임워크가 등장하였으며, 오픈 소스 영역에서 Spark Streaming이 주목받고 있다. Spark Streaming은 분산 환경에서 성능 향상을 위해 지연 스케줄링을 기반으로 응용을 수행하지만, 장애 발생 시 사용 가능한 태스크 슬롯을 빠르게 할당받지 못할 경우 장애 복구 시간이 지연될 수 있다는 문제점이 있다. 이러한 문제점을 해결하기 위해 본 논문에서는 실행자의 태스크 슬롯 보장을 통해 대기 시간 없이 즉시 할당할 수 있도록 하는 개선 메커니즘을 제안하였고, 실험 결과 장애 복구 응답성이 39.14% 개선됨을 확인하였다.

  • PDF

Distributed Moving Objects Management System for a Smart Black Box

  • Lee, Hyunbyung;Song, Seokil
    • International Journal of Contents
    • /
    • v.14 no.1
    • /
    • pp.28-33
    • /
    • 2018
  • In this paper, we design and implement a distributed, moving objects management system for processing locations and sensor data from smart black boxes. The proposed system is designed and implemented based on Apache Kafka, Apache Spark & Spark Streaming, Hbase, HDFS. Apache Kafka is used to collect the data from smart black boxes and queries from users. Received location data from smart black boxes and queries from users becomes input of Apache Spark Streaming. Apache Spark Streaming preprocesses the input data for indexing. Recent location data and indexes are stored in-memory managed by Apache Spark. Old data and indexes are flushed into HBase later. We perform experiments to show the throughput of the index manager. Finally, we describe the implementation detail in Scala function level.

Continuos Query Method for Moving Objects using Grid Index based on Spark Streaming (Spark Streaming 기반의 그리드 색인을 이용하는 이동객체를 위한 연속 질의 기법)

  • Choi, Do-jin;Song, Seokil
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2015.05a
    • /
    • pp.67-68
    • /
    • 2015
  • 이 논문에서는 Spark Stream의 Discretized Streams 모델을 기반의 그리드 인덱스를 제안하고, 이를 이용한 이동객체를 위한 연속질의 기법을 제안한다. 제안하는 연속질의 처리 방법은 Spark 의 RDD 모델을 이용하여 그리드 색인 및 연속질의 목록을 구현하여, 시스템 고장 시에도 빠르게 복구할 수 있는 내 장애성을 확보 하였다.

  • PDF

Real-Time Stock Price Prediction using Apache Spark (Apache Spark를 활용한 실시간 주가 예측)

  • Dong-Jin Shin;Seung-Yeon Hwang;Jeong-Joon Kim
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.23 no.4
    • /
    • pp.79-84
    • /
    • 2023
  • Apache Spark, which provides the fastest processing speed among recent distributed and parallel processing technologies, provides real-time functions and machine learning functions. Although official documentation guides for these functions are provided, a method for fusion of functions to predict a specific value in real time is not provided. Therefore, in this paper, we conducted a study to predict the value of data in real time by fusion of these functions. The overall configuration is collected by downloading stock price data provided by the Python programming language. And it creates a model of regression analysis through the machine learning function, and predicts the adjusted closing price among the stock price data in real time by fusing the real-time streaming function with the machine learning function.

High Rate Denial-of-Service Attack Detection System for Cloud Environment Using Flume and Spark

  • Gutierrez, Janitza Punto;Lee, Kilhung
    • Journal of Information Processing Systems
    • /
    • v.17 no.4
    • /
    • pp.675-689
    • /
    • 2021
  • Nowadays, cloud computing is being adopted for more organizations. However, since cloud computing has a virtualized, volatile, scalable and multi-tenancy distributed nature, it is challenging task to perform attack detection in the cloud following conventional processes. This work proposes a solution which aims to collect web server logs by using Flume and filter them through Spark Streaming in order to only consider suspicious data or data related to denial-of-service attacks and reduce the data that will be stored in Hadoop Distributed File System for posterior analysis with the frequent pattern (FP)-Growth algorithm. With the proposed system, we can address some of the difficulties in security for cloud environment, facilitating the data collection, reducing detection time and consequently enabling an almost real-time attack detection.

An Abnormal Worker Movement Detection System Based on Data Stream Processing and Hierarchical Clustering

  • Duong, Dat Van Anh;Lan, Doi Thi;Yoon, Seokhoon
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.14 no.4
    • /
    • pp.88-95
    • /
    • 2022
  • Detecting anomalies in human movement is an important task in industrial applications, such as monitoring industrial disasters or accidents and recognizing unauthorized factory intruders. In this paper, we propose an abnormal worker movement detection system based on data stream processing and hierarchical clustering. In the proposed system, Apache Spark is used for streaming the location data of people. A hierarchical clustering-based anomalous trajectory detection algorithm is designed for detecting anomalies in human movement. The algorithm is integrated into Apache Spark for detecting anomalies from location data. Specifically, the location information is streamed to Apache Spark using the message queuing telemetry transport protocol. Then, Apache Spark processes and stores location data in a data frame. When there is a request from a client, the processed data in the data frame is taken and put into the proposed algorithm for detecting anomalies. A real mobility trace of people is used to evaluate the proposed system. The obtained results show that the system has high performance and can be used for a wide range of industrial applications.

Distributed Indexing Methods for Moving Objects based on Spark Stream

  • Lee, Yunsou;Song, Seokil
    • International Journal of Contents
    • /
    • v.11 no.1
    • /
    • pp.69-72
    • /
    • 2015
  • Generally, existing parallel main-memory spatial index structures to avoid the trade-off between query freshness and CPU cost uses light-weight locking techniques. However, still, the lock based methods have some limits such as thrashing which is a well-known problem in lock based methods. In this paper, we propose a distributed index structure for moving objects exploiting the parallelism in multiple machines. The proposed index is a lock free multi-version concurrency technique based on the D-Stream model of Spark Stream. The proposed method exploits the multiversion nature of D-Stream of Spark Streaming.

Processing large-scale data with Apache Spark (Apache Spark를 활용한 대용량 데이터의 처리)

  • Ko, Seyoon;Won, Joong-Ho
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.6
    • /
    • pp.1077-1094
    • /
    • 2016
  • Apache Spark is a fast and general-purpose cluster computing package. It provides a new abstraction named resilient distributed dataset, which is capable of support for fault tolerance while keeping data in memory. This type of abstraction results in a significant speedup compared to legacy large-scale data framework, MapReduce. In particular, Spark framework is suitable for iterative machine learning applications such as logistic regression and K-means clustering, and interactive data querying. Spark also supports high level libraries for various applications such as machine learning, streaming data processing, database querying and graph data mining thanks to its versatility. In this work, we introduce the concept and programming model of Spark as well as show some implementations of simple statistical computing applications. We also review the machine learning package MLlib, and the R language interface SparkR.

Design of Hybrid IDS(Intrusion Detection System) Log Analysis System based on Hadoop and Spark (Hadoop과 Spark를 이용한 실시간 Hybrid IDS 로그 분석 시스템에 대한 설계)

  • Yoo, Ji-Hoon;Yooun, Hosang;Shin, Dongil;Shin, Dongkyoo
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2017.04a
    • /
    • pp.217-219
    • /
    • 2017
  • 나날이 증가하는 해킹의 위협에 따라 이를 방어하기 위한 침임 탐지 시스템과 로그 수집 분야에서 많은 연구가 진행되고 있다. 이러한 연구들로 인해 다양한 종류의 침임 탐지 시스템이 생겨났으며, 이는 다양한 종류의 침입 탐지 시스템에서 서로의 단점을 보안할 필요성이 생기게 되었다. 따라서 본 논문에서는 네트워크 기반인 NIDS(Network-based IDS)와 호스트 기반인 HIDS(Host-based IDS)의 장단점을 가진 Hybrid IDS을 구성하기 위해 NIDS와 HIDS의 로그 데이터 통합을 위해 실시간 로그 처리에 특화된 Kafka를 이용하고, 실시간 분석에 Spark Streaming을 이용하여 통합된 로그를 분석하게 되며, 실시간 전송 도중에 발생되는 데이터 유실에 대해 별도로 저장되는 Hadoop의 HDFS에서는 데이터 유실에 대한 보장을 하는 실시간 Hybrid IDS 분석 시스템에 대한 설계를 제안한다.