• Title/Summary/Keyword: Apache Storm

Reconfiguration of Apache Storm for InfiniBand Communications (InfiniBand RDMA 통신을 위한 Apache Storm의 재구성)

  • Yang, Seokwoo;Son, Siwoon;Moon, Yang-Sae
    • KIPS Transactions on Software and Data Engineering / v.7 no.8 / pp.297-306 / 2018
  • In this paper, we address how to run Apache Storm, a distributed stream processing framework, on InfiniBand, a high-performance communication device. An easy way to run Storm on InfiniBand is to simply use IPoIB (IP over InfiniBand). However, this method places a serious CPU load on each node because of frequent context switches and buffer copies. To solve this problem, we propose a new communication method for Storm that uses InfiniBand's Remote Direct Memory Access (RDMA) function. First, we design and implement RJ-Netty (RDMA/JXIO Netty), a new framework that replaces Netty, the legacy framework, to exploit RDMA functionality. Second, we reimplement the related classes so that Storm can use both the existing Netty and the new RJ-Netty. Third, we extend the JXIO server functionality to support multi-threading and thereby maximize the performance of RJ-Netty. Experimental results show that the proposed RJ-Netty significantly reduces CPU load while improving message throughput compared to IPoIB as well as Ethernet. This paper is the first attempt to run Apache Storm on InfiniBand, and we believe it is an excellent research result that improves the performance of Storm by using InfiniBand RDMA.

Efficient Locality-Aware Traffic Distribution in Apache Storm (Apache Storm에서 지역성을 고려한 효율적인 트래픽 분배)

  • Son, Siwoon;Lee, Sanghun;Moon, Yang-Sae
    • KIISE Transactions on Computing Practices / v.23 no.12 / pp.677-683 / 2017
  • Apache Storm is a representative real-time distributed processing system that processes data streams quickly over distributed servers. Storm currently provides several stream grouping methods to distribute data traffic to multiple servers. Among them, shuffle grouping may cause processing delays, and local-or-shuffle grouping, which is used to solve that problem, may concentrate traffic on a specific node. In this paper, we propose locality-aware grouping to solve the problems that may arise in the existing Storm grouping methods. Experimental results show that the proposed locality-aware grouping is considerably superior to the existing shuffle grouping and local-or-shuffle grouping. These results show that the new grouping is an excellent approach that considers both locality and load balancing, which are limitations of the existing Storm.
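
The trade-off between the two existing groupings named in this abstract can be illustrated with a minimal Python sketch. It is not Storm code and it does not reproduce the paper's locality-aware grouping, whose algorithm the abstract does not describe. The `(task_id, worker_id)` placement and all function names are hypothetical; the only behavior assumed is the commonly documented one, namely that shuffle grouping spreads tuples over all tasks of the target bolt, while local-or-shuffle grouping restricts the choice to tasks in the sender's own worker process when any exist.

```python
import random

def shuffle_candidates(tasks):
    """Shuffle grouping: every task of the target bolt is a candidate."""
    return list(tasks)

def local_or_shuffle_candidates(tasks, sender_worker):
    """Local-or-shuffle grouping: use only tasks in the sender's worker
    process if there are any, otherwise fall back to all tasks."""
    local = [t for t in tasks if t[1] == sender_worker]
    return local if local else list(tasks)

def pick(candidates, rng):
    """Choose one target task at random from the candidate list."""
    return rng.choice(candidates)

# Hypothetical placement of four target tasks: (task_id, worker_id).
tasks = [(1, "w1"), (2, "w2"), (3, "w2"), (4, "w3")]
rng = random.Random(0)

# A sender in worker "w1" always hits task 1 under local-or-shuffle,
# which avoids network hops but can concentrate traffic on one task.
local_target = pick(local_or_shuffle_candidates(tasks, "w1"), rng)

# Under shuffle grouping the same sender may pick any of the four tasks,
# which balances load but usually crosses worker or node boundaries.
any_target = pick(shuffle_candidates(tasks), rng)
```

A grouping that weighs both locality and load would sit between these two extremes; how the paper does so is not stated here.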

Design of InfiniBand RDMA-based Network Structure of Apache Storm (InfiniBand RDMA 기반 Apache Storm의 네트워크 구조 설계)

  • Yang, Seokwoo;Son, Siwoon;Choi, Seong-Yun;Choi, Mi-Jung;Moon, Yang-Sae
    • Proceedings of the Korea Information Processing Society Conference / 2017.11a / pp.679-681 / 2017
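  • Apache Storm is a real-time distributed parallel processing framework for large-scale data streams, and it can run many processes and threads concurrently. However, Storm, which provides this multi-process and multi-thread environment, issues many network system calls, and this can cause CPU overload due to frequent context switches, buffer copies into the operating system, and buffer copies within the operating system. The problem becomes more severe when using IPoIB (IP over InfiniBand) communication on InfiniBand, a high-performance network device, because sending and receiving data that is small relative to the bandwidth InfiniBand supports causes even more frequent context switches and buffer copies. In this paper, we therefore address the CPU overload problem by presenting a design that applies InfiniBand's RDMA (Remote Direct Memory Access) to Storm.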

A Distributed Real-time Self-Diagnosis System for Processing Large Amounts of Log Data (대용량 로그 데이터 처리를 위한 분산 실시간 자가 진단 시스템)

  • Son, Siwoon;Kim, Dasol;Moon, Yang-Sae;Choi, Hyung-Jin
    • Database Research / v.34 no.3 / pp.58-68 / 2018
  • Distributed computing helps to efficiently store and process large data on a cluster of multiple machines. The performance of distributed computing is greatly influenced by the state of the servers constituting the distributed system. In this paper, we propose a self-diagnosis system that collects log data in a distributed system, detects anomalies, and visualizes the results in real time. First, we divide the self-diagnosis process into five stages: collecting, delivering, analyzing, storing, and visualizing. Next, we design a real-time self-diagnosis system that meets the goals of real-time processing, scalability, and high availability. The proposed system is based on Apache Flume, Apache Kafka, and Apache Storm, which are representative real-time distributed techniques. In addition, we use a simple but effective anomaly detection technique based on the moving average and the 3-sigma rule to minimize the delay of log data processing during the self-diagnosis process. With the results of this paper, one can construct a distributed real-time self-diagnosis solution that diagnoses server status in real time in a complicated distributed system.
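
The moving average and 3-sigma idea mentioned in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the window size, the use of the population standard deviation, and the function name `detect_anomalies` are all assumptions. A value is flagged when it lies more than `k` standard deviations from the mean of the preceding window.

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(values, window=5, k=3.0):
    """Return the indices of values that deviate more than k standard
    deviations from the moving average of the preceding `window` values."""
    history = deque(maxlen=window)  # sliding window of recent values
    anomalies = []
    for i, v in enumerate(values):
        # Only test once the window is full, so the statistics are stable.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if abs(v - mu) > k * sigma:
                anomalies.append(i)
        history.append(v)
    return anomalies

# Hypothetical metric samples (for example, CPU load readings from a log).
samples = [10, 11, 10, 9, 10, 50, 10, 11]
flagged = detect_anomalies(samples)
```

Because each value needs only the statistics of a short window, the check runs in constant time per log record, which fits the abstract's goal of minimizing processing delay. Note that in this sketch a flagged value still enters the window and inflates the following statistics; excluding it is a possible variation.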

Monitoring Tools for Efficient Overload Measurements in Apache Kafka (Apache Kafka에서 효율적인 과부하 측정을 위한 모니터링 도구)

  • Bang, Jiwon;Son, Siwoon;Moon, Yang-Sae;Choi, Mi-Jung
    • Proceedings of the Korea Information Processing Society Conference / 2017.11a / pp.52-54 / 2017
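  • Research on real-time data stream processing technologies such as Apache Storm and Apache Spark is active, driven by the need to handle large volumes of data generated rapidly in real time. Most real-time processing technologies are difficult to use on their own and are generally used together with a messaging system for data stream input and output. Apache Kafka is a representative distributed messaging system specialized for delivering large volumes of log data generated in real time. Various performance monitoring tools for Kafka currently exist. Besides the amount of data processed in Kafka, these tools can observe various metrics such as the size of incoming data, the collection rate, and the processing rate. This paper compares the tools provided by Kafka with several open source tools in order to select the most suitable monitoring tool for future research on load shedding in Kafka.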

Storm-Based Dynamic Tag Cloud for Real-Time SNS Data (실시간 SNS 데이터를 위한 Storm 기반 동적 태그 클라우드)

  • Son, Siwoon;Kim, Dasol;Lee, Sujeong;Gil, Myeong-Seon;Moon, Yang-Sae
    • KIPS Transactions on Software and Data Engineering / v.6 no.6 / pp.309-314 / 2017
  • In general, collecting, storing, and analyzing SNS (social network service) data is difficult because the data have big data characteristics: they are generated very quickly and mix structured and unstructured forms. In this paper, we propose a new data visualization framework that runs on Apache Storm and is useful for real-time, dynamic analysis of SNS data. Apache Storm is a representative big data software platform that processes and analyzes real-time streaming data in a distributed environment. Using Storm, we collect and aggregate real-time Twitter data and dynamically visualize the aggregated results as a tag cloud. In addition to the Storm-based collection and aggregation functionality, we also design and implement a Web interface through which a user enters keywords of interest and views the tag cloud visualization related to those keywords. Finally, we show empirically that this study lets users intuitively follow changes in subjects of interest in SNS data, and that the visualized results can be applied to many other services such as thematic trend analysis, product recommendation, and customer needs identification.
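
The aggregation step behind a tag cloud can be sketched as follows. This is a hedged, single-process Python illustration, not the paper's Storm topology: extracting tags with a `#hashtag` regular expression, the linear font-size scaling, and the names `count_tags` and `font_sizes` are assumptions made for the example.

```python
import re
from collections import Counter

def count_tags(messages):
    """Count case-insensitive hashtag occurrences across all messages."""
    counts = Counter()
    for text in messages:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counts

def font_sizes(counts, min_size=10, max_size=40):
    """Map each tag count linearly onto a font size for the tag cloud."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts match
    return {
        tag: min_size + (n - lo) * (max_size - min_size) // span
        for tag, n in counts.items()
    }

# Hypothetical incoming messages.
messages = ["#Storm is fast", "#storm and #kafka", "#Kafka #storm"]
counts = count_tags(messages)
sizes = font_sizes(counts)
```

In a streaming setting the counter would be updated per tuple and the sizes recomputed periodically, so that the cloud changes as new messages arrive.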

Apache Storm based Query Filtering System for Multivariate Data Streams (다변량 데이터 스트림을 위한 아파치 스톰 기반 질의 필터링 시스템)

  • Kim, Youngkuk;Son, Siwoon;Moon, Yang-Sae
    • Proceedings of the Korea Information Processing Society Conference / 2018.10a / pp.561-564 / 2018
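  • Rapidly generated big data streams have recently been used in a variety of fields. Because collecting and processing all of this big data is very uneconomical, a filtering step that extracts only the needed data from the stream is required. In this paper, we build a query filtering system for data streams using Apache Storm. Storm is a real-time distributed parallel processing framework for large-scale data streams. However, Storm requires code modification, redeployment, and restart whenever the input data structure or algorithm changes. To solve this problem, we use Apache Kafka to separate the data collection module from Storm's processing module, which greatly improves system availability. In addition, we implement the system as a web-based client-server model so that users can use the query filtering system anytime and anywhere, and we implement a query parser that automatically analyzes queries entered through the web client so that query filtering can be applied without modifying a separate program.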
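
The idea of parsing a user query at run time, so that the filter changes without redeploying code, can be sketched as below. The query syntax (`field op value` clauses joined by `AND`), the field names, and the functions `parse_query` and `matches` are hypothetical; the paper's actual query language and parser are not described in this abstract.

```python
import operator

# Comparison operators supported by this hypothetical query syntax.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

def parse_query(query):
    """Parse 'field op value AND field op value ...' into conditions."""
    conditions = []
    for clause in query.split(" AND "):
        field, op, value = clause.split()
        conditions.append((field, OPS[op], float(value)))
    return conditions

def matches(record, conditions):
    """Return True if the record satisfies every parsed condition."""
    return all(
        field in record and op(record[field], value)
        for field, op, value in conditions
    )

# Hypothetical multivariate stream records.
stream = [
    {"temperature": 31.5, "humidity": 40.0},
    {"temperature": 25.0, "humidity": 40.0},
    {"temperature": 35.0, "humidity": 70.0},
]
conditions = parse_query("temperature > 30 AND humidity < 50")
passed = [r for r in stream if matches(r, conditions)]
```

Because the conditions are plain data built from the query string, a new query can replace the old one while the stream keeps flowing.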

A Study on The Comparative Performance Analysis of Open Source Web Server Using JMeter (JMeter를 이용한 오픈소스 웹 서버 성능 비교 연구)

  • Yoo, Hyun-Dam;Kim, Yong-Hoon;Song, Chung-Geon;Kim, Hyeong-Eun;Choi, Byung-Jun
    • Proceedings of the Korea Information Processing Society Conference / 2018.10a / pp.2-4 / 2018
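  • In this study, we used JMeter, a web server performance testing program, to compare and analyze the performance of representative open source web servers: Apache, Nginx, Cherokee, Monkey HTTP, and Sand Storm. In the experiments, Lighttpd performed well for small files, Cherokee for medium-sized files, and Nginx for large files. In addition, when the number of clients was increased, Cherokee showed the smallest relative performance degradation and Lighttpd the largest.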

Squall: A Real-time Big Data Processing Framework based on TMO Model for Real-time Events and Micro-batch Processing (Squall: 실시간 이벤트와 마이크로-배치의 동시 처리 지원을 위한 TMO 모델 기반의 실시간 빅데이터 처리 프레임워크)

  • Son, Jae Gi;Kim, Jung Guk
    • Journal of KIISE / v.44 no.1 / pp.84-94 / 2017
  • Recently, the importance of velocity, one of the characteristics of big data (5V: Volume, Variety, Velocity, Veracity, and Value), has been emphasized in data processing, which has led to several studies on real-time stream processing, a technology for quick and accurate processing and analysis of big data. In this paper, we propose the Squall framework, which uses Time-triggered Message-triggered Object (TMO) technology, a model that is widely used for processing real-time big data. Moreover, we describe the Squall framework and its operation on a single node. TMO is an object model that supports irregular real-time processing triggered by certain conditions as well as regular periodic processing over a certain amount of time. The Squall framework can support real-time event streams of big data and micro-batch processing with outstanding performance compared to Apache Storm and Spark Streaming. However, additional development is needed for processing real-time streams on multiple nodes, which most frameworks commonly support. In conclusion, the advantages of the TMO model can overcome the drawbacks of Apache Storm and Spark Streaming in the processing of real-time big data. The TMO model has potential as a useful model in real-time big data processing.

Real Time Distributed Parallel Processing to Visualize Noise Map with Big Sensor Data and GIS Data for Smart Cities (스마트시티의 빅 센서 데이터와 빅 GIS 데이터를 융합하여 실시간 온라인 소음지도로 시각화하기 위한 분산병렬처리 방법론)

  • Park, Jong-Won;Sim, Ye-Chan;Jung, Hae-Sun;Lee, Yong-Woo
    • Journal of Internet Computing and Services / v.19 no.4 / pp.1-6 / 2018
  • In smart cities, data from various kinds of sensors are collected and processed to provide smart services to citizens. Noise information services that build noise maps from sensor data collected over various kinds of ubiquitous sensor networks are one such service. This paper presents a research result that generates three-dimensional (3D) noise maps in real time for smart cities. To make a noise map, we have to combine many kinds of unstructured data, including large geographic information image data and massive sensor data. Making such a 3D noise map in real time requires processing the stream data from the ubiquitous sensor networks in real time and performing the convergence operation in real time. Both are very challenging tasks. We developed our own methodology for real-time distributed and parallel processing for this purpose and present it in this paper. Further, we developed our own real-time 3D noise map generation system using the methodology. The system uses open source software. In this paper, we introduce one of our systems, which uses Apache Storm. We evaluated performance using the developed system, running the experiments on cloud computing. We confirmed that our system works properly with good performance and can produce 3D noise maps in real time. The performance evaluation results are also given in this paper.