• Title/Summary/Keyword: Hadoop server (하둡 서버)


Anomaly Detection Technique of Log Data Using Hadoop Ecosystem (하둡 에코시스템을 활용한 로그 데이터의 이상 탐지 기법)

  • Son, Siwoon;Gil, Myeong-Seon;Moon, Yang-Sae
    • KIISE Transactions on Computing Practices, v.23 no.2, pp.128-133, 2017
  • In recent years, the number of systems for analyzing large volumes of data has been increasing. Hadoop, a representative big data system, stores and processes large data sets in a distributed environment of multiple servers, where system-resource management is very important. The authors attempt to detect anomalies from rapid changes in the log data collected from multiple servers, using simple but efficient anomaly-detection techniques. Accordingly, an Apache Hive storage architecture was designed to store the log data collected from the multiple servers in the Hadoop ecosystem, and three anomaly-detection techniques were designed based on the moving-average and 3-sigma concepts. All three techniques detected the abnormal intervals correctly, and the weighted anomaly-detection technique was more precise than the basic ones. These results show that log-data anomalies can be detected effectively with simple techniques in the Hadoop ecosystem.
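
The moving-average and 3-sigma idea named in the abstract is simple enough to sketch. Below is a minimal, hypothetical Python illustration (not the authors' code): a point is flagged as anomalous when it deviates from the moving average of the preceding window by more than three standard deviations of that window.

```python
# Hypothetical sketch of moving-average / 3-sigma anomaly detection;
# not the paper's implementation.
import statistics

def detect_anomalies(values, window=10):
    """Flag points deviating more than 3 sigma from the moving average."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]      # preceding window only
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(values[i] - mean) > 3 * stdev:
            anomalies.append(i)             # index of the abnormal point
    return anomalies

# Example: a spike in otherwise stable log-metric values is flagged.
metrics = [5.0, 5.1, 4.9, 5.0, 5.2, 5.0, 4.8, 5.1, 5.0, 4.9, 25.0, 5.0]
print(detect_anomalies(metrics, window=10))  # -> [10]
```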

A Trend Analysis Service Using a Hadoop Cluster of Mini PCs (미니 PC 기반의 하둡 클러스터를 이용한 트렌드 분석 서비스)

  • Jeon, Young-Ho;Kim, Eun-Sang;Park, Hyo-Ju;Lee, Ki-Hoon
    • Proceedings of the Korea Information Processing Society Conference, 2015.04a, pp.710-711, 2015
  • As the IT industry develops, the amount of data being generated is increasing explosively. Such big data can be processed at considerable speed using a Hadoop cluster composed of multiple computers, but building a Hadoop cluster generally requires significant cost and space. In this paper, we address these cost and space problems by building a Hadoop cluster out of low-cost mini PCs, and we propose a trend analysis service that uses the cluster. Experimental results show that the Hadoop cluster of mini PCs delivered better trend-analysis processing performance than an expensive server.
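
For a sense of the workload, a trend here can be read as, say, the most frequent keywords over a set of documents. The following is a minimal single-machine sketch of that aggregation; the paper distributes this kind of counting across the mini-PC Hadoop cluster.

```python
# Minimal single-machine sketch of keyword trend counting; the paper
# runs this kind of aggregation distributed over a Hadoop cluster.
from collections import Counter

def top_keywords(documents, k=3):
    """Return the k most frequent whitespace-separated tokens."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts.most_common(k)

docs = ["hadoop cluster trend", "mini pc hadoop", "hadoop trend analysis"]
print(top_keywords(docs))  # -> [('hadoop', 3), ('trend', 2), ('cluster', 1)]
```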

An elastic distributed parallel Hadoop system for bigdata platform and distributed inference engines (동적 분산병렬 하둡시스템 및 분산추론기에 응용한 서버가상화 빅데이터 플랫폼)

  • Song, Dong Ho;Shin, Ji Ae;In, Yean Jin;Lee, Wan Gon;Lee, Kang Se
    • Journal of the Korean Data and Information Science Society, v.26 no.5, pp.1129-1139, 2015
  • An inference process generates additional triples from knowledge represented in the RDF triples of semantic web technology. Tens of millions of triples as initial big data, together with the additionally inferred triples, form a knowledge base for applications such as QA (question-and-answer) systems. The inference engine requires more computing resources to process the triples generated during inference, and additional computing resources supplied by an underlying resource pool in cloud computing can shorten the execution time. This paper presents an algorithm that allocates the number of computing nodes elastically at runtime on Hadoop, depending on the size of the knowledge data fed in. The proposed model has a layered architecture: the top layer for applications, the middle layer for the distributed parallel inference engine that processes the triples, and the lower layer for elastic Hadoop and server virtualization. System algorithms and test data are analyzed and discussed. A benefit of the model is that legacy Hadoop applications can run faster on this system without any modification.
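
The core of "elastic" allocation is scaling node count with input size inside the bounds of the resource pool. Here is a hypothetical Python sketch of that policy; the thresholds and bounds are made-up parameters, and the paper's actual algorithm may differ.

```python
# Hypothetical sketch of elastic node allocation: scale the number of
# Hadoop compute nodes with the size of the knowledge data, bounded by
# the underlying resource pool. Not the paper's algorithm.
def nodes_for_workload(triples, triples_per_node=10_000_000,
                       min_nodes=2, max_nodes=32):
    needed = -(-triples // triples_per_node)   # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(nodes_for_workload(45_000_000))   # -> 5 nodes for 45M triples
print(nodes_for_workload(1_000_000))    # -> 2 (floor at min_nodes)
```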

A Study on Data Storage and Recovery in Hadoop Environment (하둡 환경에 적합한 데이터 저장 및 복원 기법에 관한 연구)

  • Kim, Su-Hyun;Lee, Im-Yeong
    • KIPS Transactions on Computer and Communication Systems, v.2 no.12, pp.569-576, 2013
  • Cloud computing has been receiving increasing attention recently. Despite this attention, security is the main problem that still needs to be addressed for cloud computing. In general, a cloud computing environment protects data by using distributed servers for data storage. When the amount of data is too large, however, pieces of a secret key (if used) may be divided among hundreds of distributed servers, making management of the distributed servers very difficult: the authentication, encryption, and decryption processes alone incur vast overheads. In this paper, we propose an efficient data storage and recovery scheme using XOR and RAID in a Hadoop environment.
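
The XOR-parity idea behind RAID-style recovery can be shown in a few lines. This is a deliberately simplified sketch of the principle, not the paper's scheme: the parity block is the XOR of all data blocks, so any one lost block can be rebuilt from the remaining blocks plus the parity.

```python
# Minimal sketch of XOR-based parity, the core idea behind RAID-style
# recovery. A simplification; the paper's scheme adds more on top.
def xor_blocks(blocks):
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]        # blocks on distributed servers
parity = xor_blocks(data)                  # parity block, stored separately

lost = data[1]                             # suppose server 1 fails
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == lost                   # -> b"BBBB" is rebuilt
```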

A Study on Security Improvement in Hadoop Distributed File System Based on Kerberos (Kerberos 기반 하둡 분산 파일 시스템의 안전성 향상방안)

  • Park, So Hyeon;Jeong, Ik Rae
    • Journal of the Korea Institute of Information Security & Cryptology, v.23 no.5, pp.803-813, 2013
  • With the development of smart devices and social network services, the amount of data has been exploding, and the world is facing the big data era. For these reasons, big data processing technology, a new technology that can handle such data, has attracted much attention. One of the most representative technologies is Hadoop. The Hadoop Distributed File System (HDFS), designed to run on commodity Linux servers, is an open-source framework that can store many terabytes of data. The initial version of Hadoop did not consider security because it focused only on efficient big data processing. As the number of users rapidly increased, a lot of sensitive data, including personal information, was stored on HDFS, so Hadoop released a new version introducing Kerberos and a token system in 2009. However, this system is vulnerable to replay attacks, impersonation attacks, and other attacks. In this paper, we analyze these vulnerabilities of HDFS security and propose a new protocol that complements them while maintaining the performance of Hadoop.
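
As background for the replay attack the abstract mentions, here is a hypothetical Python sketch of one standard countermeasure: rejecting requests whose timestamp falls outside a freshness window or whose nonce has already been seen. This is illustrative only and is not the protocol proposed in the paper.

```python
# Hypothetical sketch of replay-attack rejection with a timestamp
# freshness window plus a nonce cache. Not the paper's protocol.
import time

SEEN_NONCES = set()
MAX_SKEW = 60  # seconds a request is considered fresh

def accept_request(nonce, timestamp):
    if abs(time.time() - timestamp) > MAX_SKEW:
        return False                 # too old or too new: possible replay
    if nonce in SEEN_NONCES:
        return False                 # nonce already used: replay
    SEEN_NONCES.add(nonce)
    return True

now = time.time()
print(accept_request("n1", now))   # True  (fresh, unseen)
print(accept_request("n1", now))   # False (replayed nonce)
```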

Data Transmitting and Storing Scheme based on Bandwidth in Hadoop Cluster (하둡 클러스터의 대역폭을 고려한 압축 데이터 전송 및 저장 기법)

  • Kim, Youngmin;Kim, Heejin;Kim, Younggwan;Hong, Jiman
    • Smart Media Journal, v.8 no.4, pp.46-52, 2019
  • The size of the data generated and collected at industrial sites and in public institutions is growing rapidly. Existing data processing servers often handle the increasing data by scaling up to raise performance. However, in the big data era, when the speed of data generation is exploding, there is a limit to data processing with a conventional server. To overcome this limitation, distributed cluster computing systems that spread data in a scale-out manner have been introduced. Because such systems distribute data over the network, inefficient use of network bandwidth can degrade the performance of the cluster as a whole. In this paper, we propose a scheme that compresses data before transmission in a Hadoop cluster, taking network bandwidth into account. The proposed scheme considers the network bandwidth and the characteristics of the compression algorithms, and selects the optimal compression and transmission scheme before transmitting. Experimental results show that the proposed scheme reduces both data transfer time and size.
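
A bandwidth-aware codec choice can be sketched as a cost comparison: for each candidate codec, estimate compression time plus transfer time of the compressed output, then pick the cheapest. The Python below is a hedged illustration with standard-library codecs; the paper's actual selection logic and codec set may differ.

```python
# Hypothetical sketch of bandwidth-aware compression selection: estimate
# total time (compress + transfer) per codec and pick the cheapest.
import bz2, time, zlib

CODECS = {"none": lambda d: d,
          "zlib": zlib.compress,
          "bz2": bz2.compress}

def best_codec(data, bandwidth_bytes_per_s):
    best, best_time = None, float("inf")
    for name, compress in CODECS.items():
        start = time.perf_counter()
        out = compress(data)
        cost = (time.perf_counter() - start
                + len(out) / bandwidth_bytes_per_s)  # compress + transfer
        if cost < best_time:
            best, best_time = name, cost
    return best

payload = b"log line\n" * 100_000
print(best_codec(payload, 10 * 1024 * 1024))  # e.g. 'zlib' on a 10 MB/s link
```

On a fast link the "none" codec can win because transfer is cheap; on a slow link heavier compression pays off, which is exactly the trade-off the scheme exploits.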

CPS Data Analysis Architecture using Open Source Projects (공개소스프로젝트를 이용한 사이버물리시스템 데이터분석아키텍처)

  • Lim, Yoojin;Choi, Eunmi
    • Proceedings of the Korea Information Processing Society Conference, 2013.11a, pp.172-175, 2013
  • Cyber-physical systems (CPS) are timing-sensitive due to real-time constraints, and when applied in industrial domains they generate large volumes of real-time data exhibiting specific patterns of system behavior and safety-critical logs. This paper introduces a CPS data analysis architecture based on the Hadoop ecosystem, an open-source project. Because of the characteristics of CPS processing, such large volumes of data cannot be analyzed on a single machine, so we propose a system architecture that stores and processes the data generated in real time through the Hadoop ecosystem. The Hadoop Distributed File System (HDFS) is the base file system for storing the huge CPS data, and Hive is used for CPS data analysis with data warehousing. Flume is used to collect data from the servers and feed it into HDFS, and RHive is used to apply data mining and analysis. We give an overview of this architecture and introduce the system design strategies used for effective data analysis.
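
As one concrete slice of such an architecture, the sketch below queries Hive-stored CPS logs from Python. The host, table, and column names are hypothetical, and the paper itself uses RHive (an R binding) rather than PyHive; this is only meant to show what a warehousing query against the Hive layer looks like.

```python
# Hedged sketch: querying Hive-stored CPS logs from Python. Host, table,
# and columns are hypothetical; the paper uses RHive, not PyHive.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="cps")
cur = conn.cursor()
cur.execute(
    "SELECT device_id, COUNT(*) AS events "
    "FROM cps_logs WHERE level = 'SAFETY' "
    "GROUP BY device_id"
)
for device_id, events in cur.fetchall():
    print(device_id, events)    # safety-log event counts per device
conn.close()
```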

Design of Distributed Hadoop Full Stack Platform for Big Data Collection and Processing (빅데이터 수집 처리를 위한 분산 하둡 풀스택 플랫폼의 설계)

  • Lee, Myeong-Ho
    • Journal of the Korea Convergence Society, v.12 no.7, pp.45-51, 2021
  • In line with the rapidly growing non-face-to-face environment and mobile-first strategies, the explosive yearly increase in structured and unstructured data demands new decision-making and services using big data in all fields. However, there have been few reference cases of using the Hadoop ecosystem to collect and load this rapidly increasing big data into a standard platform applicable in a practical environment, and then to store the processed big data in a relational database. Therefore, in this study, unstructured data retrieved by keyword from social network services was collected on Hadoop 2.0 through three virtual machine servers in a Spring Framework environment; the collected unstructured data was loaded into the Hadoop Distributed File System and HBase; and, based on the loaded data, a system was designed and implemented that stores standardized big data in a relational database using a morpheme analyzer. In the future, research on clustering, classification, and analysis using machine learning with Hive or Mahout should continue for deeper data analysis.
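
The final step of that pipeline, tokenizing collected text and storing standardized rows in a relational database, can be sketched minimally. In the Python below, sqlite3 stands in for the RDB and a whitespace tokenizer is a placeholder for a real Korean morpheme analyzer; the paper's own pipeline runs on Spring, HDFS, and HBase.

```python
# Minimal sketch of the final pipeline step: tokenize collected text and
# store standardized rows in a relational database. sqlite3 stands in
# for the RDB; split() is a placeholder for a morpheme analyzer.
import sqlite3

def tokenize(text):
    return text.lower().split()   # placeholder for a morpheme analyzer

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (post_id INTEGER, token TEXT)")

posts = [(1, "Hadoop full stack platform"), (2, "big data collection")]
for post_id, text in posts:
    conn.executemany("INSERT INTO tokens VALUES (?, ?)",
                     [(post_id, t) for t in tokenize(text)])

for row in conn.execute("SELECT token, COUNT(*) FROM tokens GROUP BY token"):
    print(row)   # standardized token counts, ready for later analysis
```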

Big Data Management Scheme using Property Information based on Cluster Group in adopt to Hadoop Environment (하둡 환경에 적합한 클러스터 그룹 기반 속성 정보를 이용한 빅 데이터 관리 기법)

  • Han, Kun-Hee;Jeong, Yoon-Su
    • Journal of Digital Convergence, v.13 no.9, pp.235-242, 2015
  • Social network technology has been increasing interest in big data services and their development. However, data stored on distributed servers, rather than on a central server, is easy to find and extract. In this paper, we propose a big data management technique that minimizes the time needed to retrieve desired information from the content server and the management server that provide big data services. The proposed method links data within a group, classifying data into groups according to the type, features, and characteristics of the big data, and applies a hash chain to the attribute information. Furthermore, to improve processing speed, the method assigns multi-attribute index information to the data, recording the times at which data stored on the distributed servers is generated and extracted. Experimental results show that the average seek time improved by an average of 14.6% as the number of cluster groups increased, and the data processing time by number of keywords was reduced by an average of 13%.
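
The hash-chain element can be illustrated briefly: each link commits to the previous digest, so tampering with any attribute record invalidates every later digest. The Python below is a hypothetical sketch of that chaining; the attribute fields and structure are made up, and the paper's construction differs in its details.

```python
# Hypothetical sketch of a hash chain over attribute information: each
# link commits to the previous one, so altering any record breaks all
# later digests. Details differ from the paper's construction.
import hashlib

def hash_chain(attributes):
    digest = b""
    chain = []
    for attr in attributes:
        digest = hashlib.sha256(digest + attr.encode()).digest()
        chain.append(digest.hex())
    return chain

attrs = ["type=log", "feature=access", "group=cluster-3"]  # made-up fields
for link in hash_chain(attrs):
    print(link[:16], "...")   # successive chained digests
```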

Analysis of the Influence Factors of Data Loading Performance Using Apache Sqoop (아파치 스쿱을 사용한 하둡의 데이터 적재 성능 영향 요인 분석)

  • Chen, Liu;Ko, Junghyun;Yeo, Jeongmo
    • KIPS Transactions on Software and Data Engineering, v.4 no.2, pp.77-82, 2015
  • Big data technology has attracted much attention for its fast data processing, and research on applying it to process large-scale structured data from relational databases (RDBs) much faster is ongoing. Although there are many studies on measuring and analyzing processing performance, studies on structured-data loading performance, the step prior to analysis, are very rare. Thus, in this study, we test the performance of loading structured data from an RDB into the distributed processing platform Hadoop using Apache Sqoop. To analyze the factors that influence data loading, we run the tests repeatedly with different loading options and compare loading performance across RDB-based servers. Although the data loading performance of Apache Sqoop was low in our test environment, much better performance can be expected in a large-scale Hadoop cluster with more hardware resources. This study is expected to serve as a basis for improving data loading performance and for analyzing the performance of every step of structured-data processing on the Hadoop platform.
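
One influence factor such a study typically varies is the degree of parallelism. The Python sketch below times a Sqoop import while sweeping the mapper count; the JDBC URL, table name, and target paths are placeholders, and the script assumes the `sqoop` CLI and a reachable cluster. It illustrates the shape of the experiment, not the paper's exact setup.

```python
# Hedged sketch of one influence-factor experiment: timing a Sqoop
# import while varying the number of mappers. Connection string, table,
# and target dirs are placeholders; requires the sqoop CLI on PATH.
import subprocess, time

def timed_import(num_mappers):
    start = time.perf_counter()
    subprocess.run(
        ["sqoop", "import",
         "--connect", "jdbc:mysql://dbhost/testdb",  # placeholder URL
         "--table", "orders",                         # placeholder table
         "--num-mappers", str(num_mappers),
         "--target-dir", f"/user/test/orders_m{num_mappers}"],
        check=True)
    return time.perf_counter() - start

for m in (1, 2, 4, 8):
    print(m, "mappers:", timed_import(m), "seconds")
```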