• Title/Summary/Keyword: Hadoop Environment

Search Result 91, Processing Time 0.022 seconds

Performance Comparison of Spatial Split Algorithms for Spatial Data Analysis on Spark (Spark 기반 공간 분석에서 공간 분할의 성능 비교)

  • Yang, Pyoung Woo;Yoo, Ki Hyun;Nam, Kwang Woo
    • Journal of Korean Society for Geospatial Information Science
    • /
    • v.25 no.1
    • /
    • pp.29-36
    • /
    • 2017
  • In this paper, we implement a spatial big data analysis prototype based on Spark which is an in-memory system and compares the performance by the spatial split algorithm on this basis. In cluster computing environments, big data is divided into blocks of a certain size order to balance the computing load of big data. Existing research showed that in the case of the Hadoop based spatial big data system, the split method by spatial is more effective than the general sequential split method. Hadoop based spatial data system stores raw data as it is in spatial-divided blocks. However, in the proposed Spark-based spatial analysis system, there is a difference that spatial data is converted into a memory data structure and stored in a spatial block for search efficiency. Therefore, in this paper, we propose an in-memory spatial big data prototype and a spatial split block storage method. Also, we compare the performance of existing spatial split algorithms in the proposed prototype. We presented an appropriate spatial split strategy with the Spark based big data system. In the experiment, we compared the query execution time of the spatial split algorithm, and confirmed that the BSP algorithm shows the best performance.

A study on Digital Agriculture Data Curation Service Plan for Digital Agriculture

  • Lee, Hyunjo;Cho, Han-Jin;Chae, Cheol-Joo
    • Journal of the Korea Society of Computer and Information
    • /
    • v.27 no.2
    • /
    • pp.171-177
    • /
    • 2022
  • In this paper, we propose a service method that can provide insight into multi-source agricultural data, way to cluster environmental factor which supports data analysis according to time flow, and curate crop environmental factors. The proposed curation service consists of four steps: collection, preprocessing, storage, and analysis. First, in the collection step, the service system collects and organizes multi-source agricultural data by using an OpenAPI-based web crawler. Second, in the preprocessing step, the system performs data smoothing to reduce the data measurement errors. Here, we adopt the smoothing method for each type of facility in consideration of the error rate according to facility characteristics such as greenhouses and open fields. Third, in the storage step, an agricultural data integration schema and Hadoop HDFS-based storage structure are proposed for large-scale agricultural data. Finally, in the analysis step, the service system performs DTW-based time series classification in consideration of the characteristics of agricultural digital data. Through the DTW-based classification, the accuracy of prediction results is improved by reflecting the characteristics of time series data without any loss. As a future work, we plan to implement the proposed service method and apply it to the smart farm greenhouse for testing and verification.

Design of GlusterFS Based Big Data Distributed Processing System in Smart Factory (스마트 팩토리 환경에서의 GlusterFS 기반 빅데이터 분산 처리 시스템 설계)

  • Lee, Hyeop-Geon;Kim, Young-Woon;Kim, Ki-Young;Choi, Jong-Seok
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.11 no.1
    • /
    • pp.70-75
    • /
    • 2018
  • Smart Factory is an intelligent factory that can enhance productivity, quality, customer satisfaction, etc. by applying information and communications technology to the entire production process including design & development, manufacture, and distribution & logistics. The precise amount of data generated in a smart factory varies depending on the factory's size and state of facilities. Regardless, it would be difficult to apply traditional production management systems to a smart factory environment, as it generates vast amounts of data. For this reason, the need for a distributed big-data processing system has risen, which can process a large amount of data. Therefore, this article has designed a Gluster File System (GlusterFS)-based distributed big-data processing system that can be used in a smart factory environment. Compared to existing distributed processing systems, the proposed distributed big-data processing system reduces the system load and the risk of data loss through the distribution and management of network traffic.

An Efficient Data Replacement Algorithm for Performance Optimization of MapReduce in Non-dedicated Distributed Computing Environments (비-전용 분산 컴퓨팅 환경에서 맵-리듀스 처리 성능 최적화를 위한 효율적인 데이터 재배치 알고리즘)

  • Ryu, Eunkyung;Son, Ingook;Park, Junho;Bok, Kyoungsoo;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.9
    • /
    • pp.20-27
    • /
    • 2013
  • In recently years, with the growth of social media and the development of mobile devices, the data have been significantly increased. MapReduce is an emerging programming model that processes large amount of data. However, since MapReduce evenly places the data in the dedicated distributed computing environment, it is not suitable to the non-dedicated distributed computing environment. The data replacement algorithms were proposed for performance optimization of MapReduce in the non-dedicated distributed computing environments. However, they spend much time for date replacement and cause the network load for unnecessary data transmission. In this paper, we propose an efficient data replacement algorithm for the performance optimization of MapReduce in the non-dedicated distributed computing environments. The proposed scheme computes the ratio of data blocks in the nodes based on the node availability model and reduces the network load by transmitting the data blocks considering the data placement. Our experimental results show that the proposed scheme outperforms the existing scheme.

Development of Information Technology Infrastructures through Construction of Big Data Platform for Road Driving Environment Analysis (도로 주행환경 분석을 위한 빅데이터 플랫폼 구축 정보기술 인프라 개발)

  • Jung, In-taek;Chong, Kyu-soo
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.19 no.3
    • /
    • pp.669-678
    • /
    • 2018
  • This study developed information technology infrastructures for building a driving environment analysis platform using various big data, such as vehicle sensing data, public data, etc. First, a small platform server with a parallel structure for big data distribution processing was developed with H/W technology. Next, programs for big data collection/storage, processing/analysis, and information visualization were developed with S/W technology. The collection S/W was developed as a collection interface using Kafka, Flume, and Sqoop. The storage S/W was developed to be divided into a Hadoop distributed file system and Cassandra DB according to the utilization of data. Processing S/W was developed for spatial unit matching and time interval interpolation/aggregation of the collected data by applying the grid index method. An analysis S/W was developed as an analytical tool based on the Zeppelin notebook for the application and evaluation of a development algorithm. Finally, Information Visualization S/W was developed as a Web GIS engine program for providing various driving environment information and visualization. As a result of the performance evaluation, the number of executors, the optimal memory capacity, and number of cores for the development server were derived, and the computation performance was superior to that of the other cloud computing.

Implementation of Data processing of the High Availability for Software Architecture of the Cloud Computing (클라우드 서비스를 위한 고가용성 대용량 데이터 처리 아키텍쳐)

  • Lee, Byoung-Yup;Park, Junho;Yoo, Jaesoo
    • The Journal of the Korea Contents Association
    • /
    • v.13 no.2
    • /
    • pp.32-43
    • /
    • 2013
  • These days, there are more and more IT research institutions which foresee cloud services as the predominant IT service in the near future and there, in fact, are actual cloud services provided by some IT leading vendors. Regardless of physical location of the service and environment of the system, cloud service can provide users with storage services, usage of data and software. On the other hand, cloud service has challenges as well. Even though cloud service has its edge in terms of the extent to which the IT resource can be freely utilized regardless of the confinement of hardware, the availability is another problem to be solved. Hence, this paper is dedicated to tackle the aforementioned issues; prerequisites of cloud computing for distributed file system, open source based Hadoop distributed file system, in-memory database technology and high availability database system. Also the author tries to body out the high availability mass distributed data management architecture in cloud service's perspective using currently used distributed file system in cloud computing market.

Parallelization of Genome Sequence Data Pre-Processing on Big Data and HPC Framework (빅데이터 및 고성능컴퓨팅 프레임워크를 활용한 유전체 데이터 전처리 과정의 병렬화)

  • Byun, Eun-Kyu;Kwak, Jae-Hyuck;Mun, Jihyeob
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.8 no.10
    • /
    • pp.231-238
    • /
    • 2019
  • Analyzing next-generation genome sequencing data in a conventional way using single server may take several tens of hours depending on the data size. However, in order to cope with emergency situations where the results need to be known within a few hours, it is required to improve the performance of a single genome analysis. In this paper, we propose a parallelized method for pre-processing genome sequence data which can reduce the analysis time by utilizing the big data technology and the highperformance computing cluster which is connected to the high-speed network and shares the parallel file system. For the reliability of analytical data, we have chosen a strategy to parallelize the existing analytical tools and algorithms to the new environment. Parallelized processing, data distribution, and parallel merging techniques have been developed and performance improvements have been confirmed through experiments.

Estimation of ship operational efficiency from AIS data using big data technology

  • Kim, Seong-Hoon;Roh, Myung-Il;Oh, Min-Jae;Park, Sung-Woo;Kim, In-Il
    • International Journal of Naval Architecture and Ocean Engineering
    • /
    • v.12 no.1
    • /
    • pp.440-454
    • /
    • 2020
  • To prevent pollution from ships, the Energy Efficiency Design Index (EEDI) is a mandatory guideline for all new ships. The Ship Energy Efficiency Management Plan (SEEMP) has also been applied by MARPOL to all existing ships. SEEMP provides the Energy Efficiency Operational Indicator (EEOI) for monitoring the operational efficiency of a ship. By monitoring the EEOI, the shipowner or operator can establish strategic plans, such as routing, hull cleaning, decommissioning, new building, etc. The key parameter in calculating EEOI is Fuel Oil Consumption (FOC). It can be measured on board while a ship is operating. This means that only the shipowner or operator can calculate the EEOI of their own ships. If the EEOI can be calculated without the actual FOC, however, then the other stakeholders, such as the shipbuilding company and Class, or others who don't have the measured FOC, can check how efficiently their ships are operating compared to other ships. In this study, we propose a method to estimate the EEOI without requiring the actual FOC. The Automatic Identification System (AIS) data, ship static data, and environment data that can be publicly obtained are used to calculate the EEOI. Since the public data are of large capacity, big data technologies, specifically Hadoop and Spark, are used. We verify the proposed method using actual data, and the result shows that the proposed method can estimate EEOI from public data without actual FOC.

Digital Forensic Investigation of HBase (HBase에 대한 디지털 포렌식 조사 기법 연구)

  • Park, Aran;Jeong, Doowon;Lee, Sang Jin
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.6 no.2
    • /
    • pp.95-104
    • /
    • 2017
  • As the technology in smart device is growing and Social Network Services(SNS) are becoming more common, the data which is difficult to be processed by existing RDBMS are increasing. As a result of this, NoSQL databases are getting popular as an alternative for processing massive and unstructured data generated in real time. The demand for the technique of digital investigation of NoSQL databases is increasing as the businesses introducing NoSQL database in their system are increasing, although the technique of digital investigation of databases has been researched centered on RDMBS. New techniques of digital forensic investigation are needed as NoSQL Database has no schema to normalize and the storage method differs depending on the type of database and operation environment. Research on document-based database of NoSQL has been done but it is not applicable as itself to other types of NoSQL Database. Therefore, the way of operation and data model, grasp of operation environment, collection and analysis of artifacts and recovery technique of deleted data in HBase which is a NoSQL column-based database are presented in this paper. Also the proposed technique of digital forensic investigation to HBase is verified by an experimental scenario.

Study of In-Memory based Hybrid Big Data Processing Scheme for Improve the Big Data Processing Rate (빅데이터 처리율 향상을 위한 인-메모리 기반 하이브리드 빅데이터 처리 기법 연구)

  • Lee, Hyeopgeon;Kim, Young-Woon;Kim, Ki-Young
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology
    • /
    • v.12 no.2
    • /
    • pp.127-134
    • /
    • 2019
  • With the advancement of IT technology, the amount of data generated has been growing exponentially every year. As an alternative to this, research on distributed systems and in-memory based big data processing schemes has been actively underway. The processing power of traditional big data processing schemes enables big data to be processed as fast as the number of nodes and memory capacity increases. However, the increase in the number of nodes inevitably raises the frequency of failures in a big data infrastructure environment, and infrastructure management points and infrastructure operating costs also increase accordingly. In addition, the increase in memory capacity raises infrastructure costs for a node configuration. Therefore, this paper proposes an in-memory-based hybrid big data processing scheme for improve the big data processing rate. The proposed scheme reduces the number of nodes compared to traditional big data processing schemes based on distributed systems by adding a combiner step to a distributed system processing scheme and applying an in-memory based processing technology at that step. It decreases the big data processing time by approximately 22%. In the future, realistic performance evaluation in a big data infrastructure environment consisting of more nodes will be required for practical verification of the proposed scheme.