• Title/Summary/Keyword: Distributed Data Processing

Optimization and Performance Analysis of Cloud Computing Platform for Distributed Processing of Big Data (대용량 데이터의 분산 처리를 위한 클라우드 컴퓨팅 환경 최적화 및 성능평가)

  • Hong, Seung-Tae;Shin, Young-Sung;Chang, Jae-Woo
    • Spatial Information Research / v.19 no.4 / pp.55-71 / 2011
  • Interest in cloud computing, which provides IT resources as a service, has recently been increasing. As a result, much research has been done on distributed data processing techniques that store and manage large amounts of data across many servers. Meanwhile, to make effective use of spatial data, which is growing rapidly with the advance of GIS technology, distributed processing of spatial data using cloud computing is essential. In this paper, we therefore review representative distributed data processing techniques and analyze the optimization requirements for improving their performance on large amounts of data. In addition, we use Hadoop to evaluate the performance of the distributed data processing techniques against these optimization requirements.

Design of Distributed Processing Framework Based on H-RTGL One-class Classifier for Big Data (빅데이터를 위한 H-RTGL 기반 단일 분류기 분산 처리 프레임워크 설계)

  • Kim, Do Gyun;Choi, Jin Young
    • Journal of Korean Society for Quality Management / v.48 no.4 / pp.553-566 / 2020
  • Purpose: The purpose of this study was to design a framework for generating a one-class classification algorithm based on the Hyper-Rectangle (H-RTGL) in a distributed environment connected by a network. Methods: First, we devised a one-class classifier based on H-RTGL that can be run by distributed computing nodes, considering model and data parallelism. We then designed supporting components for executing distributed processing. Finally, we validated both the effectiveness and the efficiency of the classifier obtained from the proposed framework through a numerical experiment using a data set from the UCI machine learning repository. Results: We designed a distributed processing framework capable of one-class classification based on H-RTGL in a distributed environment consisting of physically separated computing nodes. It includes components for implementing model and data parallelism, which enable distributed generation of the classifier. In the numerical experiment, a statistical test showed no significant change in classification performance, and elapsed time was reduced by distributed processing on data sets of considerable size. Conclusion: Based on these results, we conclude that applying distributed processing to classifier generation preserves classification performance and improves the efficiency of classification algorithms. In addition, we discuss the limitations of this work and suggest directions for future research.
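
A minimal sketch of how data parallelism can work for a hyper-rectangle one-class classifier. It is not the H-RTGL algorithm from the paper: it assumes the simplest possible model, a single axis-aligned bounding box, and the names `fit_local_box`, `merge_boxes`, `is_target` and the sample `partitions` are hypothetical. Each node fits a box to its own rows, and only the box bounds are sent to be merged.

```python
from typing import List, Sequence, Tuple

# A box is (per-feature minima, per-feature maxima).
Box = Tuple[List[float], List[float]]


def fit_local_box(rows: Sequence[Sequence[float]]) -> Box:
    """Fit one axis-aligned bounding box to the rows held by a single node."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def merge_boxes(boxes: Sequence[Box]) -> Box:
    """Combine the per-node boxes into the smallest box enclosing all of them."""
    lows = [min(vals) for vals in zip(*(b[0] for b in boxes))]
    highs = [max(vals) for vals in zip(*(b[1] for b in boxes))]
    return lows, highs


def is_target(box: Box, x: Sequence[float]) -> bool:
    """Classify x as the target class if it lies inside the box."""
    lows, highs = box
    return all(lo <= v <= hi for lo, v, hi in zip(lows, x, highs))


# Hypothetical data: two nodes, each holding a partition of the target class.
partitions = [
    [[1.0, 2.0], [2.0, 3.0]],
    [[0.5, 2.5], [1.5, 4.0]],
]
model = merge_boxes([fit_local_box(p) for p in partitions])
```

Because minima and maxima are associative, the merged box is identical to the box fitted on the pooled data, which is one simple reason such a classifier can keep its accuracy when training is distributed.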

Ontology data processing method in distributed semantic web environment (분산 시맨틱웹 환경에서의 온톨로지 데이터 처리 기법 연구)

  • Kim, Byung-Gon;Oh, Sung-Kyun
    • Journal of Digital Contents Society / v.9 no.2 / pp.277-284 / 2008
  • As users' demand for internet web services grows, the importance of ontologies for constructing the semantic web is increasing. Early Internet data processing was studied in the form of data integration through centralized ontology construction. However, because the internet is a distributed environment, integrating data from distributed sites requires integrating each site's data through peer-to-peer data processing in order to keep up with the rapid change of the internet. In this paper, we propose a data processing method for distributed environments that constructs an ontology at each site using the ontology language OWL. Furthermore, through a relational representation of OWL, we propose a system that includes distributed query processing for data constructed at different sites with different methods.

Distributed Data Processing for Bigdata Analysis in War Game Simulation Environment (워게임 시뮬레이션 환경에 맞는 빅데이터 분석을 위한 분산처리기술)

  • Bae, Minsu
    • The Journal of Bigdata / v.4 no.2 / pp.73-83 / 2019
  • Since the emergence of the fourth industrial revolution, data analysis has been conducted in various fields. Distributed data processing has already become essential for the fast processing of large amounts of data. However, in the defense sector, the simulations in use cannot fully utilize the unstructured data that is prevalent in real environments. In this study, we propose a distributed data processing platform that can be applied to battalion-level simulation models to provide visualized data for command decisions during training. We analyzed 500,000 data points from a strategic game. Considering the winning factors in the data, distributed processing was conducted to analyze the data for the top 10% of teams. The model scales as the number of nodes increases.

Design of GlusterFS Based Big Data Distributed Processing System in Smart Factory (스마트 팩토리 환경에서의 GlusterFS 기반 빅데이터 분산 처리 시스템 설계)

  • Lee, Hyeop-Geon;Kim, Young-Woon;Kim, Ki-Young;Choi, Jong-Seok
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.1 / pp.70-75 / 2018
  • A smart factory is an intelligent factory that can enhance productivity, quality, customer satisfaction, etc. by applying information and communications technology to the entire production process, including design and development, manufacture, and distribution and logistics. The precise amount of data generated in a smart factory varies depending on the factory's size and the state of its facilities. Regardless, it would be difficult to apply traditional production management systems to a smart factory environment, as it generates vast amounts of data. For this reason, there is a growing need for a distributed big-data processing system that can process large amounts of data. This article therefore designs a Gluster File System (GlusterFS)-based distributed big-data processing system that can be used in a smart factory environment. Compared to existing distributed processing systems, the proposed system reduces the system load and the risk of data loss through the distribution and management of network traffic.

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment (분산 컴퓨팅 환경에서 효율적인 유사 조인 질의 처리를 위한 행렬 기반 필터링 및 부하 분산 알고리즘)

  • Yang, Hyeon-Sik;Jang, Miyoung;Chang, Jae-Woo
    • The Journal of the Korea Contents Association / v.16 no.7 / pp.667-680 / 2016
  • As distributed computing platforms such as Hadoop MapReduce have been developed, conventional query processing techniques, which were executed on a single machine, need to be performed efficiently in distributed computing environments. In particular, similarity join query processing in distributed computing environments has been studied, where a similarity join retrieves all data pairs with high similarity between two given data sets. However, existing similarity join query processing schemes for distributed computing environments suffer from skewed computing load across clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load uniformly across clusters, the proposed algorithm estimates the expected computing cost using a matrix and generates partitions based on the estimated cost. In addition, it reduces computing load by filtering out data that are not used in query processing in the clusters. Finally, our performance evaluation shows that the proposed algorithm outperforms the existing one in query processing performance.
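
The idea of estimating cost with a matrix and partitioning by that estimate can be illustrated with a small sketch. This is not the authors' algorithm. It assumes each data set has been summarized as a histogram over buckets (for example, record length), that the cost of matrix cell (i, j) is the number of candidate pairs, that cells whose buckets are more than `max_gap` apart cannot contain similar pairs and are filtered out, and that cells are assigned greedily to the least-loaded worker. `cost_matrix`, `assign_cells` and `max_gap` are hypothetical names.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def cost_matrix(hist_a: List[int], hist_b: List[int], max_gap: int) -> Dict[Cell, int]:
    """Estimated pair comparisons per matrix cell.

    Cells whose bucket indexes differ by more than max_gap are filtered out
    (assumption: such buckets cannot produce similar pairs).
    """
    return {
        (i, j): a * b
        for i, a in enumerate(hist_a)
        for j, b in enumerate(hist_b)
        if abs(i - j) <= max_gap and a * b > 0
    }


def assign_cells(costs: Dict[Cell, int], workers: int) -> List[List[Cell]]:
    """Greedy balancing: hand the most expensive cell to the least-loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    plan: List[List[Cell]] = [[] for _ in range(workers)]
    for cell, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        plan[w].append(cell)
        heapq.heappush(heap, (load + cost, w))
    return plan


# Hypothetical histograms for the two input data sets.
costs = cost_matrix([4, 2, 1], [3, 1, 5], max_gap=1)
plan = assign_cells(costs, workers=2)
loads = [sum(costs[c] for c in cells) for cells in plan]
```

Balancing on estimated computation rather than on transmitted data volume is what addresses the skew the abstract describes; the greedy rule used here is only one of several possible partitioning strategies.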

A Quality Evaluation Model for Distributed Processing Systems of Big Data (빅데이터 분산처리시스템의 품질평가모델)

  • Choi, Seung-Jun;Park, Jea-Won;Kim, Jong-Bae;Choi, Jae-Hyun
    • Journal of Digital Contents Society / v.15 no.4 / pp.533-545 / 2014
  • As IT evolves, the amount of data we face is increasing exponentially. Distributed processing systems for big data have emerged as the technique for managing and analyzing this vast data. Quality evaluation of existing distributed processing systems has been carried out for structured data environments. Applying it to distributed processing systems for big data, which must focus on the analysis of unstructured data, therefore cannot yield a precise quality assessment. A quality evaluation model for distributed processing systems that considers the big data analysis environment is thus needed. In this paper, we propose a new quality evaluation model by deriving quality evaluation elements based on ISO/IEC 9126, the international standard on software quality, and defining metrics for validating the elements.

A holistic distributed clustering algorithm based on sensor network (센서 네트워크 기반의 홀리스틱 분산 클러스터링 알고리즘)

  • Chen Ping;Kee-Wook Rim;Nam Ji-Yeun;Lee KyungOh
    • Proceedings of the Korea Information Processing Society Conference / 2008.11a / pp.874-877 / 2008
  • Existing data processing systems can support only simple queries for sensor networks. It is increasingly important to process the vast data streams in sensor networks and to deliver useful knowledge to users. In this paper, we propose a holistic distributed k-means algorithm for sensor networks. To verify the effectiveness of this method, we compare it with a central k-means algorithm for processing the data streams in a sensor network. The evaluation experiments show that the proposed algorithm can process vast data streams with less computation time. The algorithm clusters the data streams at the distributed nodes, and therefore largely reduces redundant data communication compared to the central processing algorithm.
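
A minimal sketch of why clustering at the nodes cuts communication, assuming a generic distributed k-means rather than the paper's holistic variant. Each node assigns its own points to the nearest current centroid and sends only per-cluster sums and counts; a coordinator merges these summaries into new centroids. All function names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
# Per cluster: (coordinate sums, number of points).
Summary = List[Tuple[List[float], int]]


def nearest(centroids: Sequence[Point], p: Point) -> int:
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(centroids[k], p)),
    )


def local_summary(points: Sequence[Point], centroids: Sequence[Point]) -> Summary:
    """Runs on a node: compress local points into per-cluster sums and counts."""
    dim = len(centroids[0])
    summary: Summary = [([0.0] * dim, 0) for _ in centroids]
    for p in points:
        k = nearest(centroids, p)
        sums, count = summary[k]
        summary[k] = ([s + v for s, v in zip(sums, p)], count + 1)
    return summary


def merge(summaries: Sequence[Summary], centroids: Sequence[Point]) -> List[Point]:
    """Runs on the coordinator: turn the node summaries into new centroids."""
    updated: List[Point] = []
    for k, old in enumerate(centroids):
        count = sum(s[k][1] for s in summaries)
        if count == 0:
            updated.append(tuple(old))  # an empty cluster keeps its centroid
            continue
        sums = [sum(s[k][0][d] for s in summaries) for d in range(len(old))]
        updated.append(tuple(v / count for v in sums))
    return updated


def distributed_kmeans(nodes, centroids, rounds=10):
    for _ in range(rounds):
        centroids = merge([local_summary(pts, centroids) for pts in nodes], centroids)
    return centroids


# Hypothetical readings held by three sensor nodes.
nodes = [[(0.0,), (1.0,)], [(9.0,), (10.0,)], [(2.0,), (11.0,)]]
result = distributed_kmeans(nodes, [(0.0,), (10.0,)], rounds=5)
```

Per round, each node transmits k sums and k counts regardless of how many readings it holds, whereas a central k-means would need every raw reading shipped to one place.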

Spatial Operation Allocation Scheme over Common Query Regions for Distributed Spatial Data Stream Processing (분산 공간 데이터 스트림 처리에서 질의 영역의 겹침을 고려한 공간 연산 배치 기법)

  • Chung, Weon-Il
    • Journal of the Korea Academia-Industrial cooperation Society / v.13 no.6 / pp.2713-2719 / 2012
  • With the growth of various location-based services, distributed data stream processing techniques have been widely studied to provide high scalability and availability. Previous research does not consider the geographic characteristics of spatial data streams when balancing the load of distributed nodes. As a result, distributing operations for adjacent spatial regions increases the overall system load. We propose an operation allocation scheme that considers the characteristics of spatial operations in order to process spatial data streams effectively in distributed computing environments. The proposed method presents an efficient share-maximizing approach that preferentially assigns spatial operations that share common query regions to the same node, in order to separate adjacent spatial operations on overlapping regions.
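
One plausible reading of "operations sharing common query regions go to the same node" can be sketched as follows. This is an illustration under stated assumptions, not the paper's scheme: query regions are axis-aligned rectangles, operations whose regions overlap (directly or transitively) form one group, and each group is placed on the node that currently holds the fewest operations. `overlaps`, `group_by_overlap` and `allocate` are hypothetical names.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def group_by_overlap(regions: Dict[str, Rect]) -> List[List[str]]:
    """Union-find grouping of operations whose query regions overlap."""
    parent = {name: name for name in regions}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    names = sorted(regions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlaps(regions[a], regions[b]):
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Largest groups first so they are placed while nodes are still empty.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


def allocate(regions: Dict[str, Rect], nodes: int) -> List[List[str]]:
    """Place each overlap group, as a whole, on the least-loaded node."""
    plan: List[List[str]] = [[] for _ in range(nodes)]
    for group in group_by_overlap(regions):
        target = min(range(nodes), key=lambda n: len(plan[n]))
        plan[target].extend(group)
    return plan


# Hypothetical query regions for five spatial operations.
regions = {
    "q1": (0, 0, 2, 2),
    "q2": (1, 1, 3, 3),
    "q3": (10, 10, 12, 12),
    "q4": (11, 11, 13, 13),
    "q5": (20, 20, 21, 21),
}
plan = allocate(regions, nodes=2)
```

Keeping an overlap group on one node means the stream tuples that fall in the shared region need to be delivered and evaluated only once, which is the sharing benefit the abstract points to.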

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication / v.16 no.2 / pp.119-126 / 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.