• Title/Summary/Keyword: 빅데이터 클러스터 (big data cluster)

Load Balancing for Distributed Processing of Real-time Spatial Big Data Stream (실시간 공간 빅데이터 스트림 분산 처리를 위한 부하 균형화 방법)

  • Yoon, Susik;Lee, Jae-Gil
    • Journal of KIISE / v.44 no.11 / pp.1209-1218 / 2017
  • A wide variety of sensors are in use today, and it has become much easier to acquire spatial big data streams from various sources. Because spatial data streams have inherently skewed and dynamically changing distributions, the system must distribute the load effectively among workers. Previous approaches to this load imbalance problem are not directly applicable to spatial data. In this research, we propose Adaptive Spatial Key Grouping (ASKG). The main idea of ASKG is to use the previous distribution of the data streams to adaptively suggest a new grouping scheme that evenly distributes the future load among workers. We evaluate the proposed algorithm in various environments through experiments on real datasets, varying the number of workers, the input rate, and the processing overhead. Compared with two alternative algorithms, ASKG improves system performance in terms of load imbalance, throughput, and latency.
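
The idea of regrouping keys from the previous distribution can be illustrated with a minimal sketch. This is not the ASKG algorithm itself, whose details the abstract does not give; it swaps in a plain greedy heuristic (heaviest key first, assigned to the least-loaded worker). The grid-cell keys, the counts, and the `rebalance` function are all hypothetical.

```python
import heapq
from collections import Counter

def rebalance(cell_counts, num_workers):
    """Assign grid cells to workers using the previous window's counts.

    Greedy heuristic (an assumption, not ASKG): take the heaviest cell
    first and give it to the currently least-loaded worker.
    Returns (assignment, loads).
    """
    heap = [(0, w) for w in range(num_workers)]  # (predicted load, worker id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; ties are broken by the cell key for determinism.
    for cell, count in sorted(cell_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[cell] = worker
        heapq.heappush(heap, (load + count, worker))
    loads = [0] * num_workers
    for cell, worker in assignment.items():
        loads[worker] += cell_counts[cell]
    return assignment, loads

# Hypothetical tuple counts per grid cell in the previous window (skewed on purpose).
previous = Counter({(0, 0): 50, (0, 1): 30, (1, 0): 10, (1, 1): 10})
assignment, loads = rebalance(previous, 2)
```

With these made-up counts, the hot cell gets a worker to itself and the three lighter cells share the other, so both workers carry the same predicted load. A real system would rerun such a step periodically as the distribution drifts.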

Development of Multidimensional Analysis System for Bio-pathways (바이오 패스웨이 다차원 분석 시스템 개발)

  • Seo, Dongmin;Choi, Yunsoo;Jeon, Sun-Hee;Lee, Min-Ho
    • The Journal of the Korea Contents Association / v.14 no.11 / pp.467-475 / 2014
  • With the development of genomics, wearable devices, and IT/NT, a vast amount of bio-medical data has been generated in recent years. Healthcare industries based on big data are booming, and big data technology built on bio-medical data is rising rapidly as a core technology for improving national health and addressing an aging society. A pathway is deep biological knowledge that represents, as a network, the dynamics and interactions among proteins, genes, and cells. Pathways are widely used as an important part of bio-medical big data analysis. However, pathway analysis requires a lot of time and effort because pathways are highly diverse and large in volume. Moreover, no multidimensional analysis system for various pathways exists yet. In this paper, we propose a pathway analysis system that collects the pathways a user is interested in from the KEGG pathway database, which supports the most widely used pathways, constructs a network based on the hierarchical structure of the pathways, and analyzes the dynamics and interactions among pathways by clustering the network and selecting core pathways from it. Finally, to verify the superiority of our pathway analysis system, we evaluate its performance in various experiments.

Development of Retargetable Hadoop Simulation Environment Based on DEVS Formalism (DEVS 형식론 기반의 재겨냥성 하둡 시뮬레이션 환경 개발)

  • Kim, Byeong Soo;Kang, Bong Gu;Kim, Tag Gon;Song, Hae Sang
    • Journal of the Korea Society for Simulation / v.26 no.4 / pp.51-61 / 2017
  • The Hadoop platform is a representative platform for storing and managing big data. Hadoop consists of a distributed computing system called MapReduce and a distributed file system called HDFS. It is important to analyze how effectiveness changes with the cluster configuration and with various parameters. However, since it is hard to construct clusters of thousands of nodes and analyze the resulting system, a simulation method is required. This paper proposes a Hadoop simulator based on the DEVS formalism, which provides hierarchical and modular modeling. The Hadoop simulator provides a retargetable experimental environment in which various parameters, algorithms, and models can be changed. It is also possible to design input models that reflect the characteristics of Hadoop applications. To maximize user convenience, a user interface, a real-time model viewer, and an input scenario editor are also provided. In this paper, we validate the Hadoop simulator by comparing its results with actual Hadoop execution results and perform various experiments.

A Study On Clusters and Ecosystem In Distribution Industry Using Big Data Analysis (빅데이타 분석을 통한 유통산업 클러스터의 형성과 생태계 연구)

  • Jung, Jaeheon
    • The Journal of the Korea Contents Association / v.19 no.7 / pp.360-375 / 2019
  • This paper studies the ecosystem of the distribution industry by constructing a network of continuing transactions associated with the industry, using 2015 data on more than 50 thousand firms provided by Korean enterprise data (KED). After applying a clustering method, one of the tools of social network analysis, we find that the firms in the network group into 732 clusters accounting for about 80% of all distribution industry sales in the KED data. The firms in a cluster conduct most of their transactions with other firms in the same cluster. However, compared with the manufacturing industry, the clusters contain fewer firms and the biggest firms account for a smaller share of sales. The input-output analysis for the biggest distribution firms shows that, in some clusters, small and medium-sized enterprises (SMEs) have a very high sales dependency on a main firm. This suggests a need for more efficient fair-transaction policies within the clusters. In addition, a small number of big distribution firms have very high backward production linkage effects on SMEs or on the 10th or 31st group, which have a high share of SME employment. These firms should be considered important in SME growth and employment policies.

A Study On the Clusters In the Electronic Industry Using Social Network Analysis (사회적 네트워크 분석을 이용한 전자산업 클러스터 연구)

  • Jung, Jaeheon
    • The Journal of the Korea Contents Association / v.19 no.5 / pp.48-63 / 2019
  • We applied new analyses, including social network analysis (SNA), to the transaction network centered on electronics companies, using transaction data on more than 50 thousand companies obtained from Korean enterprise data (KED) for 2015. We found 97 clusters with more than 10 firms and, notably, 13 clusters accounting for more than 90% of the sales of the electronics industry in Korea. Clusters are groups of companies that conduct most of their transactions within the cluster they belong to. We found that 5 clusters account for 83% of sales in the electronics industry. In all but a few clusters, a single main firm accounts for most of the cluster's sales. We also found a few firms with high backward production linkage effects, including firms whose linkage effects fall especially on small and medium-sized enterprises (SMEs). The companies with high production linkage effects (especially on SMEs) should be managed in terms of SME growth policy. The last firm group, consisting of the small clusters with fewer than 10 firms, has high employment coefficients. The clusters or companies with high production linkage effects on this last firm group should be noted in terms of employment policy. We also note that there are firms with high betweenness coefficients, indicating high potential for technology development. They should be managed carefully in terms of technology development policy.

K Nearest Neighbor Joins for Big Data Processing based on Spark (Spark 기반 빅데이터 처리를 위한 K-최근접 이웃 연결)

  • JIAQI, JI;Chung, Yeongjee
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.9 / pp.1731-1737 / 2017
  • K Nearest Neighbor Join (KNN Join) is a simple yet effective method in machine learning. It has been widely used on small datasets. As the volume of data increases, it becomes infeasible to run this model in a real application on a single machine because of memory and time restrictions. MapReduce, a popular batch processing model that can run on a cluster with a large number of computers, is now widely used for large-scale data processing. Hadoop is a framework that implements MapReduce, but its performance can be further improved by a newer framework named Spark. In the present study, we present a KNN Join implementation based on Spark. With the advantage of its in-memory computation capability, it can be faster and more effective than Hadoop. In our experiments, we study the influence of different factors on running time and demonstrate the robustness and efficiency of our approach.
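
For readers unfamiliar with the operation, here is a minimal single-machine sketch of what a KNN join computes: for every point in one set, the k nearest points in another. It is a brute-force illustration in plain Python, not the authors' Spark implementation; the function name `knn_join` and the sample points are hypothetical.

```python
import heapq
import math

def knn_join(r_points, s_points, k):
    """For each point in r_points, return the indices of its k nearest s_points.

    Brute force: every R point is compared against every S point, which is
    what becomes infeasible on one machine as the datasets grow.
    """
    result = {}
    for i, r in enumerate(r_points):
        # Rank S indices by Euclidean distance to r; ties broken by index.
        nearest = heapq.nsmallest(
            k, range(len(s_points)), key=lambda j: (math.dist(r, s_points[j]), j)
        )
        result[i] = nearest
    return result

# Hypothetical 2-D sample data.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (5.0, 6.0), (9.0, 9.0)]
pairs = knn_join(R, S, k=2)
```

Each R point is handled independently of the others, which is why the outer loop is a natural unit to spread across the workers of a cluster.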

Professional Baseball Viewing Culture Survey According to Corona 19 using Social Network Big Data (소셜네트워크 빅데이터를 활용한 코로나 19에 따른 프로야구 관람문화조사)

  • Kim, Gi-Tak
    • Journal of Korea Entertainment Industry Association / v.14 no.6 / pp.139-150 / 2020
  • The data processing in this study used Textom and focused on social media words in three areas: 'Corona 19 and professional baseball', 'Corona 19 and professional baseball', and 'Corona 19 and professional sports'. The data were collected and refined in a web environment and then processed in batch, and the Ucinet6 program was used for visualization. Specifically, data were collected from the Naver, Daum, and Google channels, and the extracted words were narrowed down to 30 through expert meetings for use in the final study. The 30 extracted words were visualized as a matrix, and a CONCOR analysis was performed to identify clusters of words by similarity and commonality. The analysis showed that the clusters related to Corona 19 and professional baseball consisted of one central cluster and five peripheral clusters, and that content related to the opening of the professional baseball season amid the Corona 19 wave was mainly searched. The clusters related to Corona 19 and unrelated to professional baseball consisted of one central cluster and five peripheral clusters, and keywords on the position of professional baseball regarding games under Corona 19 were mainly searched. The clusters related to Corona 19 and professional sports consisted of one central cluster and five peripheral clusters, and keywords related to the start of professional sports in the aftermath of Corona 19 were mainly searched.

A Study on Distributed Parallel SWRL Inference in an In-Memory-Based Cluster Environment (인메모리 기반의 클러스터 환경에서 분산 병렬 SWRL 추론에 대한 연구)

  • Lee, Wan-Gon;Bae, Seok-Hyun;Park, Young-Tack
    • Journal of KIISE / v.45 no.3 / pp.224-233 / 2018
  • Recently, there have been many studies on SWRL reasoning engines based on user-defined rules in distributed environments using large-scale ontologies. Unlike schema-based axiom rules, SWRL rules do not allow an efficient inference order to be defined. Unnecessary iterative processes also produce a large volume of data shuffled over the network. To solve these problems, we propose a method that uses the Map-Reduce algorithm and a distributed in-memory framework to deduce multiple rules simultaneously and to minimize the volume of data shuffled between distributed machines in the cluster. For the experiment, we use the WiseKB ontology, composed of 200 million triples, and 36 user-defined rules. We found that the proposed reasoner makes inferences in 16 minutes and is 2.7 times faster than previous reasoning systems that used the LUBM benchmark dataset.

Delayed Block Replication Scheme of Hadoop Distributed File System for Flexible Management of Distributed Nodes (하둡 분산 파일시스템에서의 유연한 노드 관리를 위한 지연된 블록 복제 기법)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences / v.12 no.2 / pp.367-374 / 2017
  • This paper discusses the problems of managing distributed nodes in Hadoop, a platform for big data processing, and proposes a novel technique that enables flexible node management in the Hadoop Distributed File System. Hadoop cannot reconfigure a cluster dynamically because it treats temporarily unavailable nodes as failed. The delayed block replication scheme proposed in this paper delays the removal of an unavailable node as long as possible so that the node can easily rejoin. Experimental results show that the proposed scheme increases the flexibility of node management with little impact on distributed processing performance when the cluster size changes.
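
The general idea of deferring replication for a temporarily silent node can be sketched with a toy model. This is an assumption-laden illustration, not the paper's scheme and not HDFS code: the class `DelayedReplicationMonitor`, its `heartbeat_timeout` and `grace_period` parameters, and the three states are all hypothetical.

```python
class DelayedReplicationMonitor:
    """Toy model: a silent node is declared dead only after a grace period."""

    def __init__(self, heartbeat_timeout, grace_period):
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical, in seconds
        self.grace_period = grace_period            # hypothetical, in seconds
        self.last_seen = {}                         # node name -> last heartbeat time

    def heartbeat(self, node, now):
        """Record a heartbeat; a rejoining node simply becomes live again."""
        self.last_seen[node] = now

    def state(self, node, now):
        silent_for = now - self.last_seen[node]
        if silent_for <= self.heartbeat_timeout:
            return "live"
        if silent_for <= self.heartbeat_timeout + self.grace_period:
            return "unavailable"  # blocks are kept; replication is deferred
        return "dead"             # only now would blocks be re-replicated

    def nodes_to_replicate(self, now):
        """Nodes whose blocks would need new replicas at time `now`."""
        return sorted(n for n in self.last_seen if self.state(n, now) == "dead")

monitor = DelayedReplicationMonitor(heartbeat_timeout=30, grace_period=600)
monitor.heartbeat("node-a", now=0)
monitor.heartbeat("node-b", now=0)
monitor.heartbeat("node-b", now=100)
```

In this model a node that goes quiet and returns within the grace period never triggers re-replication, which is the flexibility the abstract describes; the cost is a longer window with fewer live replicas.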

Study on the shared in a distributed processing system (분산처리시스템에서 공유에 대한 고찰)

  • Kim, Gyu-Seok
    • Proceedings of the Korea Information Processing Society Conference / 2016.10a / pp.346-347 / 2016
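  • Computing resources provided as infrastructure become cost-efficient when they are shared among users. This holds not only for conventional information systems centered on web, WAS, and DBMS, but also for fields that require distributed processing, such as big data. We examine the elements needed to share computing resources securely in a distributed processing system and verify them by building a shared distributed processing platform that provides an isolated cluster for each user.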