• Title/Abstract/Keywords: Big Data Cluster

208 search results, processing time 0.038 seconds

RDP: A storage-tier-aware Robust Data Placement strategy for Hadoop in a Cloud-based Heterogeneous Environment

  • Muhammad Faseeh Qureshi, Nawab;Shin, Dong Ryeol
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 10, No. 9
    • /
    • pp.4063-4086
    • /
    • 2016
  • Cloud computing is a robust technology that helps resolve many parallel and distributed computing issues in the modern Big Data environment. Hadoop is an ecosystem that processes large data sets in a distributed computing environment. HDFS is the Hadoop filesystem, which distributes data blocks to the cluster nodes. Data block placement has become a bottleneck for overall performance in a Hadoop cluster. The current placement policy assumes that all Datanodes have equal computing capacity to process data blocks, meaning the same storage media and the same processing performance on every node. As a result, Hadoop cluster performance is affected by unbalanced workloads, inefficient storage-tier use, network traffic congestion, and HDFS integrity issues. This paper proposes a storage-tier-aware Robust Data Placement (RDP) scheme, which systematically resolves unbalanced workloads, reduces network congestion to an optimal state, utilizes the storage tier in a useful manner, and minimizes HDFS integrity issues. The experimental results show that the proposed approach reduced the unbalanced workload issue to 72%. Moreover, the presented approach resolves the storage-tier compatibility problem to 81% by predicting storage for block jobs, and improved overall data block placement by 78% through pre-calculated computing capacity allocations and execution of map files over the respective Namenode and Datanodes.

A Study On Clusters and Ecosystem In Distribution Industry Using Big Data Analysis

  • 정재헌
    • The Journal of the Korea Contents Association
    • /
    • Vol. 19, No. 7
    • /
    • pp.360-375
    • /
    • 2019
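  • This study used some 50,000 corporate transaction records from 2015 (KED, Korean enterprise information) to build a transaction network of firms that are related to distributors and maintain ongoing transaction relationships, with the aim of understanding the ecosystem of distribution firms. Applying a clustering method groups these firms into 731 clusters of five or more firms each. These clusters account for about 80% of the distribution-industry sales identifiable in the KED data. The clusters show a modular transaction pattern in which most transactions of member firms are completed internally. In the distribution clusters, one firm or two to three firms (the main firms) account for more than 70% of cluster sales. This feature is similar to manufacturing. However, distribution clusters have fewer member firms than manufacturing clusters, and compared with assembly manufacturers, sales are less concentrated in specific firms and clusters. According to the inter-firm linkage analysis, within the clusters to which the top 30 distributors belong, the sales dependence of small and medium-sized enterprises (SMEs) on the main firm is at least 35% for Lotte Shopping, E-mart, E-Land Retail, Shinsegae, Hyundai Home Shopping, and others. This suggests considerable room for SME-fostering policy through fair trade policy within these clusters. CJ Home Shopping, Hyundai Home Shopping, and Hanmu Shopping have very high production-inducing effects, and the first two have especially high production-inducing effects on SMEs in the same cluster. In addition, clusters 1-9 often procure goods from firm groups 10 and 31, where SMEs account for a high share of employment and have very high employment coefficients. Firms that have high production- and employment-inducing effects on SMEs, or high backward linkage effects on firm groups 10 and 31, should be given weight in SME growth and employment policy.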

Comparison analysis of big data integration models

  • 정병호;임동훈
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 28, No. 4
    • /
    • pp.755-768
    • /
    • 2017
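  • As big data takes its place at the core of the Fourth Industrial Revolution, the ability to process and analyze big data is expected to determine companies' future competitiveness. RHadoop and RHIPE, used for big data processing and analysis, are models that integrate R and Hadoop. Each model has been studied extensively on its own, but comparative studies of the two are rare. In this paper, we compare RHadoop and RHIPE by implementing, as MapReduce programs, machine learning algorithms for estimating multiple regression and logistic regression on large real and simulated data sets. In performance experiments on the distributed cluster we built, RHIPE was generally faster than RHadoop but was harder to install and use.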
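As an illustration of how multiple regression can be phrased as a MapReduce job, the following minimal sketch sums the partial products X'X and X'y per data chunk (map) and adds them together (reduce) before solving the normal equations. It is written in plain Python rather than R, and it uses neither RHadoop nor RHIPE; the function names `map_chunk`, `reduce_pair`, `solve`, and `fit` are hypothetical and not taken from the paper.

```python
from functools import reduce

def map_chunk(rows):
    """Map step: turn one chunk of (x_vector, y) rows into partial sums X'X and X'y."""
    p = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        xi = [1.0] + list(x)
        for a in range(p):
            xty[a] += xi[a] * y
            for b in range(p):
                xtx[a][b] += xi[a] * xi[b]
    return xtx, xty

def reduce_pair(left, right):
    """Reduce step: add two partial results element-wise."""
    xtx = [[l + r for l, r in zip(lrow, rrow)] for lrow, rrow in zip(left[0], right[0])]
    xty = [l + r for l, r in zip(left[1], right[1])]
    return xtx, xty

def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def fit(chunks):
    """Combine per-chunk sums, then estimate [intercept, coefficients...]."""
    xtx, xty = reduce(reduce_pair, map(map_chunk, chunks))
    return solve(xtx, xty)
```

Because the per-chunk sums are small (p by p) regardless of the number of rows, only these summaries need to move between nodes; the final solve runs on a single machine.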

A Novel Node Management in Hadoop Cluster by using DNA

  • Balaraju. J;PVRD. Prasada Rao
    • International Journal of Computer Science & Network Security
    • /
    • Vol. 23, No. 9
    • /
    • pp.134-140
    • /
    • 2023
  • Distributed systems play a vital role in storing and processing big data, and data generation from various sources is increasing every second. Hadoop is a scalable, efficient distributed system that supports commodity hardware by combining different networks in the topographical locality. The number of nodes supported in a Hadoop cluster has grown rapidly across versions, which makes clusters difficult to manage. Hadoop does not provide node management features such as adding and deleting nodes. Node identification in a cluster depends entirely on DHCP servers, which manage IP addresses and hostnames based on the physical (MAC) address of each node. This leaves scope for an attacker to steal data using an IP address or hostname, or to disrupt the distributed system by adding a malicious node or assigning a duplicate IP. This paper proposes a novel node management scheme for distributed systems that uses DNA hiding and generates a unique key from the unique physical (MAC) address and hostname of each node. The proposed mechanism provides better node management for the Hadoop cluster, supporting node addition and deletion with limited computation, and provides better node security against attackers. The main aim of this paper is to propose an algorithm that hides node information in DNA sequences to secure nodes against attackers.
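The abstract does not spell out the key-generation or DNA-encoding algorithm, so the following is only a rough sketch of the general idea under stated assumptions: the key is taken to be a SHA-256 hash of the MAC address and hostname, and each pair of bits is mapped to one nucleotide (A=00, C=01, G=10, T=11). Both choices, and the names `node_key`, `to_dna`, and `from_dna`, are hypothetical and are not the authors' method.

```python
import hashlib

BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def node_key(mac, hostname):
    """Derive a fixed-length key from a node's MAC address and hostname (assumed scheme)."""
    material = f"{mac.lower()}|{hostname}".encode("utf-8")
    return hashlib.sha256(material).digest()

def to_dna(data):
    """Encode bytes as a DNA string, two bits per base, most significant bits first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)

def from_dna(seq):
    """Decode a DNA string produced by to_dna back into bytes."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(ch)
        out.append(byte)
    return bytes(out)
```

For example, `to_dna(node_key("00:1A:2B:3C:4D:5E", "datanode-01"))` yields a 128-base sequence that a cluster manager could compare against a stored value when a node asks to join; the MAC address and hostname here are made-up sample values.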

Symbolic Cluster Analysis for Distribution Valued Dissimilarity

  • Matsui, Yusuke;Minami, Hiroyuki;Misuta, Masahiro
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 21, No. 3
    • /
    • pp.225-234
    • /
    • 2014
  • We propose a novel hierarchical clustering method for distribution-valued dissimilarities. Analysis of large and complex data has attracted significant interest. Symbolic Data Analysis (SDA), proposed by Diday in the 1980s, provides a new framework for statistical analysis. In SDA, we analyze an object with internal variation, such as an interval, a histogram, or a distribution, called a symbolic object. In this study, we focus on cluster analysis for distribution-valued dissimilarities, one kind of symbolic object. Hierarchical clustering generally has two steps: a find step and an update step. In the find step, we find the nearest pair of clusters. We extend it to distribution-valued dissimilarities by introducing a measure on their order relations. In the update step, dissimilarities between clusters are redefined as a mixture of distributions with a mixing ratio. We show an actual example of the proposed method and a simulation study.
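The two-step loop described in the abstract can be sketched for the ordinary case of scalar dissimilarities. This is a simplification, not the proposed method: the paper works with distribution-valued dissimilarities, an order measure in the find step, and a mixture of distributions in the update step, whereas the sketch below uses plain numbers and a single-linkage (minimum) update. The function name `agglomerate` is hypothetical.

```python
def agglomerate(matrix, n_clusters):
    """Agglomerative clustering on a symmetric dissimilarity matrix (scalar entries)."""
    n = len(matrix)
    clusters = [[i] for i in range(n)]
    active = set(range(n))
    # Dissimilarities between active clusters, keyed by (a, b) with a < b.
    d = {(a, b): matrix[a][b] for a in range(n) for b in range(a + 1, n)}
    while len(active) > n_clusters:
        # Find step: pick the nearest pair of clusters.
        a, b = min(d, key=d.get)
        # Update step: merge b into a and redefine dissimilarities to the merged cluster.
        clusters[a] = clusters[a] + clusters[b]
        active.remove(b)
        for c in active:
            if c == a:
                continue
            key_ac = (min(a, c), max(a, c))
            key_bc = (min(b, c), max(b, c))
            d[key_ac] = min(d[key_ac], d[key_bc])  # single linkage (assumed here)
        d = {k: v for k, v in d.items() if b not in k}
    return sorted(sorted(clusters[i]) for i in active)
```

Swapping the `min(...)` in the update step for another rule (maximum, average, or a mixture as in the paper) changes the linkage without touching the find step.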

Comparison of time series clustering methods and application to power consumption pattern clustering

  • Kim, Jaehwi;Kim, Jaehee
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 27, No. 6
    • /
    • pp.589-602
    • /
    • 2020
  • The development of smart grids has enabled the easy collection of a large amount of power data. There are some common patterns that make it useful to cluster power consumption patterns when analyzing power big data. In this paper, clustering analysis is based on distance functions for time series and clustering algorithms to discover patterns in power consumption data. In clustering, we use 10 distance measures to find clusters that reflect the characteristics of time series data. A simulation study is done to compare the distance measures for clustering. Cluster validity measures such as error rate, similarity index, Dunn index, and silhouette values are also calculated and compared. Real power consumption data are then clustered using the five distance measures that performed better than the others in the simulation.
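The abstract does not name its 10 distance measures. As an illustration of why the choice of distance matters for time series, the sketch below compares Euclidean distance with dynamic time warping (DTW), a commonly used time-series distance; whether DTW is among the measures in the paper is an assumption. The function names `euclidean` and `dtw` are hypothetical.

```python
import math

def euclidean(a, b):
    """Point-by-point distance; requires series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Dynamic time warping distance with absolute difference as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1]
```

For two load curves with the same peak shifted by one time step, such as `[0, 1, 2, 1, 0]` and `[0, 0, 1, 2, 1]`, DTW gives a smaller distance than Euclidean distance, so a clustering built on DTW is more likely to group them together.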

A Design of Similar Video Recommendation System using Extracted Words in Big Data Cluster

  • 이현섭;김진덕
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • Vol. 24, No. 2
    • /
    • pp.172-178
    • /
    • 2020
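  • In video-sharing services, which have recently come into wide use, the content recommendation system is a very important component. Content recommendation generally uses collaborative filtering, which considers user preference and video (item) similarity together. Such services mainly use personal preferences, such as a user's search keywords and viewing time, to improve user convenience. They also rank videos around the keywords assigned to them. However, this approach is limited in that video similarity is analyzed using only a restricted set of keywords. The problem becomes serious when the assigned keywords do not properly reflect the item. In this paper, similarity is analyzed by considering all words with a distinctive meaning extracted from educational videos; because the scale of the data and computation is very large in this case, processing is carried out on a big data cluster. The proposed system is expected to serve as a basic module of video-sharing service platforms through big data video analysis.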
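The idea of comparing videos on all of their words, rather than on a few assigned keywords, can be sketched with word-count vectors and cosine similarity. This is a single-machine illustration under assumptions: the abstract does not state the similarity measure, whitespace splitting stands in for morpheme extraction, and the names `word_vector`, `cosine`, and `rank_similar` are hypothetical.

```python
import math
from collections import Counter

def word_vector(text):
    """Count every word in a transcript (whitespace split stands in for morpheme extraction)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_similar(target_id, transcripts):
    """Return the other video ids ordered from most to least similar to target_id."""
    vectors = {vid: word_vector(text) for vid, text in transcripts.items()}
    target = vectors[target_id]
    scores = [(cosine(target, vec), vid) for vid, vec in vectors.items() if vid != target_id]
    return [vid for _, vid in sorted(scores, reverse=True)]
```

Every pair of videos is scored independently, which is what makes the computation easy to spread across a cluster: each node can score a subset of pairs and the results are merged afterwards.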

Recommendation of tourist attractions based on Preferences using big data

  • KIM HYUN SEOK;Gi-hwan Ryu;kim im yeo-reum
    • International Journal of Advanced Culture Technology
    • /
    • Vol. 11, No. 3
    • /
    • pp.327-331
    • /
    • 2023
  • This paper proposes a tourist destination recommendation application that combines a chatbot and a recommendation system. The data entered into the chatbot came from big data on social media. Using TEXTOM, a total of 22,701 data items were collected over a one-year period from January 2022 to January 2023. Non-terms that interfere with analysis were removed through the data purification process. Using the refined data, network visualization and CONCOR analysis were applied to identify the information users want about travel to Jeju Island, and categories were organized for each cluster. The content was organized intuitively so that even first-time users could use it easily, reducing the difficulty of operating the application. In the proposed application, users can select their own preferences and receive information. In addition, the chatbot lets users focus more on the process of acquiring information by giving a sense of reality while they operate the application. This suggests an application that can achieve the purpose of the curator by influencing the user's desire to visit tourist attractions.

Mathematics Anxiety Analysis using Topological Data Analysis

  • 고호경;박선정
    • East Asian mathematical journal
    • /
    • Vol. 34, No. 2
    • /
    • pp.177-189
    • /
    • 2018
  • Recently, Topological Data Analysis (TDA) has attracted attention among various techniques for analyzing big data. The Mapper algorithm, one of the TDA techniques, is used to visualize the cluster diagram. In this study, students were clustered according to the characteristics and degree of their mathematics anxiety using Mapper, and the results were visualized. To do this, the Mathematical Anxiety Scale (Ko & Yi, 2011) was used, which covers mathematical instability in terms of teaching and learning, i.e., Nature of Mathematics, Learning Strategy, and Test/Performance. In addition, the questions that measure mathematics anxiety can be selected by extracting the most relevant items among those that measure it.

Incidence of Online Public Opinion on Guangzhou Simultaneous Renting and Purchasing Policy - A data mining application

  • Wang, Yancheng;Li, Haixian
    • Asian Journal for Public Opinion Research
    • /
    • Vol. 5, No. 4
    • /
    • pp.266-284
    • /
    • 2018
  • This paper adopts a big data research method and draws 491 data items from the Tianya Forum about the Simultaneous Renting and Purchasing policy of Guangzhou. The qualitative analysis software Nvivo11 is used to cluster the main questions about the policy in the forum. Thirty-six high-frequency words are obtained through text clustering. Through grounded theory analysis, the main driving factors behind people's doubts are summarized into 9 main categories and 3 core categories, and a model of driving factors for online forums is established. The study finds that resource factors are the key factor, economic factors are important drivers, and policy guiding factors are secondary drivers.