• Title/Summary/Keyword: Hadoop framework

Scalable RDFS Reasoning using Logic Programming Approach in a Single Machine (단일머신 환경에서의 논리적 프로그래밍 방식 기반 대용량 RDFS 추론 기법)

  • Jagvaral, Batselem;Kim, Jemin;Lee, Wan-Gon;Park, Young-Tack
    • Journal of KIISE / v.41 no.10 / pp.762-773 / 2014
  • As the web of data produces increasingly large RDFS datasets, building scalable reasoning engines over large sets of triples has become essential. Many studies have used expensive distributed frameworks, such as Hadoop, to reason over large RDFS triple sets. In many cases, however, only millions of triples need to be handled. In such cases it is not necessary to deploy expensive distributed systems, because a logic-programming-based reasoner on a single machine can deliver reasoning performance similar to that of a distributed reasoner using Hadoop. In this paper, we propose a scalable RDFS reasoner that uses logic programming methods on a single machine, and we compare our empirical results with those of distributed systems. We show that our single-machine, logic-programming-based reasoner performs similarly to an expensive distributed reasoner for up to 200 million RDFS triples. In addition, we designed a metadata structure that decomposes the ontology triples into separate sectors. Instead of loading all the triples into a single model, we select an appropriate subset of the triples for each ontology reasoning rule. Because unification makes it easy to handle conjunctive queries for RDFS schema reasoning, we designed and implemented the RDFS axioms using logic programming unification and efficient conjunctive query handling mechanisms. The throughput of our approach reached 166K triples/sec over LUBM1500 with 200 million triples. This is comparable to WebPIE, a distributed reasoner using Hadoop and MapReduce, which achieves 185K triples/sec. We show that a distributed system is unnecessary for up to 200 million triples, and that the performance of a logic-programming-based reasoner on a single machine is comparable to that of an expensive distributed reasoner built on the Hadoop framework.
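
As a rough illustration of the idea in this abstract, the sketch below groups triples by predicate (a simplified stand-in for the paper's "sectors") and applies two standard RDFS entailment rules, rdfs11 (subClassOf transitivity) and rdfs9 (type inheritance), with each rule reading only the partitions it needs. The function names and sample data are hypothetical, and plain Python sets are used in place of logic programming unification; this is not the authors' implementation.

```python
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def partition_by_predicate(triples):
    """Group (s, p, o) triples by predicate so each rule loads only its own subset."""
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return parts

def subclass_closure(pairs):
    """rdfs11: transitive closure of subClassOf as a dict class -> set of superclasses."""
    supers = defaultdict(set)
    for sub, sup in pairs:
        supers[sub].add(sup)
    changed = True
    while changed:  # repeat until no new superclass is discovered
        changed = False
        for sub in list(supers):
            new = set()
            for sup in supers[sub]:
                new |= supers.get(sup, set())
            if not new <= supers[sub]:
                supers[sub] |= new
                changed = True
    return supers

def infer_types(triples):
    """rdfs9: if x has type C and C is a subclass of D, infer that x has type D."""
    parts = partition_by_predicate(triples)
    supers = subclass_closure(parts[SUBCLASS])
    inferred = set()
    for inst, cls in parts[TYPE]:
        for sup in supers.get(cls, ()):
            if (inst, sup) not in parts[TYPE]:
                inferred.add((inst, TYPE, sup))
    return inferred

# Hypothetical sample data in the style of a university ontology.
sample = [
    ("GradStudent", SUBCLASS, "Student"),
    ("Student", SUBCLASS, "Person"),
    ("alice", TYPE, "GradStudent"),
]
derived = infer_types(sample)
```

Here `derived` holds the two new type triples for `alice` (as a `Student` and as a `Person`). The point of the partitioning step is that the type rule never touches triples with unrelated predicates.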

Implement on Search Machine using Open Source Framework (오픈 소스 프레임워크를 활용한 검색엔진 구현)

  • Song, Hyun-Ok;Kim, A-Yong;Jung, Hoe-Kyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.19 no.3 / pp.552-557 / 2015
  • With the development of IT and the increased use of smart devices, a great deal of data is produced and consumed on the Internet. For this reason the importance of information retrieval technology is growing, but the technology is difficult to approach because it requires a lot of background knowledge. With the emergence of Lucene, which provides that background, a search engine can be implemented even with little background knowledge of search technology. In this paper, we propose implementing a search engine using a Lucene-based framework. The proposed framework is used for search engines in a server environment and supports distributed processing, distributed storage, and high availability by using Hadoop, Nutch, Solr, and ZooKeeper.

A Design on Informal Big Data Topic Extraction System Based on Spark Framework (Spark 프레임워크 기반 비정형 빅데이터 토픽 추출 시스템 설계)

  • Park, Kiejin
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.521-526 / 2016
  • As online informal text data are massive in volume and unstructured in nature, there are limits to applying traditional relational data model technologies to data storage and analysis. Moreover, with massive, dynamically generated social data, analyzing social users' real-time reactions is hard to accomplish. In this paper, to capture the semantics of massive, informal online documents easily with an unsupervised learning mechanism, we design and implement an automatic topic extraction system based on the mass of the words that make up a document. The input data set for the proposed system is first generated using an N-gram algorithm, which builds multi-word terms to capture the meaning of sentences more precisely, and Hadoop and Spark (an in-memory distributed computing framework) are adopted to run the topic model. In the experiments, TB-level input data are preprocessed and the proposed topic extraction steps are applied. We conclude that the proposed system performs well in extracting meaningful topics in a timely manner, because intermediate results come directly from main memory instead of being read from an HDD.
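
The N-gram preprocessing step mentioned in this abstract can be sketched as follows. This is a minimal single-machine illustration, assuming whitespace tokenization and lowercasing; the function name `word_ngrams` and the sample sentence are hypothetical, and the paper's actual pipeline runs on Hadoop and Spark.

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Return the word-level n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence; the counts would feed a topic model.
bigrams = word_ngrams("Spark runs topic models", n=2)
counts = Counter(bigrams)
```

Multi-word terms such as `topic models` keep context that single words lose, which is the motivation the abstract gives for using N-grams as topic model input.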

The Design of Method for Efficient Processing of Small Files in the Distributed System based on Hadoop Framework (하둡 프레임워크 기반 분산시스템 내의 작은 파일들을 효율적으로 처리하기 위한 방법의 설계)

  • Kim, Seung-Hyun;Kim, Young-Geun;Kim, Won-Jung
    • The Journal of the Korea institute of electronic communication sciences / v.10 no.10 / pp.1115-1122 / 2015
  • The Hadoop framework was designed for processing very large files. When processing small files, however, it wastes the resources of the distributed system and suffers performance degradation, and the effect becomes more noticeable as the number of small files grows. Because the problem is caused by the small files themselves, it can be solved by merging associated small files. Existing ways of merging small files, however, have some limitations. This paper examines the limits of existing merging methods and designs a method of merging small files for effective processing.
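
The general idea of merging small files can be sketched as below: many small files are concatenated into one container, and an index records where each one starts and how long it is. This is a generic in-memory illustration, not the merging method designed in the paper; the function names and file names are hypothetical, and how "associated" files are grouped is left out.

```python
import io

def merge_small_files(files):
    """Pack a dict of name -> bytes into one blob plus an index of (offset, length)."""
    buf = io.BytesIO()
    index = {}
    for name in sorted(files):  # fixed order keeps the layout reproducible
        data = files[name]
        index[name] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index

def read_merged(blob, index, name):
    """Recover one original file from the merged blob using its index entry."""
    offset, length = index[name]
    return blob[offset:offset + length]

# Hypothetical small files that would otherwise each be stored separately.
small = {"a.log": b"alpha", "b.log": b"be"}
blob, index = merge_small_files(small)
```

With this layout the storage layer tracks one large object instead of many small ones, at the cost of an index lookup on every read.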

Scalable P2P Botnet Detection with Threshold Setting in Hadoop Framework (하둡 프레임워크에서 한계점 가변으로 확장성이 가능한 P2P 봇넷 탐지 기법)

  • Huseynov, Khalid;Yoo, Paul D.;Kim, Kwangjo
    • Journal of the Korea Institute of Information Security & Cryptology / v.25 no.4 / pp.807-816 / 2015
  • During the last decade, most coordinated security breaches have been carried out by means of botnets, which are large overlay networks of compromised computers controlled by a remote botmaster. Because of the high volume of traffic to be analyzed, the challenge lies in managing the tradeoff between system scalability and accuracy. We propose a novel Hadoop-based P2P botnet detection method that solves the scalability problem while achieving high accuracy. Moreover, our approach does not require labeled data and is applicable to encrypted traffic as well.

Performance Comparison of Python and Scala APIs in Spark Distributed Cluster Computing System (Spark 기반에서 Python과 Scala API의 성능 비교 분석)

  • Ji, Keung-yeup;Kwon, Youngmi
    • Journal of Korea Multimedia Society / v.23 no.2 / pp.241-246 / 2020
  • Hadoop is a framework for processing large data sets in a distributed way across clusters of nodes. It has been a popular platform for processing big data, but in recent years other platforms have become competitive, depending on the characteristics of the application. Spark is a distributed platform that enables real-time data processing and improves overall processing performance over Hadoop by using in-memory processing instead of disk I/O. Whereas Hadoop is designed to work with Java and data analysis is performed using the Java API, Spark provides APIs for Scala, Python, Java, and R. In this paper, the goal is to find out whether the APIs of different programming languages affect performance in Spark. We chose two popular APIs: Python and Scala. Python is easy to learn and is widely used in the AI domain. Scala is a programming language with advantages in parallelism. Our experiment shows much faster processing with the Scala API than with the Python API. Further study is needed on performance issues in AI-based analysis.

Design and Implementation of Big Data Platform for Image Processing in Agriculture (농업 이미지 처리를 위한 빅테이터 플랫폼 설계 및 구현)

  • Nguyen, Van-Quyet;Nguyen, Sinh Ngoc;Vu, Duc Tiep;Kim, Kyungbaek
    • Proceedings of the Korea Information Processing Society Conference / 2016.10a / pp.50-53 / 2016
  • Image processing techniques play an increasingly important role in many aspects of our daily life. For example, they have been shown to improve agricultural productivity in a number of ways, such as plant pest detection and fruit grading. However, the massive quantities of images generated in real time by multiple devices, such as remote sensors that monitor plant growth, lead to the challenges of big data. Meanwhile, most current image processing systems are designed for small-scale, local computation, and they do not scale well to big data problems with their large requirements for computational resources and storage. In this paper, we propose the IPABigData (Image Processing Algorithm BigData) platform, which provides algorithms to support large-scale image processing in agriculture based on the Hadoop framework. Hadoop provides a parallel computation model, MapReduce, and the Hadoop Distributed File System (HDFS) module. It can also handle parallel pipelines, which are frequently used in image processing. In our experiment, we show that our platform outperforms a traditional system in an image segmentation scenario.

Implementation of a Raspberry-Pi-Sensor Network (라즈베리파이 센서 네트워크 구현)

  • Moon, Sangook
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2014.10a / pp.915-916 / 2014
  • With the upcoming era of the Internet of Things, the study of sensor networks has attracted attention. The Raspberry Pi is a tiny, versatile computer system that can act as a sensor node in a Hadoop cluster network. In this paper, we deployed 5 Raspberry Pi units to construct an experimental testbed of a Hadoop sensor network running the Hadoop MapReduce software framework on 5 nodes. We compared and analyzed the network architecture in terms of efficiency, resource management, and throughput using various parameters. We used machine learning with a support vector machine as the test workload. In our experiments, the Raspberry Pi fulfilled the role of a distributed computing sensor node in the sensor network.

HBase based Business Process Event Log Schema Design of Hadoop Framework

  • Ham, Seonghun;Ahn, Hyun;Kim, Kwanghoon Pio
    • Journal of Internet Computing and Services / v.20 no.5 / pp.49-55 / 2019
  • Organizations design and operate business process models to achieve their goals efficiently and systematically. With the advancement of IT, the number of items that computer systems can participate in has grown, and processes have become huge and complicated. This phenomenon has created more complex and subdivided business process flows. Process instances, which contain workcases and events, are larger and hold more data. This data is an essential resource for process mining and is used directly in model discovery, analysis, and process improvement. The event log keeps getting bigger and broader, which leads to problems such as capacity management and I/O load when it is managed by existing row-level programs or through a relational database. In this paper, as event logs become big data, we identify the limits of management based on the existing original files or a relational database. We design and apply schemes to archive and analyze large event logs using Hadoop, an open-source distributed file system, and HBase, a NoSQL database system.

Access efficiency of small sized files in Big Data using various Techniques on Hadoop Distributed File System platform

  • Alange, Neeta;Mathur, Anjali
    • International Journal of Computer Science & Network Security / v.21 no.7 / pp.359-364 / 2021
  • In recent years, Hadoop usage has been increasing day by day. Development of the technology and its outcomes is eagerly awaited across the globe to enable speedy access to data. The need for computers and dependence on them is increasing day by day. Big data is growing exponentially as the entire world works online. A large amount of data is being produced that is very difficult to handle and process within a short time. At present, industries widely use the Hadoop framework to store and process, within a specified time, the huge amount of data placed on servers. Processing this huge amount of data when it consists of small files, and optimizing its storage, is a big problem. Various techniques, such as HDFS, Sequence files, HAR, and NHAR, have already been proposed. In this paper we discuss various existing techniques developed for accessing and storing small files efficiently. Among these techniques, we specifically tried to implement the HDFS-HAR and NHAR techniques.