• Title/Summary/Keyword: HADOOP

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication / v.16 no.2 / pp.119-126 / 2024
  • The volume of genomic data is constantly increasing in various modern industries and research fields. This growth presents new challenges and opportunities in terms of the quantity and diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. By introducing a distributed data storage and processing system based on the Hadoop Distributed File System (HDFS), various formats and sizes of genomic data can be efficiently integrated. Furthermore, by leveraging Spark on YARN, efficient management of distributed cloud computing tasks and optimal resource allocation are achieved. This establishes a foundation for the rapid processing and analysis of large-scale genomic data. Additionally, by utilizing BigQuery ML, machine learning models are developed to support genetic search and prediction, enabling researchers to more effectively utilize data. It is expected that this will contribute to driving innovative advancements in genetic research and applications.

Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

  • Priyanka Purushu;Niklas Melcher;Bhagyashree Bhagwat;Jongwook Woo
    • Asia pacific journal of information systems / v.28 no.4 / pp.308-319 / 2018
  • This paper provides insights into financial fraud detection on mobile money transaction activity. We classify each transaction as normal or fraudulent, using Azure on a small sample and Spark ML on the massive data set, which represent a traditional system and a Big Data platform respectively. Experimenting with the sample data set in Azure, we found that the Decision Forest model is the most accurate in terms of recall. For the massive data set in Spark ML, the Random Forest classifier proved to be the best algorithm. The Spark cluster builds and evaluates models much faster as more servers are added, with the same accuracy, which shows that predictions on large-scale data sets are feasible on a Big Data platform. Finally, we reached a recall of 0.73, which implies satisfactory quality in predicting fraudulent transactions.
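
Recall, the metric the abstract reports, is the share of actual fraud cases that the classifier flags: TP / (TP + FN). The sketch below only illustrates that arithmetic; the function name and the toy labels are hypothetical and are not taken from the paper's data.

```python
def recall(true_labels, predicted_labels, positive=1):
    """Fraction of actual positives (fraud) that the classifier caught."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    actual_pos = true_pos + false_neg
    # With no actual positives, recall is undefined; return 0.0 by convention here.
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical toy labels: 1 = fraud, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(recall(y_true, y_pred))  # 3 of 4 frauds caught: 0.75
```

A recall of 0.73 therefore means that roughly 73 of every 100 fraudulent transactions were detected; it says nothing about how many normal transactions were flagged by mistake.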

Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.109-122 / 2014
  • People nowadays create a tremendous amount of data on Social Network Services (SNS). In particular, the incorporation of SNS into mobile devices has resulted in massive data generation, thereby greatly influencing society. This phenomenon is unmatched in history, and we now live in the Age of Big Data. SNS data meets the defining conditions of Big Data: the amount of data (volume), data input and output speeds (velocity), and the variety of data types (variety). Because SNS Big Data covers the whole of society, the trend of an issue discovered in it can serve as an important new source of value. In this study, a Twitter Issue Tracking System (TITS) is designed and built to meet the needs of analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides four functions: (1) provide the topic keyword set that corresponds to the daily ranking; (2) visualize the daily time-series graph of a topic over a month; (3) show the importance of a topic through a treemap based on the score system and frequency; (4) visualize the daily time-series graph of a keyword found by keyword search. The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including stop-word removal and noun extraction, to process unrefined, unstructured data. In addition, such analysis requires the latest big data technology to rapidly process a large amount of real-time data, such as the Hadoop distributed system or NoSQL, an alternative to relational databases. We built TITS on Hadoop to optimize the processing of big data because Hadoop is designed to scale from a single node to thousands of machines.
Furthermore, we use MongoDB, a NoSQL database. MongoDB is an open-source, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no schemas or tables, and its most important goals are data accessibility and data processing performance. In the Age of Big Data, visualization is attractive to the Big Data community because it helps analysts examine such data easily and clearly. Therefore, TITS uses the d3.js library as its visualization tool. This library is designed for creating Data-Driven Documents that bind the document object model (DOM) to arbitrary data; interaction with the data is easy, and the library is useful for managing real-time data streams with smooth animation. In addition, TITS uses Bootstrap, which consists of pre-configured style sheets and JavaScript plug-ins, to build the web system. The TITS graphical user interface (GUI) is designed using these libraries and can detect issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS). Based on this, we can confirm the utility of storytelling and time series analysis. Third, we develop a web-based system and make it available for the real-time discovery of topics. The present study conducted experiments with nearly 150 million tweets in Korea during March 2013.
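
The first function in the list above, a daily keyword ranking after stop-word removal, can be sketched in a few lines. Note that this sketch uses plain word-frequency counting rather than the topic modeling the study applies, and that the stop-word list, the function name, and the sample tweets are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to", "on"}


def daily_keyword_ranking(tweets, top_n=3):
    """tweets: iterable of (date_string, text). Returns {date: [(word, count), ...]}."""
    counts = defaultdict(Counter)
    for date, text in tweets:
        # Crude tokenization: split on whitespace, trim punctuation, lowercase.
        words = [w.strip(".,!?#@").lower() for w in text.split()]
        counts[date].update(w for w in words if w and w not in STOP_WORDS)
    return {date: counter.most_common(top_n) for date, counter in counts.items()}


# Hypothetical sample tweets.
tweets = [
    ("2013-03-01", "Election results in the city"),
    ("2013-03-01", "The election is on!"),
    ("2013-03-02", "Storm warning and storm damage"),
]
ranking = daily_keyword_ranking(tweets, top_n=1)
print(ranking)
```

Collecting the per-day counts of one keyword across a month would give the daily time series that functions (2) and (4) visualize.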

SSQUSAR : A Large-Scale Qualitative Spatial Reasoner Using Apache Spark SQL (SSQUSAR : Apache Spark SQL을 이용한 대용량 정성 공간 추론기)

  • Kim, Jonghoon;Kim, Incheol
    • KIPS Transactions on Software and Data Engineering / v.6 no.2 / pp.103-116 / 2017
  • In this paper, we present the design and implementation of a large-scale qualitative spatial reasoner that uses Apache Spark SQL to efficiently derive new qualitative spatial knowledge representing both topological and directional relationships between two arbitrary spatial objects. Apache Spark SQL is well known as a distributed parallel programming environment that provides both efficient join operations and query processing functions over a variety of data in Hadoop cluster computer systems. In our spatial reasoner, the overall reasoning process is divided into six jobs: knowledge encoding, inverse reasoning, equal reasoning, transitive reasoning, relation refining, and knowledge decoding. The execution order of the reasoning jobs is determined in consideration of both logical causal relationships and computational efficiency. The knowledge encoding job reduces the size of the knowledge base to reason over by transforming the input knowledge from XML/RDF form into a more precise form. Repeating the transitive reasoning job and the relation refining job usually consumes most of the computation time and storage of the overall reasoning process. To improve these jobs, our reasoner finds the minimal disjunctive relations for qualitative spatial reasoning and, based on them, both reduces the composition table used for the transitive reasoning job and optimizes the relation refining job. In experiments using a large-scale benchmark spatial knowledge base, the proposed reasoner showed high performance and scalability.
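
To make the inverse reasoning, transitive reasoning, and relation refining jobs concrete, here is a single-machine sketch. It is not the paper's implementation: a toy point algebra with the relations `<`, `=`, `>` stands in for the topological and directional calculi, and `compose`, `reason`, and the sample facts are hypothetical names chosen for illustration.

```python
from itertools import product

ALL = frozenset("<=>")                      # the fully disjunctive relation
INVERSE = {"<": ">", "=": "=", ">": "<"}


def compose(r1, r2):
    """Composition table of the toy point algebra."""
    if r1 == "=":
        return frozenset(r2)
    if r2 == "=":
        return frozenset(r1)
    return frozenset(r1) if r1 == r2 else ALL


def reason(facts):
    """facts: {(a, b): set of possible relations}. Returns the refined knowledge base."""
    kb = {}
    for (a, b), rels in facts.items():
        kb[(a, b)] = frozenset(rels)
        kb[(b, a)] = frozenset(INVERSE[r] for r in rels)    # inverse reasoning
    changed = True
    while changed:                                          # repeat until nothing changes
        changed = False
        for (a, b), (b2, c) in product(list(kb), repeat=2):
            if b != b2 or a == c:
                continue
            # Transitive reasoning: compose every pair of possible relations.
            composed = frozenset().union(
                *(compose(r1, r2) for r1 in kb[(a, b)] for r2 in kb[(b2, c)])
            )
            # Relation refining: intersect with what is already known.
            refined = kb.get((a, c), ALL) & composed
            if refined != kb.get((a, c), ALL):
                kb[(a, c)] = refined
                changed = True
    return kb


kb = reason({("a", "b"): {"<"}, ("b", "c"): {"<", "="}})
print(sorted(kb[("a", "c")]))
```

The nested loop over all pairs of facts is the part a distributed reasoner would express as a join, and a smaller composition table shrinks the work done inside `compose`.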

A Scalable OWL Horst Lite Ontology Reasoning Approach based on Distributed Cluster Memories (분산 클러스터 메모리 기반 대용량 OWL Horst Lite 온톨로지 추론 기법)

  • Kim, Je-Min;Park, Young-Tack
    • Journal of KIISE / v.42 no.3 / pp.307-319 / 2015
  • Current ontology studies use the Hadoop distributed storage framework to perform MapReduce-based reasoning over scalable ontologies. In this paper, however, we propose a novel approach to scalable Web Ontology Language (OWL) Horst Lite ontology reasoning based on distributed cluster memories. Rule-based reasoning, which is frequently used for scalable ontologies, iteratively executes triple-format ontology rules until no new data is inferred. Therefore, when scalable ontology reasoning is performed on computer hard drives, the ontology reasoner suffers from performance limitations. To overcome this drawback, we propose an approach that loads the ontologies into distributed cluster memories using Spark, a memory-based distributed computing framework, which then executes the ontology reasoning. To implement an appropriate OWL Horst Lite ontology reasoning system on Spark, our method divides the scalable ontologies into blocks, loads each block into the cluster nodes, and subsequently handles the data in the distributed memories. To evaluate the proposed method experimentally, we used the Lehigh University Benchmark (LUBM), which measures ontology inference and search speed, and applied the method to LUBM8000 (1.1 billion triples, 155 gigabytes). Compared with WebPIE, a representative MapReduce-based scalable ontology reasoner, the proposed approach showed a throughput improvement of 320% (62k/s versus 19k/s for WebPIE).
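
The iterate-until-nothing-new loop that the abstract describes can be sketched on one machine as follows. The two rules are RDFS-style examples chosen for illustration, not the paper's OWL Horst Lite rule set, and `forward_chain` and the sample triples are hypothetical.

```python
def forward_chain(triples):
    """Apply two RDFS-style rules until a full pass infers nothing new."""
    known = set(triples)
    while True:
        inferred = set()
        for (s, p, o) in known:
            for (s2, p2, o2) in known:
                if o != s2:
                    continue
                # Rule 1: subClassOf is transitive.
                if p == "subClassOf" and p2 == "subClassOf":
                    inferred.add((s, "subClassOf", o2))
                # Rule 2: an instance of a class is an instance of its superclass.
                if p == "type" and p2 == "subClassOf":
                    inferred.add((s, "type", o2))
        new = inferred - known
        if not new:            # fixpoint reached: stop iterating
            return known
        known |= new


# Hypothetical sample ontology.
triples = {
    ("GradStudent", "subClassOf", "Student"),
    ("Student", "subClassOf", "Person"),
    ("alice", "type", "GradStudent"),
}
closure = forward_chain(triples)
print(len(closure))  # 3 asserted + 3 inferred = 6
```

Every pass rereads the full set of known triples, which is why keeping that set in memory rather than on disk matters for iterative reasoning of this kind.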

SPQUSAR : A Large-Scale Qualitative Spatial Reasoner Using Apache Spark (SPQUSAR : Apache Spark를 이용한 대용량의 정성적 공간 추론기)

  • Kim, Jongwhan;Kim, Jonghoon;Kim, Incheol
    • KIISE Transactions on Computing Practices / v.21 no.12 / pp.774-779 / 2015
  • In this paper, we present the design and implementation of a large-scale qualitative spatial reasoner using Apache Spark, an in-memory high speed cluster computing environment, which is effective for sequencing and iterating component reasoning jobs. The proposed reasoner can not only check the integrity of a large-scale spatial knowledge base representing topological and directional relationships between spatial objects, but also expand the given knowledge base by deriving new facts in highly efficient ways. In general, qualitative reasoning on topological and directional relationships between spatial objects includes a number of composition operations on every possible pair of disjunctive relations. The proposed reasoner enhances computational efficiency by determining the minimal set of disjunctive relations for spatial reasoning and then reducing the size of the composition table to include only that set. Additionally, in order to improve performance, the proposed reasoner is designed to minimize disk I/Os during distributed reasoning jobs, which are performed on a Hadoop cluster system. In experiments with both artificial and real spatial knowledge bases, the proposed Spark-based spatial reasoner showed higher performance than the existing MapReduce-based one.

Research of Performance Interference Control Technique for Heterogeneous Services in Bigdata Platform (빅데이터 플랫폼에서 이종 서비스간 성능 간섭 현상 제어에 관한 연구)

  • Jin, Kisung;Lee, Sangmin;Kim, Youngkyun
    • KIISE Transactions on Computing Practices / v.22 no.6 / pp.284-289 / 2016
  • In the Hadoop-based Big Data analysis model, data movement between the legacy system and the analysis system is difficult to avoid. To overcome this problem, a unified Big Data file system has been introduced so that a single platform can support both the legacy service and the analysis service. However, avoiding the performance degradation caused by interference between the two services remains a major challenge. To solve this problem, we first performed a real-life simulation and observed resource utilization, workload characteristics, and I/O balance level. Based on this analysis, we propose two solutions, one at the system level and one at the technical level. At the system level, we divide the I/O path into a legacy I/O path and an analysis I/O path. At the technical level, we introduce an aggressive prefetch method for the analysis service, which requires sequential reads. We also present experimental results that show a substantial performance gain compared with the previous system.

A Study on Big Data Based Method of Patient Care Analysis (빅데이터 기반 환자 간병 방법 분석 연구)

  • Park, Ji-Hun;Hwang, Seung-Yeon;Yun, Bum-Sik;Choe, Su-Gil;Lee, Don-Hee;Kim, Jeong-Joon;Moon, Jin-Yong;Park, Kyung-won
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.3 / pp.163-170 / 2020
  • With the development of information and communication technologies, the volume of data is increasing exponentially, raising interest in big data. As related technologies have developed, big data is being collected, stored, processed, analyzed, and utilized in many fields. Big data analytics in the health care sector, in particular, is receiving much attention because it can have a huge social and economic impact. Big data technology is expected to make it possible to analyze patients' diagnostic data and reduce the money spent on simple hospital care. Therefore, in this study, patient data is analyzed to present close care guidelines to patients who are unable to go to the hospital and to caregivers who do not have medical expertise. First, the collected patient data is stored in HDFS, and the data is processed and classified in the Hadoop environment using R, a big data processing and analysis tool. The results are then visualized on a web server using R Shiny, which is used to implement various functions of R on the web.

Performance Comparison of Spatial Split Algorithms for Spatial Data Analysis on Spark (Spark 기반 공간 분석에서 공간 분할의 성능 비교)

  • Yang, Pyoung Woo;Yoo, Ki Hyun;Nam, Kwang Woo
    • Journal of Korean Society for Geospatial Information Science / v.25 no.1 / pp.29-36 / 2017
  • In this paper, we implement a spatial big data analysis prototype based on Spark, an in-memory system, and use it to compare the performance of spatial split algorithms. In cluster computing environments, big data is divided into blocks of a certain size in order to balance the computing load. Existing research showed that, for Hadoop-based spatial big data systems, spatial splitting is more effective than the general sequential split method. A Hadoop-based spatial data system stores raw data as is in spatially divided blocks. In the proposed Spark-based spatial analysis system, by contrast, spatial data is converted into an in-memory data structure and stored in spatial blocks for search efficiency. Therefore, we propose an in-memory spatial big data prototype and a spatial split block storage method, and we compare the performance of existing spatial split algorithms in the proposed prototype to identify an appropriate spatial split strategy for Spark-based big data systems. In the experiment, we compared the query execution times of the spatial split algorithms and confirmed that the BSP algorithm shows the best performance.
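
The abstract does not expand "BSP"; assuming it refers to binary space partitioning, a spatial split of this kind can be sketched as a recursive median cut. The function name, the `capacity` parameter, and the sample points are hypothetical, and the paper's actual block layout may differ.

```python
def bsp_split(points, capacity, depth=0):
    """Recursively halve a list of (x, y) points at the median until blocks fit capacity."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(points) <= capacity:
        return [points]
    axis = depth % 2                      # alternate between x (0) and y (1)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2               # median cut keeps the two halves balanced
    return (bsp_split(ordered[:mid], capacity, depth + 1)
            + bsp_split(ordered[mid:], capacity, depth + 1))


# Hypothetical sample points.
points = [(0, 0), (1, 5), (2, 1), (3, 4), (8, 8), (9, 2), (7, 7), (6, 3)]
blocks = bsp_split(points, capacity=2)
print(len(blocks))  # 8 points, at most 2 per block: 4 blocks
```

Because each cut is made at the median, blocks hold nearly equal numbers of objects even when the data is spatially skewed, which is the load-balancing property the abstract points to.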

Recommendation of Best Empirical Route Based on Classification of Large Trajectory Data (대용량 경로데이터 분류에 기반한 경험적 최선 경로 추천)

  • Lee, Kye Hyung;Jo, Yung Hoon;Lee, Tea Ho;Park, Heemin
    • KIISE Transactions on Computing Practices / v.21 no.2 / pp.101-108 / 2015
  • This paper presents the implementation of a system that recommends empirical best routes based on the classification of large trajectory data. As many location-based services are in use, we expect the amount of location and trajectory data to grow into big data. We believe that the best empirical routes can then be extracted from large trajectory repositories. Large trajectory data is clustered into groups of similar routes using the Hadoop MapReduce framework. The clustered route groups are stored and managed by a DBMS, which supports rapid responses to end users' requests. We aim to find the best routes based on collected real data, not the ideal shortest path on maps. We have implemented 1) an Android application that collects trajectories from users, 2) an Apache Hadoop MapReduce program that clusters large trajectory data, and 3) a service application that queries a web server with a start and destination and displays the recommended routes on mobile phones. We validated our approach using real data collected over five days and compared the results with commercial navigation systems. Experimental results show that the empirical best route is better than routes recommended by commercial navigation systems.
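
The abstract does not state how trajectories are compared, so the sketch below makes a simple assumption: two trajectories belong to the same route group when their start and end points fall in the same grid cells. It imitates the map and reduce steps in plain Python; `map_trajectory`, `reduce_groups`, the cell size, and the coordinates are hypothetical.

```python
from collections import defaultdict


def map_trajectory(traj, cell=0.01):
    """Map step: key a trajectory by its snapped start and end grid cells."""
    def snap(point):
        return (round(point[0] / cell), round(point[1] / cell))
    return (snap(traj[0]), snap(traj[-1])), traj


def reduce_groups(mapped):
    """Reduce step: collect the trajectories that share a (start, end) key."""
    groups = defaultdict(list)
    for key, traj in mapped:
        groups[key].append(traj)
    return dict(groups)


# Hypothetical (latitude, longitude) trajectories.
route_a = [(37.501, 127.001), (37.55, 127.05), (37.601, 127.101)]
route_b = [(37.502, 127.002), (37.52, 127.09), (37.599, 127.099)]
route_c = [(35.10, 129.00), (35.20, 129.10)]

groups = reduce_groups(map_trajectory(t) for t in (route_a, route_b, route_c))
print(sorted(len(g) for g in groups.values()))  # group sizes: [1, 2]
```

Within each group, a recommender could then rank the candidate routes by an observed measure such as travel time, which is one way to read "empirical best".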