• Title/Summary/Keyword: Big data processing


Study on the Direction of Universal Big Data and Big Data Education-Based on the Survey of Big Data Experts (보편적 빅데이터와 빅데이터 교육의 방향성 연구 - 빅데이터 전문가의 인식 조사를 기반으로)

  • Park, Youn-Soo;Lee, Su-Jin
    • Journal of The Korean Association of Information Education / v.24 no.2 / pp.201-214 / 2020
  • Big data is gradually expanding into diverse fields, accompanied by changes in data-related legislation, and interest in big data education is growing accordingly. However, utilizing big data requires a high level of knowledge and skill, and training takes a long time and costs a great deal. In this study, we define Universal Big Data as the big data used across a wide range of industrial fields, and on that basis develop a paradigm for big data education for college students. We surveyed big data professionals on their definition and perception of big data. According to the survey, these professionals recognize a definition of big data that is broader than that of computer science, and they recognize that big data processing does not necessarily require big data processing frameworks or high-performance computers. This implies that big data education should focus on the analysis and application methods of Universal Big Data rather than on computer science (engineering) knowledge and skills. Based on this research, we propose Universal Big Data education as a new paradigm.

Big Data Management Scheme using Property Information based on Cluster Group in adopt to Hadoop Environment (하둡 환경에 적합한 클러스터 그룹 기반 속성 정보를 이용한 빅 데이터 관리 기법)

  • Han, Kun-Hee;Jeong, Yoon-Su
    • Journal of Digital Convergence / v.13 no.9 / pp.235-242 / 2015
  • Social network technology has increased interest in big data services and their development. However, data stored on distributed servers rather than a central server is not easy to find and extract. In this paper, we propose a big data management technique that minimizes the time needed to retrieve the desired information from the content server and the management server that provide big data services. The proposed method classifies data into groups according to the type, features, and characteristics of the big data, links the data within each group, and applies a hash chain to the attribute information. Furthermore, to improve processing speed, multi-attribute index information, including the time at which data stored on the distributed servers was generated and extracted, is attached to the data. Experimental results show that the average data seek time improved by an average of 14.6% as the number of cluster groups increased, and the data processing time was reduced by an average of 13% as the number of keywords increased.
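  The hash-chain idea the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation; the `hash_chain` function, the SHA-256 choice, and the sample attribute records are assumptions for illustration: each attribute record's link depends on all earlier links, so the chain is deterministic and order-sensitive.

```python
import hashlib

def hash_chain(attributes, seed=b"cluster-group"):
    """Link each attribute record to the previous one via a SHA-256 hash chain."""
    chain = []
    prev = hashlib.sha256(seed).hexdigest()
    for attr in attributes:
        link = hashlib.sha256((prev + attr).encode()).hexdigest()
        chain.append(link)
        prev = link
    return chain

# Attribute records for one cluster group (hypothetical names).
group = ["type=sensor", "feature=temperature", "origin=server-07"]
chain = hash_chain(group)
assert hash_chain(group) == chain          # deterministic
assert hash_chain(group[::-1]) != chain    # order-sensitive
```

  Because each link commits to the whole prefix of the group's attribute records, an index built over the final link can verify group membership without scanning every distributed server.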

Correlation Measure for Big Data (빅데이터에서의 상관성 측도)

  • Jeong, Hai Sung
    • Journal of Applied Reliability / v.18 no.3 / pp.208-212 / 2018
  • Purpose: The three Vs of volume, velocity, and variety are commonly used to characterize different aspects of Big Data: volume refers to the amount of data, variety to the number of types of data, and velocity to the speed of data processing. Given these characteristics, the size of Big Data varies rapidly, some data buckets will contain outliers, and buckets may have different sizes. Correlation plays a large role in Big Data, so we need something better than the usual correlation measures. Methods: The correlation measures offered by traditional statistics are compared, and conditions a measure must satisfy to suit the characteristics of Big Data are suggested. Finally, a correlation measure that satisfies the suggested conditions is recommended. Results: Mutual Information satisfies the suggested conditions. Conclusion: This article builds on traditional correlation measures to analyze the relation between two variables, suggests the conditions a correlation measure must meet for Big Data, and recommends the measure that satisfies them: Mutual Information.
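  For discrete samples, the recommended measure is I(X;Y) = Σ p(x,y) log₂[p(x,y) / (p(x)p(y))]. A minimal stdlib sketch (not from the paper; the plug-in estimate from empirical frequencies is an assumption):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y)/(p(x)p(y)) = (c/n) / ((px/n)*(py/n)) = c*n / (px*py)
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Perfect dependence: I(X;X) = H(X) = 1 bit for a fair binary variable.
assert abs(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]) - 1.0) < 1e-9
# Independence: I(X;Y) = 0.
assert abs(mutual_information([0, 1, 0, 1], [0, 0, 1, 1])) < 1e-9
```

  Unlike Pearson correlation, this quantity is zero only under independence and captures nonlinear association, which is why it fits heterogeneous Big Data buckets.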

Outlier Detection Based on MapReduce for Analyzing Big Data (대용량 데이터 분석을 위한 맵리듀스 기반의 이상치 탐지)

  • Hong, Yejin;Na, Eunhee;Jung, Yonghwan;Kim, Yangwoo
    • Journal of Internet Computing and Services / v.18 no.1 / pp.27-35 / 2017
  • In the near future, IoT data is expected to form a major portion of Big Data, and sensor data in turn a major portion of IoT data; research on it is actively being carried out. However, processed results may not be trustworthy or usable if outlier data is included when sensor data is processed. This paper therefore studies a method for detecting and deleting such outlier data before processing. We use Spark, a memory-based distributed processing environment, for fast processing of large volumes of sensor data. The detection and deletion of outlier data consists of four stages, each implemented with Mapper and Reducer operations. The proposed method is compared across three different processing environments, and the results indicate that outlier detection and deletion performs best in the distributed Spark environment as data volume increases.
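  The Mapper/Reducer shape of such a pipeline can be sketched in plain Python without a cluster. This is a simplification, not the paper's four-stage method: the per-sensor mean ± 1.5σ rule, the function names, and the sample readings are all assumptions for illustration.

```python
from itertools import groupby
from statistics import mean, stdev

def map_phase(records):
    # Mapper: emit (sensor_id, reading) key-value pairs.
    for sensor_id, reading in records:
        yield sensor_id, reading

def reduce_phase(pairs, k=1.5):
    # Reducer: per sensor, drop readings more than k standard deviations
    # from that sensor's mean (a simple outlier rule, assumed here).
    cleaned = {}
    for sensor_id, group in groupby(sorted(pairs), key=lambda p: p[0]):
        values = [v for _, v in group]
        m, s = mean(values), stdev(values)
        cleaned[sensor_id] = [v for v in values if abs(v - m) <= k * s]
    return cleaned

records = [("s1", 20.1), ("s1", 19.8), ("s1", 20.3), ("s1", 99.9), ("s1", 20.0)]
cleaned = reduce_phase(map_phase(records))
assert 99.9 not in cleaned["s1"]   # the spike is detected and deleted
```

  On Spark the same shape would be expressed with `map` and `reduceByKey`/`groupByKey` transformations over an RDD, with each of the paper's four stages as one such pass.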

A Design of DBaaS-Based Collaboration System for Big Data Processing

  • Jung, Yean-Woo;Lee, Jong-Yong;Jung, Kye-Dong
    • International journal of advanced smart convergence / v.5 no.2 / pp.59-65 / 2016
  • With the recent growth of cloud computing, big data processing and collaboration between businesses are emerging as new paradigms in the IT industry. In environments where large amounts of data are generated in real time, such as social networking services, big data processing techniques are useful for extracting valid data; MapReduce is a representative programming model used in big data extraction. As collaboration between companies grows, the integration of old and new information storage systems gives rise to duplication and heterogeneity among data, because the existing databases of the various companies differ. These problems can, however, be mitigated by applying the MapReduce technique. This paper proposes a collaboration system based on Database as a Service (DBaaS) to solve the data integration problems that arise in collaboration between companies. The proposed system reduces the overhead of data integration and can be applied to both structured and unstructured data.

Big Data Smoothing and Outlier Removal for Patent Big Data Analysis

  • Choi, JunHyeog;Jun, Sunghae
    • Journal of the Korea Society of Computer and Information / v.21 no.8 / pp.77-84 / 2016
  • General statistical analysis requires a normality assumption; if it is not satisfied, we cannot expect good results from statistical data analysis. Most statistical methods for handling outliers and noise also depend on this assumption, but it is not satisfied in big data because of its large volume and heterogeneity. We therefore propose a methodology based on box plots and data smoothing for controlling outliers and noise in big data analysis; the proposed methodology does not depend on the normality assumption. We select patent documents as the target domain, because patent big data analysis is an important issue in the management of technology. We analyze patent documents using big data learning methods for technology analysis: patent data collected from patent databases around the world are preprocessed and analyzed by text mining and statistics. Most research on patent big data analysis, however, has not considered the outlier and noise problem, which decreases the accuracy of prediction and increases the variance of parameter estimation. In this paper, we check for the existence of outliers and noise in patent big data using box plots and smoothing visualization. We use patent documents related to three-dimensional printing technology to illustrate how the proposed methodology can detect noise in retrieved patent big data.
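  The box-plot rule the abstract relies on is distribution-free: values outside the Tukey fences [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged as outliers, and a moving average can then smooth residual noise. A minimal sketch with stdlib tools; the sample "patent counts" are made up for illustration, not data from the paper.

```python
from statistics import quantiles

def iqr_fences(data):
    """Tukey box-plot fences; no normality assumption needed."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def smooth(data, window=3):
    """Centered moving average to damp remaining noise."""
    half = window // 2
    return [sum(data[max(0, i - half):i + half + 1]) /
            len(data[max(0, i - half):i + half + 1])
            for i in range(len(data))]

counts = [4, 5, 6, 5, 4, 40, 5, 6, 5, 4]   # hypothetical yearly counts, one spike
lo, hi = iqr_fences(counts)
cleaned = [x for x in counts if lo <= x <= hi]
assert 40 not in cleaned                    # the spike falls outside the fences
smoothed = smooth(cleaned)
```

  Because the fences come from quartiles rather than the mean and standard deviation, a single extreme value like 40 cannot drag the cutoffs toward itself, which is the robustness property the methodology depends on.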

Big Data Platform for Utilizing and Analyzing Real-Time Sensing Information in Industrial Sites (산업현장 실시간 센싱정보 활용/분석을 위한 빅데이터 플랫폼)

  • Lee, Yonghwan;Suh, Jinhyung
    • Journal of Creative Information Culture / v.6 no.1 / pp.15-21 / 2020
  • To utilize big data at general industrial sites, the structured big data collected from the facilities, processes, and environments of those sites must first be processed and stored; unstructured data must either be stored as unstructured data or be converted into structured data and stored in a database. In this paper, we study a method of collecting big data based on open IoT standards that can converge and utilize measurement information and environmental information from industrial sites. The platform proposed in this paper can collect, process, and store big data at industrial sites so as to handle real-time sensing information, and various big data technologies can be applied to process and analyze the stored industrial data according to its purpose.

Design and Implementation of Incremental Learning Technology for Big Data Mining

  • Min, Byung-Won;Oh, Yong-Sun
    • International Journal of Contents / v.15 no.3 / pp.32-38 / 2019
  • We usually encounter difficulties in managing Big Data generated by various digital media and sensors using traditional mining techniques. In addition, when new data are continuously accumulated in a growing volume of text, many problems arise, such as lack of memory and the burden of re-learning, because the entire dataset, including data previously analyzed and collected, is ineffectively re-analyzed. In this paper, we propose a general-purpose classifier and its structure to solve these problems. We depart from current feature-reduction methods and introduce a new scheme that adopts only the changed elements when new features are partially accumulated in this free-style learning environment. The incremental learning module, built through gradually progressive formation, learns only the changed parts of the data without re-processing the current accumulation, whereas traditional methods re-learn all the data whenever data are added or changed. Additionally, users can freely merge new data with previous data through the resource management procedure whenever re-learning is needed. Our analysis confirms that the method performs well in data processing in a Big Data environment because of its learning efficiency. Comparing this algorithm with NB and SVM, all three models achieve an accuracy of approximately 95%. We expect our method to be a viable substitute, with high performance and accuracy, for large computing systems in Big Data analysis using a PC cluster environment.
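  The key property, updating a model with only the new batch instead of re-learning the full accumulation, is easiest to see with a count-based classifier. The sketch below is an illustration, not the paper's system: a multinomial Naive Bayes whose sufficient statistics are counts, so an `update` touches only counts affected by the new documents.

```python
from collections import defaultdict
from math import log

class IncrementalNB:
    """Naive Bayes over word counts; new batches update counts in place."""
    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_counts = defaultdict(int)
        self.class_totals = defaultdict(int)

    def update(self, docs):
        # Incremental step: only counts touched by the new documents change;
        # previously learned data is never re-processed.
        for label, words in docs:
            self.class_counts[label] += 1
            for w in words:
                self.word_counts[label][w] += 1
                self.class_totals[label] += 1

    def predict(self, words):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        best, best_score = None, float("-inf")
        for label in self.class_counts:
            score = log(self.class_counts[label])
            for w in words:
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += log((self.word_counts[label][w] + 1) /
                             (self.class_totals[label] + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

nb = IncrementalNB()
nb.update([("spam", ["buy", "now"]), ("ham", ["meeting", "today"])])
nb.update([("spam", ["buy", "cheap"])])   # second batch: no re-training of old data
assert nb.predict(["buy", "cheap"]) == "spam"
```

  Each `update` costs time proportional to the new batch alone, whereas a batch learner would re-scan the full accumulation on every change, which is exactly the overhead the paper's incremental module avoids.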

A Study on Performance Evaluation of Container-based Virtualization for Real-Time Data Analysis (실시간 데이터 분석을 위한 컨테이너 기반 가상화 성능에 관한 연구)

  • Choi, BoAh;Han, JaeDeok;Oh, DaSom;Park, HyunKook;Kim, HyeonA;Seo, MinKwan;Lee, JongHyuk
    • Proceedings of the Korea Information Processing Society Conference / 2020.05a / pp.32-35 / 2020
  • To examine the effectiveness of container virtualization for real-time data analysis, this paper installs Spark from the HDP and MapR distributions in environments before and after dockerizing, and measures performance with the HiBench benchmark suite. A paired-sample t-test on the performance measurements is then used to determine statistically whether there is a performance difference between the pre- and post-dockerizing environments. The analysis shows that HDP exhibits a performance difference before and after dockerizing, whereas MapR does not.
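  The paired-sample t-test used in this comparison pairs each workload's runtime before and after dockerizing and tests whether the mean difference is zero. A minimal stdlib sketch; in practice `scipy.stats.ttest_rel` would be used, and the runtimes below are hypothetical, not measurements from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = before - after."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical HiBench runtimes (seconds) for the same five workloads,
# measured before and after dockerizing.
before = [120.4, 98.7, 143.2, 110.5, 99.1]
after  = [118.9, 97.5, 141.0, 109.8, 98.2]
t = paired_t_statistic(before, after)
# With n - 1 = 4 degrees of freedom the two-sided 5% critical value is
# about 2.776; |t| above that indicates a significant difference.
assert t > 2.776
```

  Pairing by workload removes between-workload variance from the comparison, which is why a paired test is appropriate here rather than an independent two-sample test.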