• Title/Summary/Keyword: Sorting algorithms

External Merge Sorting in Tajo with Variable Server Configuration (매개변수 환경설정에 따른 타조의 외부합병정렬 성능 연구)

  • Lee, Jongbaeg;Kang, Woon-hak;Lee, Sang-won
    • Journal of KIISE / v.43 no.7 / pp.820-826 / 2016
  • There is a growing need for big data processing that extracts valuable information from large amounts of data. The Hadoop system employs the MapReduce framework to process big data, but MapReduce has limitations such as inflexibility and slow data processing. To overcome these drawbacks, SQL query processing techniques known as SQL-on-Hadoop were developed. Apache Tajo, one of the SQL-on-Hadoop systems, was developed by a Korean development group. External merge sort is one of the most heavily used algorithms in Tajo for query processing, and its performance is influenced by two parameters: sort buffer size and fanout. In this paper, we analyze the performance of external merge sort in Tajo with various sort buffer sizes and fanouts. In addition, we identify two major causes of differences in its performance: CPU cache misses, which increase as the sort buffer size grows, and the number of merge passes, which is determined by the fanout.
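
The interplay of the two parameters is easiest to see in the standard external merge sort structure: the sort buffer size bounds how many records each initial sorted run holds, and the fanout bounds how many runs a single merge pass combines, which in turn fixes the number of passes. A minimal sketch of that structure (plain Python with illustrative parameter names; not Tajo's actual implementation):

```python
import heapq

def external_merge_sort(records, sort_buffer_size, fanout):
    """Sort an iterable of records with bounded memory.

    sort_buffer_size: max records held in memory per initial run.
    fanout: max number of runs merged in one pass.
    (Illustrative sketch; Tajo's real implementation manages raw bytes.)
    """
    # Phase 1: produce sorted runs, each at most sort_buffer_size records.
    runs, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) == sort_buffer_size:
            runs.append(sorted(buffer))
            buffer = []
    if buffer:
        runs.append(sorted(buffer))

    # Phase 2: repeatedly merge up to `fanout` runs until one remains.
    # Fewer, larger merges (a higher fanout) mean fewer passes over the data.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fanout]))
                for i in range(0, len(runs), fanout)]
    return runs[0] if runs else []
```

A smaller fanout means more merge passes over the data, while a larger sort buffer means sorting bigger in-memory arrays, which is where the cache-miss effect the authors measure comes from.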

Analyzing Problem Instance Space Based on Difficulty-distance Correlation (난이도-거리 상관관계 기반의 문제 인스턴스 공간 분석)

  • Jeon, So-Yeong;Kim, Yong-Hyuk
    • Journal of the Korean Institute of Intelligent Systems / v.22 no.4 / pp.414-424 / 2012
  • Finding or automatically generating problem instances is useful for algorithm analysis and testing. The topic has been of interest in the fields of hardware/software engineering and the theory of computation. We apply objective-value-distance correlation analysis to problem instance spaces, as previous researchers have applied it to solution spaces. Depending on the problem, we define the objective function as (1) the execution time of the tested algorithm or (2) its optimality; this is interpreted as the difficulty of the problem instance being solved. Our correlation analysis considers the following aspects: (1) how the correlation changes when we use different algorithms or different distance functions for the same problem, (2) how it changes when we improve the tested algorithm, and (3) the relation between a problem instance space and the solution space of the same problem. Our research demonstrates a way of analyzing problem instance spaces and, as initial work, should accelerate such analysis.
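
The analysis described resembles fitness-distance correlation computed over problem instances rather than solutions: each instance receives a difficulty score (for example, the tested algorithm's execution time) and a distance to a reference instance, and the two series are correlated. A hedged sketch of that computation (the function and distance names are placeholders, not the paper's code):

```python
import time
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def difficulty_distance_correlation(instances, solve, distance, reference):
    """Correlate instance difficulty (runtime of `solve`) with distance
    from a reference instance. All argument names are placeholders."""
    difficulties, distances = [], []
    for inst in instances:
        start = time.perf_counter()
        solve(inst)                       # difficulty = execution time
        difficulties.append(time.perf_counter() - start)
        distances.append(distance(inst, reference))
    return pearson(difficulties, distances)
```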

Sort-Based Distributed Parallel Data Cube Computation Algorithm using MapReduce (맵리듀스를 이용한 정렬 기반의 데이터 큐브 분산 병렬 계산 알고리즘)

  • Lee, Suan;Kim, Jinho
    • Journal of the Institute of Electronics and Information Engineers / v.49 no.9 / pp.196-204 / 2012
  • Recently, many applications perform OLAP (On-Line Analytical Processing) over very large volumes of data. The multidimensional data cube is regarded as a core tool in OLAP analysis. This paper focuses on how to efficiently compute data cubes in parallel using MapReduce, a popular parallel processing tool. We investigate efficient ways to implement PipeSort, a well-known data cube computation method, on the MapReduce framework. PipeSort computes several (descendant) cuboids that share the same sorting order as a pipeline, scanning one (ancestor) cuboid only once. This paper proposes four ways of implementing the PipeSort pipeline on a MapReduce framework running across 20 servers. Our experiments show that the PipeMap-NoReduce algorithm outperforms the others for high-dimensional data, whereas Post-Pipe stands out for low-dimensional data.
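
The property PipeSort exploits is that once the data is sorted by, say, (A, B, C), the cuboids ABC, AB, A, and ALL are all prefix group-bys of that one order, so a single scan can aggregate all of them at once. A minimal single-machine sketch of one such pipeline step (illustrative only; the paper's contribution is mapping this pipeline onto MapReduce):

```python
from collections import defaultdict

def pipe_scan(rows, dims, measure_idx):
    """One PipeSort pipeline step: rows sorted by `dims` yield every
    prefix cuboid (e.g., ABC, AB, A, ALL) in a single scan."""
    rows = sorted(rows, key=lambda r: tuple(r[d] for d in dims))  # one sort
    cuboids = [defaultdict(float) for _ in range(len(dims) + 1)]
    for row in rows:
        key = tuple(row[d] for d in dims)
        for k in range(len(dims) + 1):   # every prefix of the sort order
            cuboids[k][key[:k]] += row[measure_idx]
    return cuboids  # cuboids[0] is ALL; cuboids[len(dims)] is the full cuboid

# Example: rows of (region, product, month, sales)
rows = [("E", "pen", "Jan", 3.0), ("E", "pen", "Feb", 2.0), ("W", "ink", "Jan", 5.0)]
print(dict(pipe_scan(rows, dims=(0, 1, 2), measure_idx=3)[1]))  # per-region totals
```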

A study on the subset averaged median methods for gaussian noise reduction (가우시안 잡음 제거를 위한 부분 집합 평균 메디안 방법에 관한 연구)

  • 이용환;박장춘
    • Journal of the Korea Society of Computer and Information / v.4 no.2 / pp.120-134 / 1999
  • Image processing consists of image acquisition, preprocessing, region segmentation, and recognition, and images are easily corrupted by noise during data transmission, capture, and processing. Impulse noise and Gaussian noise are the major types of noise that occur during these steps. Many spatial noise reduction filters have been proposed, such as the mean filter, median filter, weighted median filter, Cheikh filter, and Kyu-cheol Lee filter. Much of this research has focused on reducing impulse noise, while the reduction of Gaussian noise has been comparatively neglected. To reduce Gaussian noise, a subset averaged median filter is proposed, which uses the median and the subset-average information of the pixels in a window. The window size is $3{\times}3$ pixels, and the window is divided into four subsets of four pixels each. First, the average value of each subset is calculated; the median is then found by sorting the four averages together with the center pixel's value. The paper shows that this achieves better reduction of Gaussian noise. The proposed algorithms were implemented in ANSI C on a Sun Ultra 2 for testing, and the filter's effects at various noise levels and on various images were evaluated by comparing its PSNR, MSE, and RMSE values with those of existing filtering methods.
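
From this description, each $3{\times}3$ window is split into four 4-pixel subsets, each subset is averaged, and the output pixel is the median of the four averages and the center pixel. A sketch under the assumption that the four subsets are the four overlapping $2{\times}2$ quadrants of the window (the abstract does not spell out the exact layout):

```python
def subset_averaged_median(image):
    """Apply a subset-averaged median filter to a 2-D grayscale image
    (list of lists). Subset layout is an assumption: the four 2x2
    quadrants of each 3x3 window, each sharing the center pixel."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels copied unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = [[image[y + dy][x + dx] for dx in (-1, 0, 1)]
                   for dy in (-1, 0, 1)]
            quads = [  # four 2x2 subsets, each containing the center win[1][1]
                (win[0][0], win[0][1], win[1][0], win[1][1]),
                (win[0][1], win[0][2], win[1][1], win[1][2]),
                (win[1][0], win[1][1], win[2][0], win[2][1]),
                (win[1][1], win[1][2], win[2][1], win[2][2]),
            ]
            averages = [sum(q) / 4.0 for q in quads]
            vals = sorted(averages + [win[1][1]])
            out[y][x] = vals[2]  # median of the five values
    return out
```

Averaging within each subset suppresses zero-mean Gaussian noise, while taking a median across the subsets keeps edges from being smeared the way a plain mean filter would.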

Artificial Intelligence Algorithms, Model-Based Social Data Collection and Content Exploration (소셜데이터 분석 및 인공지능 알고리즘 기반 범죄 수사 기법 연구)

  • An, Dong-Uk;Leem, Choon Seong
    • The Journal of Bigdata / v.4 no.2 / pp.23-34 / 2019
  • Recently, crime that exploits digital platforms has been continuously increasing: about 140,000 cases occurred in 2015 and about 150,000 in 2016. There are therefore limits to handling such online crimes with traditional investigation techniques. The manual online searches and cognitive investigation methods broadly used by investigators today are not enough to proactively cope with rapidly changing crimes. In addition, the fact that content is posted to unspecified users of social media makes investigations more difficult. Considering the characteristics of the online media where infringement crimes occur, this study suggests site-based collection and Open API collection as web content collection methods. Since illegal content is published and deleted quickly, and new words and variants appear rapidly and in many forms, it is difficult to recognize such content with manually registered, dictionary-based morphological analysis. To solve this problem, we propose complementing dictionary-based morphological analysis with WPM (Word Piece Model) tokenization, a data preprocessing method for quickly recognizing and responding to illegal content posted in online infringement crimes. For data analysis, optimal precision is verified through a vote-based ensemble method using supervised classification learning models for the investigation of illegal content. This study applies a classification model centered on illegal multi-level marketing cases to proactively recognize crimes that infringe on the public economy, and presents an empirical study on effective social data collection and content investigation.
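
The WPM preprocessing step rests on greedy longest-match-first subword lookup: an out-of-vocabulary word (such as newly coined slang in illegal posts) is split into known subword pieces rather than failing a dictionary lookup outright. A minimal sketch of that style of tokenizer, with a toy vocabulary rather than the paper's trained model:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece-style tokenization.
    Splits an unseen word into known subword units; '##' marks a
    piece that continues a word, following the WordPiece convention."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1            # shrink the candidate piece and retry
        if end == start:        # no known piece fits: unknown token
            return ["[UNK]"]
        start = end
    return tokens

# Usage sketch with a toy vocabulary (multi-level-marketing-related pieces)
vocab = {"다단", "##계", "##판매"}
print(wordpiece_tokenize("다단계판매", vocab))  # ['다단', '##계', '##판매']
```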

Face Detection Using Adaboost and Template Matching of Depth Map based Block Rank Patterns (Adaboost와 깊이 맵 기반의 블록 순위 패턴의 템플릿 매칭을 이용한 얼굴검출)

  • Kim, Young-Gon;Park, Rae-Hong;Mun, Seong-Su
    • Journal of Broadcast Engineering / v.17 no.3 / pp.437-446 / 2012
  • Face detection algorithms using two-dimensional (2-D) intensity or color images have been studied for decades. Recently, with the development of low-cost range sensors, three-dimensional (3-D) information (i.e., a depth image representing the distance between the camera and objects) can easily be used to reliably extract facial features, since most people share a similar 3-D facial structure. This paper proposes a face detection method using intensity and depth images. First, an Adaboost algorithm applied to the intensity image classifies candidate regions as face or nonface. Each candidate region is then divided into $5{\times}5$ blocks, and the depth values within each block are averaged. A $5{\times}5$ block rank pattern is constructed by sorting the block averages of the depth values. Finally, candidate regions are classified as face or nonface by matching the constructed depth-map-based block rank pattern against a template pattern generated from a training data set. For template matching, the $5{\times}5$ template block rank pattern is constructed in advance by averaging the block ranks over the training data set. The proposed algorithm is tested on real images obtained with a Kinect range sensor. Experimental results show that it effectively eliminates most false positives while preserving true positives.
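
The block rank pattern itself is simple to construct: average the depth values inside each of the $5{\times}5$ blocks, then replace each block average with its rank among the 25 averages. A sketch in plain Python (it assumes the candidate region has already been cropped so that its sides divide evenly into blocks):

```python
def block_rank_pattern(depth, n=5):
    """Build an n x n block rank pattern from a depth-image region
    (2-D list). Each block's mean depth is replaced by its rank
    among all n*n block means (0 = smallest mean, i.e., nearest)."""
    h, w = len(depth), len(depth[0])
    bh, bw = h // n, w // n
    means = []
    for by in range(n):
        for bx in range(n):
            block = [depth[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            means.append(sum(block) / len(block))
    order = sorted(range(n * n), key=lambda i: means[i])
    ranks = [0] * (n * n)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return [ranks[r * n:(r + 1) * n] for r in range(n)]
```

Matching could then be, for instance, a sum of absolute rank differences against the template pattern; the abstract does not name the exact distance measure used.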

Design of Pattern Classifier for Electrical and Electronic Waste Plastic Devices Using LIBS Spectrometer (LIBS 분광기를 이용한 폐소형가전 플라스틱 패턴 분류기의 설계)

  • Park, Sang-Beom;Bae, Jong-Soo;Oh, Sung-Kwun;Kim, Hyun-Ki
    • Journal of the Korean Institute of Intelligent Systems / v.26 no.6 / pp.477-484 / 2016
  • Small home appliances such as fans, audio equipment, and electric rice cookers mostly consist of ABS, PP, and PS materials. Colored plastics can be classified by near-infrared (NIR) spectroscopy, but black plastics are very difficult to classify because the black material absorbs the light. Therefore, an RBFNNs pattern classifier is introduced for sorting electrical and electronic waste plastics using a LIBS (Laser-Induced Breakdown Spectroscopy) spectrometer. In the preprocessing part, PCA (Principal Component Analysis), a dimensionality reduction algorithm, is used to improve processing speed and to extract the effective characteristics of the data. In the condition part, FCM (Fuzzy C-Means) clustering is exploited; in the conclusion part, the coefficients of polynomial-type linear functions are used as connection weights. PSO and 5-fold cross-validation are used to improve the reliability of the performance as well as to enhance the classification rate. The performance of the proposed classifier is reported both with and without optimization.
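
Of the pipeline stages, the PCA preprocessing step is the most generic: project each high-dimensional LIBS spectrum onto its leading principal components before classification. A hedged NumPy sketch of that step (the component count is an illustrative choice, not the paper's tuned value):

```python
import numpy as np

def pca_reduce(spectra, n_components=10):
    """Reduce LIBS spectra (rows = samples, cols = wavelength bins)
    to their leading principal components. n_components is an
    illustrative choice, not the paper's tuned setting."""
    X = np.asarray(spectra, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data yields principal directions in Vt.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]       # top principal directions
    return X_centered @ components.T     # projected low-dim features
```

The reduced features would then feed the FCM-partitioned RBFNN rules described above, with PSO tuning the remaining hyperparameters.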