• Title/Summary/Keyword: datasets

Performance of Distributed Database System built on Multicore Systems

  • Kim, Kangseok
    • Journal of Internet Computing and Services / v.18 no.6 / pp.47-53 / 2017
  • Huge datasets are now being generated rapidly in a variety of fields, creating an urgent need for technologies that can process them efficiently and effectively. Partitioning a huge dataset effectively and reducing the processing overhead of the partitioned data have therefore become critical factors for scalability and performance in distributed database systems. In our work, we used multicore servers to provide scalable service in our distributed system. Partitioning databases over multicore servers emerged from the need for a new architectural design of distributed database systems, driven by scalability and performance concerns in today's data deluge. The system allows uniform access through a web service interface to databases distributed concurrently over multicore servers, using an SQMD (Single Query Multiple Database) mechanism based on the publish/subscribe paradigm. We present performance results for the distributed database system built on multicore servers, covering processing that is time-intensive with traditional architectures. We also discuss future work.

A Hierarchical Clustering Algorithm Using Extended Sequence Element-based Similarity Measure

  • Oh, Seung-Joon
    • Journal of the Korea Society of Computer and Information / v.11 no.5 s.43 / pp.321-327 / 2006
  • Recently there has been enormous growth in the amount of commercial and scientific data. Such datasets consist of sequence data that have an inherent sequential nature. However, only a few of the existing clustering algorithms consider sequentiality. This study presents a similarity measure and a method for clustering such sequence datasets. In particular, we present an extended similarity measure that takes various conditions into account. Using a splice dataset, we show that the quality of the clusters generated by our proposed clustering algorithm is better than that of clusters produced by traditional clustering algorithms.

Three-dimensional Shape Recovery from Image Focus Using Polynomial Regression Analysis in Optical Microscopy

  • Lee, Sung-An;Lee, Byung-Geun
    • Current Optics and Photonics / v.4 no.5 / pp.411-420 / 2020
  • Non-contact three-dimensional (3D) measuring technology is used to identify defects in miniature products, such as optics, polymers, and semiconductors. Hence, this technology has garnered significant attention in computer vision research. In this paper, we focus on shape from focus (SFF), which is a passive optical method for 3D shape recovery. In existing SFF techniques that use interpolation, all datasets of the focus volume are approximated using one model. However, these methods cannot demonstrate how well a predefined model fits all image points of an object. Moreover, it is not reasonable to explain datasets of various shapes using one model. Furthermore, if noise is present in the dataset, an error will be generated. Therefore, we propose an algorithm based on polynomial regression analysis to address these disadvantages. Our experimental results indicate that the proposed method is more accurate than existing methods.
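
As a rough illustration of regression-based depth estimation in SFF, the sketch below fits a second-order polynomial by least squares to the focus values of one pixel across the image stack and takes the vertex of the fitted curve as the depth estimate. The polynomial order, the function names, and the sample values are assumptions made for illustration; the abstract does not specify the authors' actual regression model.

```python
def fit_quadratic(zs, fs):
    """Least-squares fit of f(z) = a*z^2 + b*z + c; returns (a, b, c)."""
    # Normal equations (A^T A) x = A^T f for the design columns [z^2, z, 1].
    s = [sum(z ** k for z in zs) for k in range(5)]            # power sums of z
    t = [sum(f * z ** k for z, f in zip(zs, fs)) for k in range(3)]
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def depth_from_focus(zs, fs):
    """Estimate the best-focus depth as the vertex of the fitted parabola."""
    a, b, _ = fit_quadratic(zs, fs)
    if a >= 0:
        # No concave peak: fall back to the sharpest sampled frame.
        return zs[max(range(len(fs)), key=fs.__getitem__)]
    return -b / (2 * a)


# Hypothetical focus-measure values for one pixel at five lens positions.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
focus_values = [1.0, 3.2, 4.1, 3.0, 0.9]
print(depth_from_focus(positions, focus_values))
```

Because the fit uses every sample rather than only the frames around the maximum, a single noisy frame shifts the estimate less than it would with three-point interpolation, which is the kind of robustness the abstract is concerned with.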

Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

  • Kang, In-Su
    • Journal of the Korea Society of Computer and Information / v.23 no.5 / pp.1-7 / 2018
  • Reference string recognition extracts individual reference strings from the reference section of an academic article, which consists of a sequence of reference lines. This task has been addressed by heuristic-based, clustering-based, and classification-based approaches that exploit lexical and layout characteristics of reference lines. Most classification-based methods have used sequence labeling to assign labels either to a sequence of tokens within reference lines or to a sequence of reference lines. Unlike the previous token-level sequence labeling approach, this study assigns different labels to the beginning, intermediate, and terminating tokens of a reference string. Post-processing is then applied to identify reference strings by predicting their beginning and/or terminating tokens. Experimental evaluation using English and German reference string recognition datasets shows that the proposed method achieves a macro-averaged F1 above 94%.
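
The post-processing idea, recovering reference strings from beginning and/or terminating tokens, can be sketched as follows. The label names `"B"`, `"I"`, and `"E"`, the function name, and the sample tokens are hypothetical; the abstract does not describe the authors' tag set or rules in detail.

```python
def group_reference_strings(tokens, labels):
    """Group tokens into reference strings from per-token B/I/E labels.

    A string is closed at every "E" token and also just before every "B"
    token, so a missed "B" or a missed "E" can be recovered from the other.
    """
    references, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B" and current:
            # A new beginning closes whatever was still open.
            references.append(" ".join(current))
            current = []
        current.append(token)
        if label == "E":
            references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references


tokens = ["Kim", "2017", "JICS", "Oh", "2006", "JKSCI"]
labels = ["B", "I", "E", "I", "I", "E"]  # the second "B" was mispredicted as "I"
print(group_reference_strings(tokens, labels))
```

Using both boundary labels gives the segmentation two chances to find each boundary, which is one plausible reason to tag beginnings and endings separately rather than beginnings alone.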

Search for galaxy clusters in SA22

  • Kim, Jae-Woo;Im, Myungshin;Hyun, Minhee
    • The Bulletin of The Korean Astronomical Society / v.37 no.2 / pp.83.1-83.1 / 2012
  • Galaxy clusters are good laboratories for testing cosmological models as well as the evolution of galaxies in dense regions. However, the lack of wide and deep near-IR datasets has prevented the identification of galaxy clusters at z>1. Here we merge the wide, deep near-IR datasets of UKIDSS DXS (J and K bands) and IMS (J band) with the CFHT Legacy Survey (CFHTLS) ugriz catalogue to detect galaxy clusters. We identify candidate galaxy clusters at z>0.8, where the near-IR dataset plays an important role in detecting galaxies efficiently. The cluster mass is also estimated based on the cluster richness and a semi-analytical cosmological simulation.

Pruning the Boosting Ensemble of Decision Trees

  • Yoon, Young-Joo;Song, Moon-Sup
    • Communications for Statistical Applications and Methods / v.13 no.2 / pp.449-466 / 2006
  • We propose using variable selection methods based on penalized regression to prune decision tree ensembles. Pruning methods based on LASSO and SCAD are compared with the cluster pruning method. Comparative studies are performed on artificial and real datasets. According to the results, the proposed methods based on penalized regression reduce the size of boosting ensembles without significantly decreasing accuracy and perform better than the cluster pruning method. With respect to classification noise, the proposed pruning methods can mitigate the weakness of AdaBoost to some degree.

An Improved Semi-Empirical Model for Radar Backscattering from Rough Sea Surfaces at X-Band

  • Jin, Taekyeong;Oh, Yisok
    • Journal of electromagnetic engineering and science / v.18 no.2 / pp.136-140 / 2018
  • We propose an improved semi-empirical scattering model for X-band radar backscattering from rough sea surfaces. This new model is valid over a wider range of wind speeds than the existing semi-empirical sea spectrum (SESS) model. First, we retrieved the small-roughness parameters from the sea surfaces, which were numerically generated using the Pierson-Moskowitz spectrum and measurement datasets for various wind speeds. Then, we computed the backscattering coefficients of the small-roughness surfaces for various wind speeds using the integral equation method model. Finally, the large-roughness characteristics were taken into account by integrating the small-roughness backscattering coefficients, weighted by the surface slope probability density function, over all possible surface slopes. The new model covers wind speeds below 3.46 m/s, a range that was not covered by the existing SESS model. The accuracy of the new model was verified with two measurement datasets for various wind speeds from 0.5 m/s to 14 m/s.
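
For readers unfamiliar with the Pierson-Moskowitz spectrum mentioned above, the sketch below evaluates its standard textbook form, S(ω) = α g² ω⁻⁵ exp(−β (g / (U ω))⁴), with the commonly quoted constants α = 8.1e-3 and β = 0.74 and U taken as the wind speed at 19.5 m height. The function names are hypothetical, and the abstract does not say how the authors parameterized the spectrum or generated the surfaces from it.

```python
import math

G = 9.81        # gravitational acceleration, m/s^2
ALPHA = 8.1e-3  # Phillips constant in the Pierson-Moskowitz form
BETA = 0.74     # dimensionless decay constant


def pm_spectrum(omega, wind_speed):
    """Pierson-Moskowitz spectral density at angular frequency omega (rad/s).

    wind_speed is the wind speed in m/s at 19.5 m above the sea surface.
    """
    omega0 = G / wind_speed
    return ALPHA * G ** 2 / omega ** 5 * math.exp(-BETA * (omega0 / omega) ** 4)


def pm_peak_frequency(wind_speed):
    """Angular frequency of the spectral peak, from setting dS/d(omega) = 0."""
    return (4 * BETA / 5) ** 0.25 * G / wind_speed


# The peak moves to higher frequencies (shorter waves) as the wind drops.
for u in (3.0, 7.0, 14.0):
    print(u, pm_peak_frequency(u), pm_spectrum(pm_peak_frequency(u), u))
```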

Enhanced Markov-Difference Based Power Consumption Prediction for Smart Grids

  • Le, Yiwen;He, Jinghan
    • Journal of Electrical Engineering and Technology / v.12 no.3 / pp.1053-1063 / 2017
  • Power prediction is critical for improving power efficiency in Smart Grids. Markov chains provide a useful tool for power prediction. Through careful investigation of practical power datasets, we find that their stochastic properties do not follow the Markov features. This mismatch degrades prediction accuracy when Markov prediction methods are applied directly. In this paper, we propose a spatial-transform-based data processing method to alleviate this inconsistency. Furthermore, we propose an enhanced power prediction method, named Spatial Mapping Markov-Difference (SMMD), to guarantee prediction accuracy. In particular, SMMD applies a second prediction adjustment based on the differential data to reduce the stochastic error. Experimental results confirm that the proposed SMMD improves prediction accuracy relative to state-of-the-art solutions.
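
For context, the plain Markov-chain prediction that the abstract takes as its starting point can be sketched as below. This is only the baseline, not SMMD: the spatial transform and the difference-based adjustment are not described in enough detail to reproduce. The function names, the discretization step, and the sample readings are assumptions.

```python
from collections import Counter, defaultdict


def discretize(values, step):
    """Map raw power readings to integer load levels of width `step`."""
    return [int(v // step) for v in values]


def fit_transitions(states):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        s: {n: c / sum(row.values()) for n, c in row.items()}
        for s, row in counts.items()
    }


def predict_next(transitions, current):
    """Return the most probable next state, or the current one if unseen."""
    row = transitions.get(current)
    if not row:
        return current
    return max(row, key=row.get)


# Hypothetical hourly readings in kW, bucketed into 1 kW levels.
readings = [1.2, 2.5, 1.1, 2.7, 1.4, 2.2]
levels = discretize(readings, 1.0)
model = fit_transitions(levels)
print(predict_next(model, levels[-1]))
```

The baseline assumes the next level depends only on the current one, which is exactly the property the abstract reports as violated by practical power data.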

ModifiedFAST: A New Optimal Feature Subset Selection Algorithm

  • Nagpal, Arpita;Gaur, Deepti
    • Journal of information and communication convergence engineering / v.13 no.2 / pp.113-122 / 2015
  • Feature subset selection is a pre-processing step in learning algorithms. In this paper, we propose an efficient algorithm, ModifiedFAST, for feature subset selection. This algorithm is suitable for text datasets and uses the concept of information gain to remove irrelevant and redundant features. A new optimal value of the threshold for symmetric uncertainty, used to identify relevant features, is found. The thresholds used by previous feature selection algorithms such as FAST, Relief, and CFS were not optimal. It has been proven that the threshold value greatly affects the percentage of selected features and the classification accuracy. A new unified performance metric that combines accuracy and the number of selected features is proposed and applied in the algorithm. Experiments showed that, on most of the datasets, the proposed algorithm selected a lower percentage of features than existing algorithms. The effectiveness of our algorithm at the optimal threshold was statistically validated against other algorithms.
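
Symmetric uncertainty, the relevance score that the threshold above applies to, has the standard definition SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)). The sketch below computes it for discrete features and filters by a threshold. The function names and the threshold value in the example are hypothetical; the abstract does not state the optimal threshold the authors found, and this is not the ModifiedFAST algorithm itself.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    info_gain = hx + hy - entropy(list(zip(x, y)))  # mutual information
    return 2 * info_gain / (hx + hy)


def relevant_features(features, target, threshold):
    """Keep the names of features whose SU with the target exceeds threshold."""
    return [
        name for name, column in features.items()
        if symmetric_uncertainty(column, target) > threshold
    ]


# Toy data: "a" duplicates the class label, "b" is independent of it.
features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
target = [0, 0, 1, 1]
print(relevant_features(features, target, threshold=0.5))
```

Raising the threshold discards more features, which is why its value affects both the percentage of selected features and the resulting classification accuracy.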

Selectivity Estimation for Spatial Databases

  • Chi, Jeong-Hee;Lee, Jin-Yul;Ryu, Keun-Ho
    • Proceedings of the KSRS Conference / 2003.11a / pp.766-768 / 2003
  • Selectivity estimation for spatial queries is crucial in Spatial Database Management Systems (SDBMS). Much work has been done on estimating selectivity accurately. Although existing methods deal with problems such as false-count and multi-count, which arise from the properties of spatial datasets, they cannot achieve these effects in a small memory space. Therefore, spatial datasets need to be compressed into a small amount of memory. In this paper, we propose a new technique called MW Histogram, which is able to compress summary data and still produce reasonable results. Our method is based on two techniques: (a) the MinSkew partitioning algorithm, which deals efficiently with skewed spatial datasets, and (b) the wavelet transformation, whose compression effect is proven. We evaluate our method on real datasets. The experimental results show that the MW Histogram provides estimates with low relative error and retains similar estimates even when memory space is small.
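
The wavelet step can be illustrated with a small sketch that compresses a list of histogram bucket counts by keeping only the largest wavelet coefficients. The choice of the unnormalized Haar wavelet, the function names, and the bucket counts are assumptions; the abstract does not say which wavelet the MW Histogram uses or how it is combined with MinSkew partitioning.

```python
def haar_decompose(values):
    """Unnormalized Haar transform of a list whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    coeffs, current = [], list(values)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        current = [(a + b) / 2 for a, b in pairs]
        coeffs = details + coeffs  # coarser details go in front
    return current + coeffs


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current, pos = coeffs[:1], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        current = [v for avg, d in zip(current, details) for v in (avg + d, avg - d)]
        pos += len(details)
    return current


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients (lossy compression)."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


# Hypothetical bucket counts from a spatial histogram.
buckets = [4.0, 6.0, 10.0, 12.0]
compressed = keep_largest(haar_decompose(buckets), k=2)
print(haar_reconstruct(compressed))
```

Storing only the retained coefficients and their positions takes less memory than the full histogram, at the cost of approximate bucket counts.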
